maxsegan/movenet-332
Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/maxsegan/movenet-332
Dataset description:
---
language:
- en
license: cc-by-4.0
task_categories:
- robotics
tags:
- pose-estimation
- 3d-pose
- joint-angles
- humanoid
- kinetics-700
- VLA
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
  - split: val
    path: "val-*.parquet"
---
# MoveNet-332: 3D Human Pose & Joint Angles from Kinetics-700
A large-scale dataset of 331,656 video clips from Kinetics-700 processed through a 2D-to-3D pose estimation pipeline (ViTPose + MotionAGFormer). Each clip includes 3D skeleton positions (H36M 17-joint format), 2D keypoints, pre-computed 22-DoF joint angles, and VLM-generated imperative action instructions.
Designed for training Vision-Language-Action (VLA) models for humanoid robots, following approaches similar to GR00T N1 and Helix.
## Key Features
- **331,656 clips** across 700+ action classes from Kinetics-700
- **3D poses** in H36M 17-joint format (typically 300 frames per clip at ~10 fps)
- **22-DoF joint angles** pre-computed via inverse kinematics
- **VLM-generated instructions** — imperative captions describing the movements
- **Quality metadata** — tracking confidence, hard cut detection, quality scores
## Important: Videos Not Included
This dataset contains **pose data and metadata only**. Videos must be obtained separately from the [Kinetics-700 dataset](https://github.com/cvdfoundation/kinetics-dataset). The `youtube_id`, `time_start`, and `time_end` columns allow matching clips back to their source videos.
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `clip_id` | string | NPZ filename stem (e.g., `a4UOhPiV4QE_000675_000685`) |
| `action_class` | string | Kinetics-700 action class name |
| `youtube_id` | string | YouTube video ID |
| `time_start` | string | Start timestamp from clip ID |
| `time_end` | string | End timestamp from clip ID |
| `split` | string | `"train"` or `"val"` (98/2 hash-based split) |
| `instruction` | string | VLM-generated imperative action caption |
| `fps` | float32 | Original video FPS |
| `num_pose_frames` | int32 | Number of pose frames (typically 300) |
| `video_width` | int32 | Video width in pixels |
| `video_height` | int32 | Video height in pixels |
| `pose3d` | bytes | zlib-compressed float32 `[F, 17, 3]` |
| `keypoints2d` | bytes | zlib-compressed float32 `[F, 17, 2]` |
| `scores2d` | bytes | zlib-compressed float32 `[F, 17]` |
| `bboxes` | bytes | zlib-compressed float32 `[F, 4]` |
| `joint_angles` | bytes | zlib-compressed float32 `[F, 22]` |
| `frame_indices` | bytes | zlib-compressed int32 `[F]` |
| `tracking_confidence` | bytes | zlib-compressed float32 `[F]` |
| `has_hard_cuts` | bool | Whether hard cuts were detected |
| `quality` | float32 | Pose quality score (0–1) |
Array columns are stored as zlib-compressed raw numpy bytes (variable-length across clips due to different frame counts).
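The `split` column is described above as a 98/2 hash-based split, but the hash function itself is not documented in this card. The sketch below only illustrates the general idea of a deterministic, hash-based assignment. The function name `assign_split`, the use of MD5, and the bucket rule are all assumptions, so it will not necessarily reproduce the published `split` values.

```python
import hashlib


def assign_split(clip_id: str, val_percent: int = 2) -> str:
    """Deterministically map a clip ID to "train" or "val".

    Hypothetical scheme for illustration: the dataset's real hash
    function and bucket rule are not documented in this card.
    """
    digest = hashlib.md5(clip_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "val" if bucket < val_percent else "train"
```

Because the assignment depends only on `clip_id`, re-running the pipeline keeps every clip in the same split. For real experiments, use the `split` column shipped with the data.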
## Loading the Dataset
### Basic usage with Hugging Face `datasets`
```python
from datasets import load_dataset
ds = load_dataset("maxsegan/movenet-332", split="train")
print(ds[0]["clip_id"], ds[0]["action_class"], ds[0]["instruction"])
```
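The `has_hard_cuts` and `quality` columns from the schema can be used to drop low-quality clips before training. The following is a minimal sketch: the helper name `is_clean` and the `0.5` threshold are assumptions chosen for illustration, not values recommended by the dataset authors.

```python
def is_clean(row: dict, min_quality: float = 0.5) -> bool:
    """Keep clips with no detected hard cuts and a sufficient quality score.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    return (not row["has_hard_cuts"]) and row["quality"] >= min_quality


# Usage with a split loaded through the `datasets` library:
# clean = ds.filter(is_clean)
```

The predicate works on plain row dictionaries, so it can also be applied to rows from a pandas or PyArrow iteration.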
### Deserializing array columns
```python
import zlib
import numpy as np
row = ds[0]
F = row["num_pose_frames"]
# 3D poses: [F, 17, 3]
pose3d = np.frombuffer(zlib.decompress(row["pose3d"]), dtype=np.float32).reshape(F, 17, 3)
# Joint angles: [F, 22]
joint_angles = np.frombuffer(zlib.decompress(row["joint_angles"]), dtype=np.float32).reshape(F, 22)
# 2D keypoints: [F, 17, 2]
kp2d = np.frombuffer(zlib.decompress(row["keypoints2d"]), dtype=np.float32).reshape(F, 17, 2)
# Bounding boxes: [F, 4] as [x1, y1, x2, y2]
bboxes = np.frombuffer(zlib.decompress(row["bboxes"]), dtype=np.float32).reshape(F, 4)
# Frame indices (maps pose frames to video frames): [F]
frame_indices = np.frombuffer(zlib.decompress(row["frame_indices"]), dtype=np.int32)
# Tracking confidence: [F]
confidence = np.frombuffer(zlib.decompress(row["tracking_confidence"]), dtype=np.float32)
```
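The per-column calls above can be folded into one helper that decodes every array column of a row, including `scores2d`. This is a sketch based on the dtypes and shapes listed in the schema table. The names `ARRAY_COLUMNS` and `decode_row` are hypothetical, and the frame count is inferred from the buffer length rather than read from `num_pose_frames`.

```python
import zlib

import numpy as np

# Column name -> (dtype, shape after the frame axis), taken from the schema table.
ARRAY_COLUMNS = {
    "pose3d": (np.float32, (17, 3)),
    "keypoints2d": (np.float32, (17, 2)),
    "scores2d": (np.float32, (17,)),
    "bboxes": (np.float32, (4,)),
    "joint_angles": (np.float32, (22,)),
    "frame_indices": (np.int32, ()),
    "tracking_confidence": (np.float32, ()),
}


def decode_row(row: dict) -> dict:
    """Decompress every array column of one row into a NumPy array."""
    out = {}
    for name, (dtype, tail) in ARRAY_COLUMNS.items():
        flat = np.frombuffer(zlib.decompress(row[name]), dtype=dtype)
        # -1 lets NumPy infer the frame count F from the buffer length.
        out[name] = flat.reshape((-1, *tail))
    return out
```

Arrays returned by `np.frombuffer` are read-only views, so call `.copy()` on any array you intend to modify in place.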
### Loading with PyArrow directly (for large-scale processing)
```python
import pyarrow.parquet as pq
table = pq.read_table("train-00000-of-00010.parquet")
df = table.to_pandas()
```
## Joint Angle Normalization
Joint angle deltas can be normalized to `[-1, 1]` using the provided statistics in `metadata/action_delta_stats.json`:
```python
import json
import numpy as np
with open("metadata/action_delta_stats.json") as f:
    stats = json.load(f)
action_min = np.array(stats["min"], dtype=np.float32)
action_max = np.array(stats["max"], dtype=np.float32)
action_range = np.maximum(action_max - action_min, 1e-6)
# Compute deltas from current pose
current_angles = joint_angles[0] # [22] - current proprioception
deltas = joint_angles - current_angles # [F, 22]
# Normalize to [-1, 1]
normalized = 2.0 * (deltas - action_min) / action_range - 1.0
```
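A model trained on normalized deltas needs the inverse mapping to turn its outputs back into joint-angle deltas. The sketch below wraps the normalization formula above in a function and adds its algebraic inverse. The function names and the `eps` parameter are assumptions introduced for illustration.

```python
import numpy as np


def normalize_deltas(deltas, action_min, action_max, eps=1e-6):
    """Map deltas from [action_min, action_max] to [-1, 1] per DoF."""
    action_range = np.maximum(action_max - action_min, eps)
    return 2.0 * (deltas - action_min) / action_range - 1.0


def denormalize_deltas(normalized, action_min, action_max, eps=1e-6):
    """Inverse of normalize_deltas: map [-1, 1] back to joint-angle deltas."""
    action_range = np.maximum(action_max - action_min, eps)
    return (normalized + 1.0) * 0.5 * action_range + action_min
```

Adding the recovered deltas to `current_angles` gives absolute joint angles again. Values outside the recorded min/max range map outside `[-1, 1]`, so clip them if your model requires bounded targets.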
## Skeleton Format
Uses the Human3.6M 17-joint skeleton. See `metadata/joint_definitions.json` for the full joint hierarchy, bone connections, and DoF layout.
**Joints:** Hip, R_Hip, R_Knee, R_Ankle, L_Hip, L_Knee, L_Ankle, Spine, Thorax, Neck, Head, L_Shoulder, L_Elbow, L_Wrist, R_Shoulder, R_Elbow, R_Wrist
**22 DoF:** Spine (3), L_Hip (3), L_Knee (1), R_Hip (3), R_Knee (1), L_Shoulder (3), L_Elbow (1), R_Shoulder (3), R_Elbow (1), Neck (3)
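For indexing into `pose3d` and `joint_angles`, it can help to build lookup tables from the two lists above. The sketch below assumes that array indices follow the order in which joints and DoF groups are listed here. That ordering is an assumption, so check it against `metadata/joint_definitions.json` before relying on it. All constant and function names are hypothetical.

```python
# Assumed index order: the order in which joints are listed in this card.
H36M_JOINTS = [
    "Hip", "R_Hip", "R_Knee", "R_Ankle", "L_Hip", "L_Knee", "L_Ankle",
    "Spine", "Thorax", "Neck", "Head",
    "L_Shoulder", "L_Elbow", "L_Wrist", "R_Shoulder", "R_Elbow", "R_Wrist",
]
JOINT_INDEX = {name: i for i, name in enumerate(H36M_JOINTS)}

# (group name, number of DoF), in the order listed in this card.
DOF_GROUPS = [
    ("Spine", 3), ("L_Hip", 3), ("L_Knee", 1), ("R_Hip", 3), ("R_Knee", 1),
    ("L_Shoulder", 3), ("L_Elbow", 1), ("R_Shoulder", 3), ("R_Elbow", 1),
    ("Neck", 3),
]


def build_dof_slices(groups):
    """Turn (name, count) pairs into contiguous slices over the DoF axis."""
    slices, start = {}, 0
    for name, count in groups:
        slices[name] = slice(start, start + count)
        start += count
    return slices


DOF_SLICES = build_dof_slices(DOF_GROUPS)
```

Under these assumptions, `joint_angles[:, DOF_SLICES["L_Knee"]]` would select the left knee angle for every frame, and `pose3d[:, JOINT_INDEX["Head"]]` the head position.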
## Obtaining Kinetics-700 Videos
To pair this pose data with the original videos, download Kinetics-700:
1. Clone the downloader: `git clone https://github.com/cvdfoundation/kinetics-dataset`
2. Download using `youtube_id` and time range from each clip
3. Match clips via the `clip_id` column (format: `{youtube_id}_{time_start}_{time_end}`)
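The `clip_id` format in step 3 can be split back into its parts with a small helper. This is a sketch with a hypothetical function name. It splits from the right because YouTube IDs may themselves contain underscores.

```python
def parse_clip_id(clip_id: str) -> tuple[str, str, str]:
    """Split "{youtube_id}_{time_start}_{time_end}" into its three fields.

    rsplit is used so underscores inside the YouTube ID are preserved.
    """
    youtube_id, time_start, time_end = clip_id.rsplit("_", 2)
    return youtube_id, time_start, time_end
```

The time fields are returned as strings, matching the `time_start` and `time_end` column types in the schema.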
## Processing Pipeline
1. **Person Detection & Tracking** — YOLO + ByteTrack for consistent person tracking
2. **2D Pose Estimation** — ViTPose-H (top-down) for 17 COCO keypoints, converted to H36M format
3. **3D Pose Lifting** — MotionAGFormer with temporal context for 2D→3D lifting
4. **Quality Filtering** — Density checks, dynamic movement checks, hard cut detection
5. **Joint Angle Computation** — Inverse kinematics from 3D positions to 22 DoF
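To give a feel for step 5, the sketch below computes a single hinge-joint angle (for example, knee flexion) directly from three 3D joint positions. It is not the inverse-kinematics procedure used to produce `joint_angles`, whose details are not given in this card. It is a simplified geometric illustration with a hypothetical function name.

```python
import numpy as np


def flexion_angle(parent, joint, child):
    """Angle in radians between the parent->joint and joint->child segments.

    Inputs are [F, 3] arrays of 3D positions; the result has shape [F].
    An angle of 0 means the limb is fully straight.
    """
    upper = joint - parent
    lower = child - joint
    norms = np.linalg.norm(upper, axis=-1) * np.linalg.norm(lower, axis=-1)
    # Guard against zero-length segments without biasing valid frames.
    cos = np.sum(upper * lower, axis=-1) / np.maximum(norms, 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))


# Example, assuming joint indices follow the listed H36M order
# (R_Hip=1, R_Knee=2, R_Ankle=3):
# right_knee = flexion_angle(pose3d[:, 1], pose3d[:, 2], pose3d[:, 3])
```

Ball joints such as hips and shoulders have 3 DoF each and need a full rotation decomposition, which this single-angle sketch does not cover.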
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{movenet332,
title={MoveNet-332: 3D Human Pose and Joint Angles from Kinetics-700},
year={2025},
url={https://huggingface.co/datasets/maxsegan/movenet-332}
}
```
## License
The pose annotations and metadata in this dataset are released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The underlying Kinetics-700 videos are subject to their own license terms.
Provided by:
maxsegan



