maxsegan/movenet-332

Name: maxsegan/movenet-332
Creator: maxsegan
Published: 2026-02-26 19:52:20
License: not stated in the listing (the dataset card below specifies CC-BY-4.0)

Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/maxsegan/movenet-332

Description:

---
language:
- en
license: cc-by-4.0
task_categories:
- robotics
tags:
- pose-estimation
- 3d-pose
- joint-angles
- humanoid
- kinetics-700
- VLA
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "train-*.parquet"
  - split: val
    path: "val-*.parquet"
---

# MoveNet-332: 3D Human Pose & Joint Angles from Kinetics-700

A large-scale dataset of 331,656 video clips from Kinetics-700, processed through a 2D-to-3D pose estimation pipeline (ViTPose + MotionAGFormer). Each clip includes 3D skeleton positions (H36M 17-joint format), 2D keypoints, pre-computed 22-DoF joint angles, and VLM-generated imperative action instructions.

Designed for training Vision-Language-Action (VLA) models for humanoid robots, following approaches similar to GR00T N1 and Helix.

## Key Features

- **331,656 clips** across 700+ action classes from Kinetics-700
- **3D poses** in H36M 17-joint format (typically 300 frames per clip at ~10 fps)
- **22-DoF joint angles** pre-computed via inverse kinematics
- **VLM-generated instructions**: imperative captions describing the movements
- **Quality metadata**: tracking confidence, hard cut detection, quality scores

## Important: Videos Not Included

This dataset contains **pose data and metadata only**. Videos must be obtained separately from the [Kinetics-700 dataset](https://github.com/cvdfoundation/kinetics-dataset). The `youtube_id`, `time_start`, and `time_end` columns allow matching clips back to their source videos.

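The clip ID format documented in this card's "Obtaining Kinetics-700 Videos" section (`{youtube_id}_{time_start}_{time_end}`) can be split back into its parts when the separate columns are not at hand. The following is a minimal sketch, not part of the dataset tooling: `ClipRef`, `parse_clip_id`, and `watch_url` are hypothetical names, and treating the zero-padded timestamps as whole seconds is an assumption based on the usual Kinetics naming.

```python
from typing import NamedTuple


class ClipRef(NamedTuple):
    youtube_id: str
    time_start: int  # assumed to be seconds into the source video
    time_end: int    # assumed to be seconds into the source video


def parse_clip_id(clip_id: str) -> ClipRef:
    """Split '{youtube_id}_{time_start}_{time_end}' from the right.

    YouTube IDs may themselves contain '_' or '-', so only the last two
    underscore-separated fields are treated as timestamps.
    """
    youtube_id, start, end = clip_id.rsplit("_", 2)
    return ClipRef(youtube_id, int(start), int(end))


def watch_url(ref: ClipRef) -> str:
    # Standard YouTube watch URL; the 't' parameter jumps to the assumed start second.
    return f"https://www.youtube.com/watch?v={ref.youtube_id}&t={ref.time_start}s"


ref = parse_clip_id("a4UOhPiV4QE_000675_000685")
print(ref)  # ClipRef(youtube_id='a4UOhPiV4QE', time_start=675, time_end=685)
print(watch_url(ref))
```

Splitting from the right rather than the left is the one design choice that matters here, since a left-to-right split would break on video IDs that contain underscores.
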
## Schema

| Column | Type | Description |
|--------|------|-------------|
| `clip_id` | string | NPZ filename stem (e.g., `a4UOhPiV4QE_000675_000685`) |
| `action_class` | string | Kinetics-700 action class name |
| `youtube_id` | string | YouTube video ID |
| `time_start` | string | Start timestamp from clip ID |
| `time_end` | string | End timestamp from clip ID |
| `split` | string | `"train"` or `"val"` (98/2 hash-based split) |
| `instruction` | string | VLM-generated imperative action caption |
| `fps` | float32 | Original video FPS |
| `num_pose_frames` | int32 | Number of pose frames (typically 300) |
| `video_width` | int32 | Video width in pixels |
| `video_height` | int32 | Video height in pixels |
| `pose3d` | bytes | zlib-compressed float32 `[F, 17, 3]` |
| `keypoints2d` | bytes | zlib-compressed float32 `[F, 17, 2]` |
| `scores2d` | bytes | zlib-compressed float32 `[F, 17]` |
| `bboxes` | bytes | zlib-compressed float32 `[F, 4]` |
| `joint_angles` | bytes | zlib-compressed float32 `[F, 22]` |
| `frame_indices` | bytes | zlib-compressed int32 `[F]` |
| `tracking_confidence` | bytes | zlib-compressed float32 `[F]` |
| `has_hard_cuts` | bool | Whether hard cuts were detected |
| `quality` | float32 | Pose quality score (0–1) |

Array columns are stored as zlib-compressed raw NumPy bytes. Their lengths vary across clips because frame counts differ; `F` in the shapes above is the per-clip frame count (`num_pose_frames`).

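The quality columns in the table above (`has_hard_cuts`, `quality`, `tracking_confidence`) lend themselves to a simple per-clip filter. The sketch below is illustrative only: `keep_clip`, `decode_float32`, and both thresholds are hypothetical choices rather than values recommended by the dataset authors, and the rows are synthetic stand-ins built to match the schema.

```python
import zlib

import numpy as np


def decode_float32(blob: bytes) -> np.ndarray:
    """Decompress a zlib blob into a flat float32 array."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.float32)


def keep_clip(row: dict, min_quality: float = 0.5, min_mean_conf: float = 0.5) -> bool:
    """Return True if a row passes the (hypothetical) quality thresholds."""
    if row["has_hard_cuts"]:
        return False
    if row["quality"] < min_quality:
        return False
    conf = decode_float32(row["tracking_confidence"])  # shape [F]
    return conf.size > 0 and float(conf.mean()) >= min_mean_conf


def fake_row(quality: float, hard_cuts: bool, conf: list) -> dict:
    # Synthetic row that mimics the schema's encoding of tracking_confidence.
    blob = zlib.compress(np.asarray(conf, dtype=np.float32).tobytes())
    return {"quality": quality, "has_hard_cuts": hard_cuts, "tracking_confidence": blob}


rows = [
    fake_row(0.9, False, [0.8, 0.9, 0.7]),  # kept
    fake_row(0.9, True, [0.8, 0.9, 0.7]),   # dropped: hard cut
    fake_row(0.2, False, [0.8, 0.9, 0.7]),  # dropped: low quality score
    fake_row(0.9, False, [0.1, 0.2, 0.1]),  # dropped: low tracking confidence
]
print([keep_clip(r) for r in rows])  # [True, False, False, False]
```

With the `datasets` library, the same predicate could in principle be passed to `ds.filter(keep_clip)`; suitable thresholds depend on the downstream task and should be checked against the actual score distributions.
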
## Loading the Dataset

### Basic usage with HuggingFace datasets

```python
from datasets import load_dataset

ds = load_dataset("maxsegan/movenet-332", split="train")
print(ds[0]["clip_id"], ds[0]["action_class"], ds[0]["instruction"])
```

### Deserializing array columns

```python
import zlib
import numpy as np

row = ds[0]
F = row["num_pose_frames"]

# 3D poses: [F, 17, 3]
pose3d = np.frombuffer(zlib.decompress(row["pose3d"]), dtype=np.float32).reshape(F, 17, 3)

# Joint angles: [F, 22]
joint_angles = np.frombuffer(zlib.decompress(row["joint_angles"]), dtype=np.float32).reshape(F, 22)

# 2D keypoints: [F, 17, 2]
kp2d = np.frombuffer(zlib.decompress(row["keypoints2d"]), dtype=np.float32).reshape(F, 17, 2)

# Bounding boxes: [F, 4] as [x1, y1, x2, y2]
bboxes = np.frombuffer(zlib.decompress(row["bboxes"]), dtype=np.float32).reshape(F, 4)

# Frame indices (maps pose frames to video frames): [F]
frame_indices = np.frombuffer(zlib.decompress(row["frame_indices"]), dtype=np.int32)

# Tracking confidence: [F]
confidence = np.frombuffer(zlib.decompress(row["tracking_confidence"]), dtype=np.float32)
```

### Loading with PyArrow directly (for large-scale processing)

```python
import pyarrow.parquet as pq

table = pq.read_table("train-00000-of-00010.parquet")
df = table.to_pandas()
```

## Joint Angle Normalization

Joint angle deltas can be normalized to `[-1, 1]` using the provided statistics in `metadata/action_delta_stats.json`:

```python
import json
import numpy as np

with open("metadata/action_delta_stats.json") as f:
    stats = json.load(f)

action_min = np.array(stats["min"], dtype=np.float32)
action_max = np.array(stats["max"], dtype=np.float32)
action_range = np.maximum(action_max - action_min, 1e-6)  # guard against division by zero

# `joint_angles` is the [F, 22] array from the deserialization step above.
# Compute deltas from current pose
current_angles = joint_angles[0]  # [22] - current proprioception
deltas = joint_angles - current_angles  # [F, 22]

# Normalize to [-1, 1]
normalized = 2.0 * (deltas - action_min) / action_range - 1.0
```

## Skeleton Format

Uses the Human3.6M 17-joint skeleton. See `metadata/joint_definitions.json` for the full joint hierarchy, bone connections, and DoF layout.

**Joints:** Hip, R_Hip, R_Knee, R_Ankle, L_Hip, L_Knee, L_Ankle, Spine, Thorax, Neck, Head, L_Shoulder, L_Elbow, L_Wrist, R_Shoulder, R_Elbow, R_Wrist

**22 DoF:** Spine (3), L_Hip (3), L_Knee (1), R_Hip (3), R_Knee (1), L_Shoulder (3), L_Elbow (1), R_Shoulder (3), R_Elbow (1), Neck (3)

## Obtaining Kinetics-700 Videos

To pair this pose data with the original videos, download Kinetics-700:

1. Clone the downloader: `git clone https://github.com/cvdfoundation/kinetics-dataset`
2. Download using `youtube_id` and time range from each clip
3. Match clips via the `clip_id` column (format: `{youtube_id}_{time_start}_{time_end}`)

## Processing Pipeline

1. **Person Detection & Tracking**: YOLO + ByteTrack for consistent person tracking
2. **2D Pose Estimation**: ViTPose-H (top-down) for 17 COCO keypoints, converted to H36M format
3. **3D Pose Lifting**: MotionAGFormer with temporal context for 2D→3D lifting
4. **Quality Filtering**: Density checks, dynamic movement checks, hard cut detection
5. **Joint Angle Computation**: Inverse kinematics from 3D positions to 22 DoF

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{movenet332,
  title={MoveNet-332: 3D Human Pose and Joint Angles from Kinetics-700},
  year={2025},
  url={https://huggingface.co/datasets/maxsegan/movenet-332}
}
```

## License

The pose annotations and metadata in this dataset are released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The underlying Kinetics-700 videos are subject to their own license terms.
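
As a companion to the Joint Angle Normalization section, values in `[-1, 1]` (for example, model outputs) can be mapped back to joint angle deltas by inverting the same affine transform. This is a minimal sketch under stated assumptions: `normalize` and `denormalize` are hypothetical helper names, and the uniform ±0.5 bounds are placeholders standing in for the real per-DoF values in `metadata/action_delta_stats.json`.

```python
import numpy as np


def normalize(deltas, action_min, action_max):
    # Same transform as in the Joint Angle Normalization section.
    action_range = np.maximum(action_max - action_min, 1e-6)
    return 2.0 * (deltas - action_min) / action_range - 1.0


def denormalize(normalized, action_min, action_max):
    # Algebraic inverse of normalize(): solve for deltas.
    action_range = np.maximum(action_max - action_min, 1e-6)
    return (normalized + 1.0) / 2.0 * action_range + action_min


# Placeholder bounds; real values would come from action_delta_stats.json.
action_min = np.full(22, -0.5, dtype=np.float32)
action_max = np.full(22, 0.5, dtype=np.float32)

# Synthetic deltas with the [F, 22] layout used throughout this card.
gen = np.random.default_rng(0)
deltas = gen.uniform(-0.5, 0.5, size=(300, 22)).astype(np.float32)

normalized = normalize(deltas, action_min, action_max)
recovered = denormalize(normalized, action_min, action_max)
print(np.allclose(recovered, deltas, atol=1e-5))  # True
```

Deltas that fall outside the recorded min/max will normalize to values outside `[-1, 1]`; whether to clip them is a modeling decision this sketch leaves open.
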

Provider:

maxsegan
