azaracla/community_dataset_v1
Source: Hugging Face | Updated: 2026-04-05 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/azaracla/community_dataset_v1
Resource description:
---
license: apache-2.0
tags:
- robotics
- smolvla
- community
- vlab
- so100
- manipulation
- lerobot
- vision-language-action
- embodied-ai
task_categories:
- robotics
language:
- en
size_categories:
- 100K<n<1M
pretty_name: Community Dataset v1 (v3.0)
---
# Community Dataset v1 (v3.0)
A large-scale community-contributed robotics dataset for vision-language-action learning, featuring **119 datasets** from **52 contributors** worldwide. This is a converted and curated version of the original [HuggingFaceVLA/community_dataset_v1](https://huggingface.co/datasets/HuggingFaceVLA/community_dataset_v1), upgraded to LeRobot v3.0 format.
This dataset was used to pretrain [SmolVLA](https://huggingface.co/lerobot/smolvla_base). It was filtered using specific criteria including fps, minimum number of episodes, and qualitative assessment of video quality, using the [FilterLeRobotData tool](https://huggingface.co/spaces/Beegbrain/FilterLeRobotData).
## 🌟 Overview
This dataset represents a collaborative effort from the robotics and AI community to build comprehensive training data for embodied AI systems. Each contribution contains demonstrations of robotic manipulation tasks with the SO100 arm, recorded using [LeRobot tools](https://github.com/huggingface/lerobot), primarily focused on tabletop scenarios and everyday object interactions.
## 📊 Dataset Statistics
| Metric | Value |
|--------|-------|
| **Total Datasets** | 119 |
| **Total Episodes** | 9,528 |
| **Total Frames** | 4,489,949 |
| **Contributors** | 52 |
| **Average FPS** | 30 |
| **Average Episodes per Dataset** | 80 |
| **Primary Tasks** | Manipulation, Pick & Place, Sorting |
| **Robot Types** | SO-100 (various colors) |
| **Data Format** | LeRobot v3.0 dataset format |
| **Total Size** | ~107 GB |
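The averages in the table can be cross-checked from its totals. The short sketch below recomputes the per-dataset average and derives two figures the table does not state (mean episode length and total recording time). It assumes every episode was recorded at exactly the listed 30 fps, which may not hold for each individual contribution.

```python
# Totals copied from the statistics table above.
TOTAL_DATASETS = 119
TOTAL_EPISODES = 9_528
TOTAL_FRAMES = 4_489_949
FPS = 30  # assumption: treat the listed average FPS as exact for every episode

episodes_per_dataset = TOTAL_EPISODES / TOTAL_DATASETS
frames_per_episode = TOTAL_FRAMES / TOTAL_EPISODES
seconds_per_episode = frames_per_episode / FPS
total_hours = TOTAL_FRAMES / FPS / 3600

print(f"Episodes per dataset: {episodes_per_dataset:.1f}")
print(f"Frames per episode:   {frames_per_episode:.0f}")
print(f"Seconds per episode:  {seconds_per_episode:.1f}")
print(f"Total recording time: {total_hours:.1f} h")
```

Under that assumption, the collection amounts to roughly 41.6 hours of demonstrations, with episodes averaging about 16 seconds each.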
## 🗂️ Structure
The dataset maintains a clear hierarchical structure:
```
community_dataset_v1/
├── contributor1/
│   ├── dataset_name_1/
│   │   ├── data/      # Parquet files with observations
│   │   ├── videos/    # MP4 recordings
│   │   └── meta/      # Metadata and info
│   └── dataset_name_2/
├── contributor2/
│   └── dataset_name_3/
└── ...
```
Each dataset follows the LeRobot v3.0 format standard, ensuring compatibility with existing frameworks and easy integration.
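After downloading, it can be useful to confirm that a local copy matches this layout. The following is a minimal sketch, not part of LeRobot: `find_datasets` and `report` are hypothetical helpers that walk the `contributor/dataset` hierarchy and flag any dataset folder lacking one of the three subdirectories shown in the tree.

```python
from pathlib import Path

EXPECTED_SUBDIRS = ("data", "videos", "meta")  # per the tree above


def _visible_dirs(path):
    """Return sorted subdirectories of path, skipping hidden ones such as a download cache."""
    return sorted(p for p in Path(path).iterdir() if p.is_dir() and not p.name.startswith("."))


def find_datasets(root):
    """Yield (contributor, dataset, missing_subdirs) for each dataset folder under root."""
    for contributor in _visible_dirs(root):
        for dataset in _visible_dirs(contributor):
            missing = [d for d in EXPECTED_SUBDIRS if not (dataset / d).is_dir()]
            yield contributor.name, dataset.name, missing


def report(root):
    """Print one line per dataset, flagging any that lack an expected subdirectory."""
    for contributor, dataset, missing in find_datasets(root):
        status = "ok" if not missing else f"missing: {', '.join(missing)}"
        print(f"{contributor}/{dataset}: {status}")
```

Calling `report("./community_dataset_v1")` prints one status line per dataset; an incomplete download would show up as a `missing:` entry.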
## 🚀 Usage
### Authenticate with Hugging Face
You need to be logged in to access the dataset:
```bash
# Log in to Hugging Face (older CLI releases use `huggingface-cli login`)
hf auth login
# Or alternatively, set your token as an environment variable
# export HF_TOKEN=your_token_here
```
Get your token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
### Download the Dataset
```bash
hf download azaracla/community_dataset_v1 \
  --repo-type=dataset \
  --local-dir /path/local_dir/community_dataset_v1
```
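The full dataset is about 107 GB, so fetching only a few contributions may be preferable. One possible approach is `snapshot_download` from `huggingface_hub` with `allow_patterns`. In the sketch below, `subset_patterns` and `download_subset` are hypothetical helper names, and the repo ID in the example call is taken from this page's title rather than from the original instructions.

```python
def subset_patterns(pairs):
    """Turn (contributor, dataset) pairs into glob patterns covering each dataset folder."""
    return [f"{contributor}/{dataset}/*" for contributor, dataset in pairs]


def download_subset(repo_id, pairs, local_dir):
    """Download only the selected dataset folders from a Hub dataset repo."""
    # Imported lazily so subset_patterns can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=subset_patterns(pairs),
    )


# Example call (not executed here; it needs network access):
# download_subset(
#     "azaracla/community_dataset_v1",   # assumed repo ID, from the page title
#     [("aimihat", "so100_tape")],
#     "./community_dataset_v1",
# )
```

Because the files keep their `contributor/dataset/` prefix under `local_dir`, the local layout stays the same as for a full download.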
### Load Individual Datasets
```python
import os

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Browse available datasets (layout: <root>/<contributor>/<dataset>)
root = "./community_dataset_v1"
for contributor in sorted(os.listdir(root)):
    contributor_path = os.path.join(root, contributor)
    if os.path.isdir(contributor_path):
        for dataset_name in sorted(os.listdir(contributor_path)):
            print(f"📁 {contributor}/{dataset_name}")

# Load a specific dataset (requires authentication)
dataset = LeRobotDataset(
    repo_id="local",
    root="./community_dataset_v1/contributor_name/dataset_name",
)

# Access episodes and observations
print(f"Episodes: {dataset.num_episodes}")
print(f"Total frames: {len(dataset)}")
```
### Integration with SmolVLA pretraining framework
This dataset is designed for training VLA models. After downloading it, you can train with [VLAb](https://github.com/huggingface/VLAb/tree/main), a training framework for vision-language-action models:
1. Visit the VLAb repository.
2. Follow the training instructions in the repo.
3. Point the training script to this dataset:
```bash
accelerate launch --config_file accelerate_configs/multi_gpu.yaml \
  src/lerobot/scripts/train.py \
  --policy.type=smolvla2 \
  --policy.repo_id=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --dataset.repo_id="username/community_dataset_v1/AndrejOrsula/lerobot_double_ball_stacking_random,username/community_dataset_v1/aimihat/so100_tape" \
  --dataset.root="local/path/to/datasets" \
  --dataset.video_backend=pyav \
  --dataset.features_version=2 \
  --output_dir="./outputs/training" \
  --batch_size=8 \
  --steps=200000 \
  --wandb.enable=true \
  --wandb.project="smolvla2-training"
```
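With many datasets selected, the comma-separated `--dataset.repo_id` value becomes tedious to write by hand. The sketch below builds it from a list of contributor and dataset names. The `<repo>/<contributor>/<dataset>` format is inferred only from the example command above, not from the VLAb documentation, and `build_repo_ids` is a hypothetical helper.

```python
def build_repo_ids(hub_repo, pairs):
    """Join '<hub_repo>/<contributor>/<dataset>' entries with commas, as in the command above."""
    return ",".join(f"{hub_repo}/{contributor}/{dataset}" for contributor, dataset in pairs)


# The two datasets used in the example command.
selected = [
    ("AndrejOrsula", "lerobot_double_ball_stacking_random"),
    ("aimihat", "so100_tape"),
]
print(f'--dataset.repo_id="{build_repo_ids("username/community_dataset_v1", selected)}"')
```

The printed flag can be pasted into the launch command in place of the hand-written value.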
## 🔧 Dataset Format (v3.0)
Each dataset contains:
- **`data/`**: Parquet files with timestamped observations
  - Robot states (joint positions, velocities)
  - Action sequences
  - Camera observations (multiple views)
  - Language instructions
- **`videos/`**: Synchronized video recordings
  - Multiple camera angles
  - High-resolution capture
  - Timestamp alignment
- **`meta/`**: Metadata and configuration
  - Dataset info (fps, episode count)
  - Robot configuration
  - Task descriptions
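The dataset info under `meta/` can be inspected without loading any frames. The sketch below assumes each dataset stores it as `meta/info.json` with keys such as `fps`, `total_episodes`, and `total_frames`. Neither the file name nor the key names are stated in this card, so treat them as assumptions and adjust to what your local copy contains. `summarize_info` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_info(dataset_dir):
    """Return a few headline fields from a dataset's meta/info.json (None for absent keys)."""
    info_path = Path(dataset_dir) / "meta" / "info.json"  # assumed file name
    with info_path.open(encoding="utf-8") as f:
        info = json.load(f)
    # Assumed key names; .get() keeps the sketch from failing if a key is absent.
    keys = ("codebase_version", "robot_type", "fps", "total_episodes", "total_frames")
    return {key: info.get(key) for key in keys}
```

For example, `summarize_info("./community_dataset_v1/contributor_name/dataset_name")` returns a small dictionary that can be compared against the filtering criteria mentioned earlier, such as fps and minimum episode count.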
### Key Differences from v2.1
- **Unified data files**: Episodes are concatenated into fewer parquet files (improved I/O)
- **Restructured metadata**: Episodes and stats stored in Parquet format instead of JSONL
- **Improved video organization**: Videos reorganized by camera key for better streaming
## 🎯 Intended Use
This dataset is designed for:
- **Vision-Language-Action (VLA) model training**
- **Robotic manipulation research**
- **Imitation learning experiments**
- **Multi-task policy development**
- **Embodied AI research**
## 🤝 Community Contributions
This dataset exists thanks to the generous contributions from researchers, hobbyists, and institutions worldwide. Each dataset represents hours of careful data collection and curation.
### Contributing Guidelines
Future contributions should follow:
- LeRobot v3.0 dataset format
- Consistent naming conventions for features, camera views, etc.
- Quality validation checks
- Precise task descriptions that state the action being performed

Check the [blog post](https://huggingface.co/blog/lerobot-datasets) for more information.
## 🔗 Related Work
- [VLAb Framework](https://github.com/huggingface/VLAb)
- [SmolVLA model](https://huggingface.co/lerobot/smolvla_base)
- [SmolVLA Blogpost](https://huggingface.co/blog/smolvla)
- [SmolVLA Paper](https://huggingface.co/papers/2506.01844)
- [Docs](https://huggingface.co/docs/lerobot/smolvla)
- [How to Build a successful Robotics dataset with Lerobot?](https://huggingface.co/blog/lerobot-datasets)
- [Original Community Dataset v1 (v2.1)](https://huggingface.co/datasets/HuggingFaceVLA/community_dataset_v1)
---
*Converted and curated with ❤️ by the LeRobot Community*
Provider:
azaracla



