mingjiexie/action100m-preview
Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mingjiexie/action100m-preview
Resource description:
---
language:
- en
license: fair-noncommercial-research-license
size_categories:
- 10M<n<100M
task_categories:
- video-classification
- video-text-to-text
tags:
- video
- action
arxiv: 2601.10592
---
# Action100M: A Large-scale Video Action Dataset
[**Paper**](https://huggingface.co/papers/2601.10592) | [**GitHub**](https://github.com/facebookresearch/Action100M)
Action100M is a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of total duration), yielding ~100 million temporally localized segments with open-vocabulary action supervision and rich captions. It serves as a foundation for scalable research in video understanding and world modeling.
## Load Action100M Annotations
Our data can be loaded from the 🤗 Hugging Face repo at [`facebook/action100m-preview`](https://huggingface.co/datasets/facebook/action100m-preview), where we have released 10% of the full Action100M dataset as a preview. For examples of loading from local parquet files (from a cloned repo) and of visualization, see the [GitHub repo](https://github.com/facebookresearch/action100m).
```python
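from datasets import load_dataset

# Stream the preview parquet shards from the Hub instead of downloading them all.
dataset = load_dataset(
    "parquet",
    data_files="hf://datasets/facebook/Action100M-preview/data/*.parquet",
    streaming=True,
)

# With a single data_files pattern, all shards are exposed as the "train" split.
it = iter(dataset["train"])
sample = next(it)  # all annotations for one video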
```
Each `sample` loaded above contains all annotations for one video, and it has three fields:
* `video_uid` *(string)*: YouTube video id of the source video.
* `metadata` *(dict)*: video-level metadata (title / description / ASR transcript, etc.)
* `nodes` *(list[dict])*: annotations for each segment.
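
As a minimal sketch of how these three fields might be inspected, the snippet below uses a small hand-written stand-in for `sample` so that it runs without downloading anything. The helper name `describe_sample`, the toy values, and the metadata key names `title` and `description` are illustrative assumptions, not taken from the dataset schema.

```python
from typing import Any


def describe_sample(sample: dict[str, Any]) -> dict[str, Any]:
    """Summarize the three top-level fields of one per-video record."""
    nodes = sample.get("nodes") or []
    metadata = sample.get("metadata") or {}
    return {
        "video_uid": sample["video_uid"],
        "metadata_keys": sorted(metadata.keys()),
        "num_nodes": len(nodes),
    }


# Hand-written stand-in record; every value and metadata key here is hypothetical.
toy_sample = {
    "video_uid": "abc123XYZ_0",
    "metadata": {
        "title": "How to fold a paper crane",
        "description": "A short tutorial.",
    },
    "nodes": [
        {"node_id": "n0", "parent_id": None, "level": 0, "start": 0.0, "end": 60.0},
        {"node_id": "n1", "parent_id": "n0", "level": 1, "start": 0.0, "end": 25.0},
    ],
}

summary = describe_sample(toy_sample)
print(summary)
```

The same helper could be called on a `sample` obtained from the streaming loader, assuming the record arrives as a plain Python dict with the fields listed above.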
Each element in `nodes` is a temporally localized segment in the hierarchical Tree-of-Captions. It contains:
* `start`, `end` *(float)*: segment boundaries in seconds within the full video.
* `node_id` *(string)*: unique id of this segment node.
* `parent_id` *(string or null)*: id of the parent segment. The root node (corresponding to the entire video) has `parent_id = null`.
* `level` *(int)*: depth in the hierarchy. Smaller `level` is coarser (longer segments); larger `level` is finer (shorter segments).
* `plm_caption` *(string or null)*: a caption generated by PLM-3B for this segment.
* `plm_action` *(string or null)*: a short action label produced by PLM-3B.
* `llama3_caption` *(string or null)*: caption of the middle frame produced by Llama-3.2-Vision-11B for leaf nodes.
* `gpt` *(dict or null)*: main Action100M annotations, available for segments that are not too short:
    * `gpt["summary"]["brief"]`: one-sentence concise caption of the segment.
    * `gpt["summary"]["detailed"]`: longer, detailed summary of the video segment.
    * `gpt["action"]["brief"]`: short verb phrase naming the step.
    * `gpt["action"]["detailed"]`: imperative-style instruction describing how the action is done.
    * `gpt["action"]["actor"]`: who/what performs the action (noun phrase).
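
The node fields above are enough to rebuild the segment hierarchy and to pull out action labels. The following is a rough sketch under stated assumptions: the helper names `build_children_index` and `brief_actions` are hypothetical, the toy nodes are hand-written rather than real data, and only the subset of fields the helpers need is included.

```python
from collections import defaultdict


def build_children_index(nodes):
    """Map each parent_id to its child nodes, ordered by start time."""
    children = defaultdict(list)
    for node in nodes:
        children[node["parent_id"]].append(node)
    for siblings in children.values():
        siblings.sort(key=lambda n: n["start"])
    return children


def brief_actions(nodes, level=None):
    """Collect (start, end, brief action) for nodes that have a `gpt` annotation."""
    rows = []
    for node in nodes:
        gpt = node.get("gpt")
        if gpt is None:
            continue  # `gpt` is null for segments without these annotations
        if level is not None and node["level"] != level:
            continue
        rows.append((node["start"], node["end"], gpt["action"]["brief"]))
    return sorted(rows)


# Hand-written toy nodes (hypothetical values), deliberately out of temporal order.
toy_nodes = [
    {"node_id": "n0", "parent_id": None, "level": 0, "start": 0.0, "end": 60.0,
     "gpt": {"action": {"brief": "fold a paper crane"}}},
    {"node_id": "n2", "parent_id": "n0", "level": 1, "start": 25.0, "end": 60.0,
     "gpt": None},
    {"node_id": "n1", "parent_id": "n0", "level": 1, "start": 0.0, "end": 25.0,
     "gpt": {"action": {"brief": "crease the paper"}}},
]

children = build_children_index(toy_nodes)
root = children[None][0]  # the root node is the one whose parent_id is null
child_ids = [n["node_id"] for n in children[root["node_id"]]]
level1_actions = brief_actions(toy_nodes, level=1)
print(child_ids, level1_actions)
```

Indexing by `parent_id` keeps the traversal a single pass over `nodes`. Filtering on `level` is one simple way to choose between coarser and finer segments.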
## Citation
```bibtex
@article{chen2026action100m,
title={Action100M: A Large-scale Video Action Dataset},
author={Chen, Delong and Kasarla, Tejaswi and Bang, Yejin and Shukor, Mustafa and Chung, Willy and Yu, Jade and Bolourchi, Allen and Moutakanni, Théo and Fung, Pascale},
journal={arXiv preprint arXiv:2601.10592},
year={2026}
}
```
Provider:
mingjiexie



