sboughorbel/human_behavior_atlas_v2
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sboughorbel/human_behavior_atlas_v2
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- video-classification
- audio-classification
- text-classification
- question-answering
- visual-question-answering
language:
- en
- zh
tags:
- multimodal
- emotion-recognition
- sentiment-analysis
- humor-detection
- mental-health
- video-qa
- reinforcement-learning
- verl
- rl-training
- qwen2.5-omni
- audio
- video
- pose-estimation
- opensmile
pretty_name: Human Behavior Atlas v2
arxiv: 2510.04899
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: train-*.parquet
  - split: validation
    path: validation-*.parquet
  - split: test
    path: test-*.parquet
dataset_info:
  features:
  - name: problem
    dtype: string
  - name: answer
    dtype: string
  - name: images
    sequence: binary
  - name: videos
    sequence: binary
  - name: audios
    sequence: binary
  - name: dataset
    dtype: string
  - name: modality_signature
    dtype: string
  - name: ext_video_feats
    sequence: binary
  - name: ext_audio_feats
    sequence: binary
  - name: task
    dtype: string
  - name: class_label
    dtype: string
  splits:
  - name: train
    num_examples: 74449
  - name: validation
    num_examples: 7646
  - name: test
    num_examples: 18204
---
# Human Behavior Atlas v2
A large-scale multimodal dataset for human behavior understanding, spanning emotion recognition, sentiment analysis, humor detection, mental health screening, and video question answering. The dataset integrates 16 source datasets into a unified schema with audio, video, and pre-extracted features, designed for reinforcement learning training with the [verl](https://github.com/volcengine/verl) framework and multimodal language models such as Qwen2.5-Omni-7B.
## Dataset Summary
| Property | Value |
|---|---|
| Total samples | 100,299 |
| Train split | 74,449 |
| Validation split | 7,646 |
| Test split | 18,204 |
| Source datasets | 16 |
| Modalities | Text, Audio (.wav bytes), Video (.mp4 bytes), OpenSmile features (.pt bytes), Pose features (.pt bytes) — all embedded in parquet |
| Languages | English, Chinese (CHSIMSv2) |
| License | CC BY-NC 4.0 |
## Modality Distribution
| Modality Signature | Samples | Percentage |
|---|---|---|
| text_video_audio | 87,318 | 87.1% |
| text_audio | 10,431 | 10.4% |
| text | 2,550 | 2.5% |
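The distribution above can be re-derived from the `modality_signature` column. The following is a minimal sketch, assuming rows are available as dictionaries (as when iterating a loaded or streamed split). The helper name `modality_distribution` and the toy rows are illustrative only and not part of the dataset tooling.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def modality_distribution(rows: Iterable[Mapping]) -> Dict[str, Tuple[int, float]]:
    """Return {signature: (count, percentage)} over the given rows."""
    counts = Counter(row["modality_signature"] for row in rows)
    total = sum(counts.values())
    return {
        sig: (n, round(100.0 * n / total, 1))
        for sig, n in counts.most_common()
    }


# Toy rows standing in for real samples (hypothetical data).
toy_rows = [
    {"modality_signature": "text_video_audio"},
    {"modality_signature": "text_video_audio"},
    {"modality_signature": "text_audio"},
    {"modality_signature": "text"},
]
dist = modality_distribution(toy_rows)
```

Because the function only iterates once, it should also work on a streamed split without materializing the embedded media in memory beyond the current row.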
## Source Datasets
| Dataset | Samples | Task | Modality | Description |
|---|---|---|---|---|
| **mosei_senti** | 22,740 | Sentiment classification | text_video_audio | CMU-MOSEI sentiment analysis (negative/neutral/positive) |
| **intentqa** | 14,158 | Video QA | text_video_audio | Intent-driven video question answering |
| **meld_senti** | 13,518 | Sentiment classification | text_video_audio | MELD multimodal sentiment (from Friends TV series) |
| **meld_emotion** | 13,350 | Emotion classification | text_video_audio | MELD multimodal emotion recognition (7 classes) |
| **mosei_emotion** | 8,545 | Emotion classification | text_video_audio | CMU-MOSEI emotion recognition (6 classes) |
| **cremad** | 7,442 | Emotion classification | text_audio | CREMA-D acted emotional speech recognition |
| **siq2** | 6,394 | Video QA | text_video_audio | Social IQ 2.0 social intelligence QA |
| **chsimsv2** | 4,384 | Sentiment classification | text_video_audio | CH-SIMS v2 Chinese multimodal sentiment |
| **tess** | 2,800 | Emotion classification | text_audio | Toronto Emotional Speech Set |
| **urfunny** | 2,113 | Humor classification | text_video_audio | UR-Funny multimodal humor detection |
| **mmpsy_depression** | 1,275 | Depression screening | text_video_audio | Multimodal depression assessment |
| **mmpsy_anxiety** | 1,275 | Anxiety screening | text_video_audio | Multimodal anxiety assessment |
| **mimeqa** | 801 | Video QA | text_video_audio | MIME gesture-based QA |
| **mmsd** | 687 | Humor classification | text | Multimodal sarcasm detection (text only) |
| **ptsd_in_the_wild** | 628 | PTSD detection | text_video_audio | PTSD detection from video interviews |
| **daicwoz** | 189 | Depression screening | text_video_audio | DAIC-WOZ clinical depression interviews |
## Task Types
| Task ID | Description | Datasets |
|---|---|---|
| `emotion_cls` | Emotion classification | mosei_emotion, meld_emotion, cremad, tess |
| `sentiment_cls` | Sentiment classification / regression | mosei_senti, meld_senti, chsimsv2 |
| `humor_cls` | Humor and sarcasm detection | urfunny, mmsd |
| `depression` | Depression screening | mmpsy_depression, daicwoz |
| `anxiety` | Anxiety screening | mmpsy_anxiety |
| `ptsd` | PTSD detection | ptsd_in_the_wild |
| `video_qa` | Video question answering | intentqa, siq2, mimeqa |
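For selecting sources per task, the table can be kept as a plain lookup. This sketch simply transcribes the table into a dictionary; `TASK_TO_DATASETS` and `task_of` are illustrative names, not identifiers shipped with the dataset.

```python
# Task ID -> source dataset names, transcribed from the table above.
TASK_TO_DATASETS = {
    "emotion_cls": ["mosei_emotion", "meld_emotion", "cremad", "tess"],
    "sentiment_cls": ["mosei_senti", "meld_senti", "chsimsv2"],
    "humor_cls": ["urfunny", "mmsd"],
    "depression": ["mmpsy_depression", "daicwoz"],
    "anxiety": ["mmpsy_anxiety"],
    "ptsd": ["ptsd_in_the_wild"],
    "video_qa": ["intentqa", "siq2", "mimeqa"],
}


def task_of(dataset_name: str) -> str:
    """Return the task ID that a source dataset belongs to."""
    for task, names in TASK_TO_DATASETS.items():
        if dataset_name in names:
            return task
    raise KeyError(f"unknown source dataset: {dataset_name}")
```

In practice the `task` column already carries this value per row, so the lookup is mainly useful for planning which sources to include before loading any data.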
## Schema
Each row in the Parquet files contains the following columns:
| Column | Type | Description |
|---|---|---|
| `problem` | string | Prompt text with modality markers (`<audio>`, `<video>`) |
| `answer` | string | Ground truth answer |
| `audios` | list[bytes] | Raw .wav audio bytes (embedded) |
| `videos` | list[bytes] | Raw .mp4 video bytes (embedded) |
| `images` | list[bytes] | Image bytes (currently unused) |
| `dataset` | string | Source dataset name |
| `modality_signature` | string | Modality combination: `text_video_audio`, `text_audio`, or `text` |
| `ext_video_feats` | list[bytes] | Pose estimation feature tensors (.pt bytes, embedded) |
| `ext_audio_feats` | list[bytes] | OpenSmile audio feature tensors (.pt bytes, embedded) |
| `task` | string | Task type identifier |
| `class_label` | string | Classification label |
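A quick sanity check on a row is to compare the modality markers in `problem` with the embedded media lists. The sketch below assumes one marker per embedded clip, which the card does not state explicitly, so treat a mismatch as a prompt to inspect the row rather than as a definite error. The helper names are hypothetical.

```python
import re
from typing import Dict, Mapping

# Matches the two marker kinds named in the schema table.
MARKER_RE = re.compile(r"<(audio|video)>")


def count_markers(problem: str) -> Dict[str, int]:
    """Count <audio> and <video> markers in a prompt string."""
    counts = {"audio": 0, "video": 0}
    for kind in MARKER_RE.findall(problem):
        counts[kind] += 1
    return counts


def markers_match_media(row: Mapping) -> bool:
    """True when marker counts equal the lengths of the media lists.

    ASSUMPTION: each marker corresponds to exactly one embedded clip.
    Missing or None media columns are treated as empty lists.
    """
    counts = count_markers(row["problem"])
    n_audio = len(row.get("audios") or [])
    n_video = len(row.get("videos") or [])
    return counts["audio"] == n_audio and counts["video"] == n_video
```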
## Repository Structure
```
sboughorbel/human_behavior_atlas_v2/
  train-00000-of-XXXXX.parquet   # Sharded parquet with embedded audio/video
  train-00001-of-XXXXX.parquet
  ...
  validation-*.parquet
  test-*.parquet
```
All data, including audio, video, and pre-extracted features, is fully embedded in the parquet files; no separate download or extraction step is needed.
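After a local download, shard files can be grouped by split from their names alone. This is a sketch that assumes every split follows the `<split>-NNNNN-of-NNNNN.parquet` pattern shown for `train`; the regular expression and the `group_shards` helper are illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# ASSUMPTION: validation and test shards use the same naming as train shards.
SHARD_RE = re.compile(r"^(train|validation|test)-(\d+)-of-(\d+)\.parquet$")


def group_shards(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group shard filenames by split, ordered by shard index."""
    groups = defaultdict(list)
    for name in filenames:
        match = SHARD_RE.match(name)
        if match:  # silently skip README files and other non-shard entries
            groups[match.group(1)].append((int(match.group(2)), name))
    return {split: [n for _, n in sorted(items)] for split, items in groups.items()}
```

The input can come from `os.listdir` on the download directory; files that do not match the pattern are ignored.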
## Usage
### Loading with HuggingFace Datasets
```python
from datasets import load_dataset
# Stream without downloading everything
ds = load_dataset("sboughorbel/human_behavior_atlas_v2", split="train", streaming=True)
sample = next(iter(ds))
# Load a subset
ds_100 = load_dataset("sboughorbel/human_behavior_atlas_v2", split="train[:100]")
# Filter by task or modality
emotion_ds = ds_100.filter(lambda x: x["task"] == "emotion_cls")
```
### Accessing Embedded Media
```python
import io
import soundfile as sf
sample = ds_100[0]
# Audio is raw bytes: decode with soundfile or torchaudio
if sample["audios"]:
    audio_data, sr = sf.read(io.BytesIO(sample["audios"][0]))
# Video is raw bytes: decode with decord, opencv, or write to a temp file
if sample["videos"]:
    video_bytes = sample["videos"][0]
    # e.g., with decord:
    # from decord import VideoReader
    # vr = VideoReader(io.BytesIO(video_bytes))
```
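For decoders that expect a file path rather than an in-memory buffer, the "write to a temp file" option mentioned above can be wrapped in a small context manager. This is a minimal sketch using only the standard library; `temp_media_file` is a hypothetical helper name.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temp_media_file(data: bytes, suffix: str = ".mp4"):
    """Write bytes to a temporary file, yield its path, delete it afterwards."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # close first so other libraries can open the path
        yield tmp.name
    finally:
        tmp.close()
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
```

Inside the `with` block the yielded path can be handed to a path-based reader, for example `cv2.VideoCapture(path)` if OpenCV is installed; pass `suffix=".wav"` for audio bytes.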
### Download and Setup
```bash
# Download full dataset
huggingface-cli download sboughorbel/human_behavior_atlas_v2 \
    --repo-type dataset --local-dir /path/to/data

# Or download specific splits only
huggingface-cli download sboughorbel/human_behavior_atlas_v2 \
    --repo-type dataset --local-dir /path/to/data \
    --include "train-*.parquet"
```
### Integration with verl RL Training
This dataset is designed for RL training with [verl](https://github.com/volcengine/verl) using Qwen2.5-Omni-7B. The `problem` field contains structured prompts with `<audio>` and `<video>` modality markers. Audio, video, and feature tensors are stored as bytes in the parquet files and loaded from there directly, so no path resolution is needed.
```bash
# verl training config
python3 -m verl.trainer.main_ppo \
    data.train_files=/path/to/data/train-*.parquet \
    data.val_files=/path/to/data/validation-*.parquet \
    data.prompt_key=problem \
    data.image_key=images \
    data.video_key=videos \
    data.modalities='audio,videos' \
    ...
```
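RL training also needs a reward signal, which this card does not specify. As a purely illustrative sketch, an exact-match reward against the `answer` column might look like the following. The argument list follows the custom reward function interface as commonly described for verl, which is an assumption to verify against the verl version in use. The normalization and the 0/1 scoring are also assumptions and not the authors' reward.

```python
import string


def _normalize(text: str) -> str:
    """Lowercase and trim surrounding whitespace and punctuation."""
    return text.strip().strip(string.punctuation + string.whitespace).lower()


def compute_score(data_source, solution_str, ground_truth, extra_info=None) -> float:
    """Return 1.0 when the model output equals the ground truth, else 0.0.

    ASSUMPTION: solution_str holds only the final answer text; real model
    outputs may need answer extraction before comparison.
    """
    return 1.0 if _normalize(solution_str) == _normalize(ground_truth) else 0.0
```

Exact match is a reasonable starting point for the classification tasks; the video QA sources may call for a softer comparison.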
## Citation
If you use this dataset, please cite the following paper:
```bibtex
@article{Ong2025HumanBehavior,
title={Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding},
author={Ong, Keane and Dai, Wei and Li, Carol and Feng, Dewei and Li, Hengzhi and Wu, Jingyao and Cheong, Jiaee and Mao, Rui and Mengaldo, Gianmarco and Cambria, Erik and Liang, Paul Pu},
journal={arXiv preprint arXiv:2510.04899},
year={2025}
}
```
> Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, Gianmarco Mengaldo, Erik Cambria, Paul Pu Liang. "Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding." ICLR 2026. [arXiv:2510.04899](https://arxiv.org/abs/2510.04899)
Please also cite the individual source datasets as appropriate:
- CMU-MOSEI: Zadeh et al., "Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph", ACL 2018
- MELD: Poria et al., "MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations", ACL 2019
- CREMA-D: Cao et al., "CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset", IEEE TAC 2014
- DAIC-WOZ: Gratch et al., "The Distress Analysis Interview Corpus of Human and Computer Interviews", LREC 2014
- CH-SIMS v2: Liu et al., "Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module", ICMI 2022
## License
This dataset is released under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license. Individual source datasets may have their own licensing terms; please consult the original dataset publications for details.
Provider: sboughorbel



