HJOK/openetd-metadata
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/HJOK/openetd-metadata
---
license: cc-by-4.0
language: en
task_categories:
- audio-classification
tags:
- end-turn-detection
- turn-taking
- spoken-dialogue
- conversation
pretty_name: OpenETD Metadata
size_categories:
- 100K<n<1M
---
# OpenETD metadata
Train / dev / test split CSVs for the OpenETD dataset released with the ACL 2026 Findings paper *Speculative End-Turn Detector for Efficient Speech Chatbot Assistant* ([arXiv:2503.23439](https://arxiv.org/abs/2503.23439)).
Code: https://github.com/HJ-Ok/OpenETD
## Split sizes
| Split | Real files | Real hours | Synthetic files | Synthetic hours |
|-------|------------|------------|-----------------|-----------------|
| train | 6,290 | 117.2 | 96,773 | 116.8 |
| dev | 899 | 16.2 | 12,840 | 15.8 |
| test | 1,798 | 32.4 | 12,868 | 15.7 |
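As a quick sanity check on the table above, the following sketch derives the total row count and the average file duration per split. It uses only the numbers in the table; the helper names `mean_seconds` and `totals` are illustrative and not part of the OpenETD code.

```python
# (files, hours) per split, copied from the split-size table.
REAL = {"train": (6290, 117.2), "dev": (899, 16.2), "test": (1798, 32.4)}
SYNTHETIC = {"train": (96773, 116.8), "dev": (12840, 15.8), "test": (12868, 15.7)}


def mean_seconds(files, hours):
    """Average duration of one file, in seconds."""
    return hours * 3600 / files


def totals(table):
    """Sum files and hours over all splits of one subset."""
    files = sum(f for f, _ in table.values())
    hours = sum(h for _, h in table.values())
    return files, hours


if __name__ == "__main__":
    for name, table in (("real", REAL), ("synthetic", SYNTHETIC)):
        files, hours = totals(table)
        print(f"{name}: {files} files, {hours:.1f} h in total")
        for split, (n, h) in table.items():
            print(f"  {split}: {mean_seconds(n, h):.1f} s per file on average")
```

Real files average roughly one minute each, while synthetic files average a few seconds. The two subsets together hold 131,468 rows, which is consistent with the `100K<n<1M` size category.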
## Columns
| Column | Description |
|------------------|------------------------------------------------------------------------|
| `file_path` | Relative path to the audio file; resolve it against your local audio directory. |
| `pause_times` | Interval list `(start, end), ...` of within-speaker pauses (seconds). |
| `gap_times` | Interval list `(start, end), ...` of between-speaker gaps (seconds). |
| `contains_pause` | Boolean, whether the file contains any pause. |
| `contains_gap` | Boolean, whether the file contains any gap. |
| `label` | Type of the final silence (`Pause` or `Gap`); used for the binary task.|
| `platform` | (Real only) `buckeye` or `youtube`. |
| `kfold` | (Synthetic only) k-fold assignment used for pause/gap label generation.|
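The interval columns arrive from the CSVs as strings. Below is a minimal parsing sketch, assuming each cell looks like `(0.52, 1.10), (3.40, 3.95)` and that an empty or missing cell means no intervals. The exact serialization is an assumption, so check it against a few rows before relying on it; `parse_intervals` and `total_silence` are hypothetical helper names.

```python
import re

# One decimal number, optionally signed, optionally in scientific notation.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# One "(start, end)" pair; surrounding brackets or extra commas are ignored.
_INTERVAL = re.compile(rf"\(\s*({_NUM})\s*,\s*({_NUM})\s*\)")


def parse_intervals(cell):
    """Return a list of (start, end) float tuples parsed from one CSV cell."""
    if not isinstance(cell, str):
        return []  # missing value (None or NaN)
    return [(float(start), float(end)) for start, end in _INTERVAL.findall(cell)]


def total_silence(cell):
    """Total duration in seconds covered by the intervals in one cell."""
    return sum(end - start for start, end in parse_intervals(cell))
```

For example, `parse_intervals(row["pause_times"])` yields the within-speaker pauses of one row, and `total_silence(row["gap_times"])` the summed gap duration.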
## Audio files
**Audio is NOT included** in this repository; we redistribute only the annotations and split assignments. To obtain the audio:
- **Buckeye audio**: obtain from the [Buckeye Corpus](https://buckeyecorpus.osu.edu/) maintainers under their Academic License, then place files under `data/real/audio/buckeye_full/`.
- **YouTube audio**: download with the helper script in the [OpenETD repository](https://github.com/HJ-Ok/OpenETD) (`scripts/prepare_data.sh`).
- **Synthetic audio**: regenerate on your own Google Cloud account using `data/synthetic_pipeline/generate.py` in the OpenETD repository.
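Once the audio is in place, it can help to check which `file_path` entries actually resolve on disk. The sketch below assumes that `file_path` is relative to a single local root directory of your choosing; the real layout is defined by the OpenETD repository, so treat `audio_root` and both helper names as hypothetical.

```python
from pathlib import Path


def resolve_audio(file_path, audio_root):
    """Join a relative `file_path` value onto a local audio root."""
    return Path(audio_root) / file_path


def missing_audio(file_paths, audio_root):
    """Return the `file_path` values that do not exist under `audio_root`."""
    return [p for p in file_paths if not resolve_audio(p, audio_root).is_file()]
```

Running `missing_audio` over the `file_path` column of a split returns an empty list when every file is present.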
## Quick start
```python
from datasets import load_dataset

ds = load_dataset("HJOK/openetd-metadata", data_files={
    "real_train": "real/train.csv",
    "real_valid": "real/valid.csv",
    "real_test": "real/test.csv",
    "syn_train": "synthetic/train.csv",
    "syn_valid": "synthetic/valid.csv",
    "syn_test": "synthetic/test.csv",
})
print(ds["real_test"][0])
```
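After loading, a common next step is to inspect the class balance and turn `label` into an integer target for the binary task. The sketch below works on any iterable of row dictionaries, which is what iterating over a loaded split yields. The mapping `Pause` to 0 and `Gap` to 1 is an assumption made for illustration, not an encoding confirmed by the dataset authors.

```python
from collections import Counter

# Assumed encoding: 0 = Pause (within-speaker silence), 1 = Gap (between-speaker silence).
LABEL_TO_ID = {"Pause": 0, "Gap": 1}


def encode_label(row):
    """Return a copy of the row with an added integer `label_id` field."""
    return {**row, "label_id": LABEL_TO_ID[row["label"]]}


def label_counts(rows):
    """Count how many rows carry each `label` value."""
    return Counter(row["label"] for row in rows)
```

With the `ds` object from the quick start, `label_counts(ds["real_test"])` reports the class balance, and `ds["real_test"].map(encode_label)` should add the `label_id` column.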
## License
- Annotations (this repository): **CC BY 4.0**
- Code in the OpenETD GitHub repository: **MIT**
- External audio sources retain their original licenses (see `DATA_LICENSES.md` in the GitHub repo).
## Citation
```bibtex
@inproceedings{ok2026speculativeetd,
title = {Speculative End-Turn Detector for Efficient Speech Chatbot Assistant},
author = {Ok, Hyunjong and Yoo, Suho and Lee, Jaeho},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
year = {2026},
url = {https://arxiv.org/abs/2503.23439}
}
```