
HJOK/openetd-metadata

Source: Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/HJOK/openetd-metadata
Description:
---
license: cc-by-4.0
language: en
task_categories:
- audio-classification
tags:
- end-turn-detection
- turn-taking
- spoken-dialogue
- conversation
pretty_name: OpenETD Metadata
size_categories:
- 100K<n<1M
---

# OpenETD metadata

Train / dev / test split CSVs for the OpenETD dataset released with the ACL 2026 Findings paper *Speculative End-Turn Detector for Efficient Speech Chatbot Assistant* ([arXiv:2503.23439](https://arxiv.org/abs/2503.23439)).

Code: https://github.com/HJ-Ok/OpenETD

## Split sizes

| Split | Real files | Real hours | Synthetic files | Synthetic hours |
|-------|------------|------------|-----------------|-----------------|
| train | 6,290      | 117.2      | 96,773          | 116.8           |
| dev   | 899        | 16.2       | 12,840          | 15.8            |
| test  | 1,798      | 32.4       | 12,868          | 15.7            |

## Columns

| Column           | Description                                                             |
|------------------|-------------------------------------------------------------------------|
| `file_path`      | Relative path to the audio file (resolve locally).                      |
| `pause_times`    | Interval list `(start, end), ...` of within-speaker pauses (seconds).   |
| `gap_times`      | Interval list `(start, end), ...` of between-speaker gaps (seconds).    |
| `contains_pause` | Boolean, whether the file contains any pause.                           |
| `contains_gap`   | Boolean, whether the file contains any gap.                             |
| `label`          | Type of the final silence (`Pause` or `Gap`); used for the binary task. |
| `platform`       | (Real only) `buckeye` or `youtube`.                                     |
| `kfold`          | (Synthetic only) k-fold assignment used for pause/gap label generation. |

## Audio files

**Audio is NOT included** in this repository; we redistribute only the annotations and split assignments. To obtain the audio:

- **Buckeye audio**: obtain from the [Buckeye Corpus](https://buckeyecorpus.osu.edu/) maintainers under their Academic License, then place files under `data/real/audio/buckeye_full/`.
- **YouTube audio**: download with the helper script in the [OpenETD repository](https://github.com/HJ-Ok/OpenETD) (`scripts/prepare_data.sh`).
- **Synthetic audio**: regenerate on your own Google Cloud account using `data/synthetic_pipeline/generate.py` in the OpenETD repository.

## Quick start

```python
from datasets import load_dataset

ds = load_dataset("HJOK/openetd-metadata", data_files={
    "real_train": "real/train.csv",
    "real_valid": "real/valid.csv",
    "real_test": "real/test.csv",
    "syn_train": "synthetic/train.csv",
    "syn_valid": "synthetic/valid.csv",
    "syn_test": "synthetic/test.csv",
})
print(ds["real_test"][0])
```

## License

- Annotations (this repository): **CC BY 4.0**
- Code in the OpenETD GitHub repository: **MIT**
- External audio sources retain their original licenses (see `DATA_LICENSES.md` in the GitHub repo).

## Citation

```bibtex
@inproceedings{ok2026speculativeetd,
  title     = {Speculative End-Turn Detector for Efficient Speech Chatbot Assistant},
  author    = {Ok, Hyunjong and Yoo, Suho and Lee, Jaeho},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
  url       = {https://arxiv.org/abs/2503.23439}
}
```
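
## Example: parsing the interval columns

The `pause_times` and `gap_times` columns are described above as interval lists of the form `(start, end), ...` in seconds. The card does not spell out how those lists are serialized in the CSV cells, so the following is only a minimal sketch that assumes each cell is a string of parenthesized decimal pairs such as `(1.20, 1.85), (4.00, 4.50)`. The helper names `parse_intervals` and `total_duration` are hypothetical and are not part of the OpenETD code.

```python
import re
from typing import List, Optional, Tuple

# Matches one "(start, end)" pair of plain decimal numbers.
# ASSUMPTION: this serialization is inferred from the column description,
# not confirmed against the actual CSV files.
_INTERVAL_RE = re.compile(
    r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)"
)


def parse_intervals(cell: Optional[str]) -> List[Tuple[float, float]]:
    """Parse an interval-list cell into (start, end) tuples in seconds.

    Empty, missing, or non-matching cells yield an empty list.
    """
    if cell is None:
        return []
    return [(float(s), float(e)) for s, e in _INTERVAL_RE.findall(str(cell))]


def total_duration(intervals: List[Tuple[float, float]]) -> float:
    """Sum the lengths of all intervals, in seconds."""
    return sum(end - start for start, end in intervals)


# Illustrative cell value (made up for this sketch, not taken from the data).
example_cell = "(1.20, 1.85), (4.00, 4.50)"
pauses = parse_intervals(example_cell)
print(pauses)
print(round(total_duration(pauses), 2))
```

If the real cells use a different layout (for example scientific notation or square brackets), the regular expression would need to be adjusted accordingly.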

Provider:
HJOK