HJOK/openetd-metadata

Name: HJOK/openetd-metadata
Creator: HJOK
Published: 2026-04-20 16:15:03
License: CC BY 4.0 (per the dataset card)

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/HJOK/openetd-metadata

Description:

---
license: cc-by-4.0
language: en
task_categories:
- audio-classification
tags:
- end-turn-detection
- turn-taking
- spoken-dialogue
- conversation
pretty_name: OpenETD Metadata
size_categories:
- 100K<n<1M
---

# OpenETD metadata

Train / dev / test split CSVs for the OpenETD dataset released with the ACL 2026 Findings paper *Speculative End-Turn Detector for Efficient Speech Chatbot Assistant* ([arXiv:2503.23439](https://arxiv.org/abs/2503.23439)).

Code: https://github.com/HJ-Ok/OpenETD

## Split sizes

| Split | Real files | Real hours | Synthetic files | Synthetic hours |
|-------|------------|------------|-----------------|-----------------|
| train | 6,290      | 117.2      | 96,773          | 116.8           |
| dev   | 899        | 16.2       | 12,840          | 15.8            |
| test  | 1,798      | 32.4       | 12,868          | 15.7            |

## Columns

| Column           | Description                                                             |
|------------------|-------------------------------------------------------------------------|
| `file_path`      | Relative path to the audio file (resolve locally).                      |
| `pause_times`    | Interval list `(start, end), ...` of within-speaker pauses (seconds).   |
| `gap_times`      | Interval list `(start, end), ...` of between-speaker gaps (seconds).    |
| `contains_pause` | Boolean, whether the file contains any pause.                           |
| `contains_gap`   | Boolean, whether the file contains any gap.                             |
| `label`          | Type of the final silence (`Pause` or `Gap`); used for the binary task. |
| `platform`       | (Real only) `buckeye` or `youtube`.                                     |
| `kfold`          | (Synthetic only) k-fold assignment used for pause/gap label generation. |

## Audio files

**Audio is NOT included** in this repository; we redistribute only the annotations and split assignments. To obtain the audio:

- **Buckeye audio**: obtain from the [Buckeye Corpus](https://buckeyecorpus.osu.edu/) maintainers under their Academic License, then place files under `data/real/audio/buckeye_full/`.
- **YouTube audio**: download with the helper script in the [OpenETD repository](https://github.com/HJ-Ok/OpenETD) (`scripts/prepare_data.sh`).
- **Synthetic audio**: regenerate on your own Google Cloud account using `data/synthetic_pipeline/generate.py` in the OpenETD repository.

## Quick start

```python
from datasets import load_dataset

ds = load_dataset("HJOK/openetd-metadata", data_files={
    "real_train": "real/train.csv",
    "real_valid": "real/valid.csv",
    "real_test": "real/test.csv",
    "syn_train": "synthetic/train.csv",
    "syn_valid": "synthetic/valid.csv",
    "syn_test": "synthetic/test.csv",
})
print(ds["real_test"][0])
```

## License

- Annotations (this repository): **CC BY 4.0**
- Code in the OpenETD GitHub repository: **MIT**
- External audio sources retain their original licenses (see `DATA_LICENSES.md` in the GitHub repo).

## Citation

```bibtex
@inproceedings{ok2026speculativeetd,
  title     = {Speculative End-Turn Detector for Efficient Speech Chatbot Assistant},
  author    = {Ok, Hyunjong and Yoo, Suho and Lee, Jaeho},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
  url       = {https://arxiv.org/abs/2503.23439}
}
```
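
## Example: parsing interval columns

The `pause_times` and `gap_times` columns are described above as interval lists of the form `(start, end), ...` in seconds. When loaded from CSV they will most likely arrive as plain strings, so a small parser may be useful. The sketch below is not part of the OpenETD code; the helper names `parse_intervals` and `total_duration` are hypothetical, and it assumes each cell looks like `(0.5, 1.25), (3.0, 3.5)`. The exact serialization in the released CSVs may differ, so check a few rows before relying on it.

```python
import re
from typing import List, Tuple

# Assumption: a cell looks like "(0.5, 1.25), (3.0, 3.5)". Surrounding
# brackets are tolerated, but other formats (e.g. scientific notation)
# are not handled by this pattern.
_INTERVAL_RE = re.compile(
    r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)"
)


def parse_intervals(cell) -> List[Tuple[float, float]]:
    """Parse an interval-list string into (start, end) tuples in seconds."""
    if not isinstance(cell, str):
        # Empty CSV cells may load as None or NaN; treat them as "no intervals".
        return []
    return [(float(start), float(end)) for start, end in _INTERVAL_RE.findall(cell)]


def total_duration(intervals: List[Tuple[float, float]]) -> float:
    """Sum the lengths of all intervals, in seconds."""
    return sum(end - start for start, end in intervals)


example_cell = "(0.5, 1.25), (3.0, 3.5)"  # made-up value for illustration
pauses = parse_intervals(example_cell)
print(pauses)                  # [(0.5, 1.25), (3.0, 3.5)]
print(total_duration(pauses))  # 1.25
```

With the Quick start dataset loaded, the same helper could be applied per row, for example `parse_intervals(ds["real_test"][0]["pause_times"])`, provided the column is read as a string.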


Provider:

HJOK
