
kyutai/HaluEvalAudio_1000

Source: Hugging Face. Updated 2026-04-15. Indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/kyutai/HaluEvalAudio_1000
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: knowledge
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 799855759
    num_examples: 1000
  download_size: 672236428
  dataset_size: 799855759
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
license:
- cc-by-nc-4.0
- mit
task_categories:
- audio-text-to-text
language:
- en
size_categories:
- 1K<n<10K
pretty_name: 'HaluEvalAudio 1000'
---

# HaluEvalAudio 1000 Dataset

<p align="center">
  <img src="haluevalaudio_1000.png" width="500" alt="HaluEvalAudio 1000 Logo">
</p>

## Dataset Description

**HaluEvalAudio 1000** is a specialized speech-based question-answering dataset designed to benchmark **general multimodal & audio-focused language models** as well as **retrieval-augmented audio language models**. Compared to common QA benchmarks such as Llama Questions, Web Questions, or TriviaQA, HaluEvalAudio 1000 introduces more challenging questions and topics and is specifically structured for Retrieval-Augmented Generation **(RAG) evaluation**.

The dataset is derived from the [HaluEval dataset](https://aclanthology.org/2023.emnlp-main.397/). A key feature is that it provides **ground-truth references** in text format. This enables two complementary evaluation setups for RAG models:

1. End-to-End RAG, which relies on the model's internal retrieval pipeline.
2. Oracle-Aided Generation, an ablation in which the ground-truth context is supplied to the model, allowing researchers to isolate retrieval quality from downstream generative performance.

---

## Dataset Summary

* **Source:** The `qa` subset of the HaluEval dataset.
* **Total instances:** 1,000 WAV audio files synthesized with Kyutai's TTS model, each paired with a reference (`knowledge`) and a ground-truth answer in text format.

---

## Data Format

Each entry in the dataset contains the following fields:

- `audio`: The synthesized WAV file containing the spoken question.
- `text`: The text transcription of the `audio` question.
- `knowledge`: The ground-truth textual knowledge from the original HaluEval dataset.
- `answer`: The ground-truth textual answer from the original HaluEval dataset.

---

## Dataset Construction

We extract the first 1,000 instances from the `qa` subset of the HaluEval dataset. We convert the textual questions into audio with Kyutai's multistream TTS model, using speaker voices randomly sampled from the [Common Voice](https://www.mozillafoundation.org/en/common-voice/) dataset. The same TTS model is used to generate MoshiRAG's training data, but with voices sampled from a different dataset. The original textual knowledge and ground-truth answers from HaluEval are preserved, while the `hallucinated_answer` field is removed for simplicity.

---

## Citations

If you use this dataset, please cite:

```bibtex
@misc{chien2026moshirag,
  title={MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models},
  author={Chung-Ming Chien and Manu Orsini and Eugene Kharitonov and Neil Zeghidour and Karen Livescu and Alexandre D{\'e}fossez},
  year={2026},
  eprint={2604.12928},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.12928},
}
```

## Acknowledgment & Licensing

This dataset is a derivative work. The audio files are licensed under CC BY-NC 4.0. The source text is derived from the HaluEval dataset, which is licensed under the MIT License (Copyright (c) 2020 RUCAIBox).
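
The two evaluation setups from the Dataset Description can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the helper names (`load_test_split`, `build_text_context`, `contains_answer`, `accuracy`) are hypothetical, the lenient containment metric is an assumption rather than the metric used in the MoshiRAG paper, and loading assumes the Hugging Face `datasets` library with audio decoding support is installed. Only the repository name, the `test` split, and the four field names come from the card.

```python
import re
from typing import Optional, Sequence

DATASET_ID = "kyutai/HaluEvalAudio_1000"  # repository name taken from this card


def load_test_split():
    """Load the single `test` split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helpers below stay usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="test")


def build_text_context(example: dict, oracle: bool) -> Optional[str]:
    """Return the text context to pass alongside the spoken question.

    oracle=True  -> Oracle-Aided Generation: supply the ground-truth `knowledge`.
    oracle=False -> End-to-End RAG: supply nothing; the model retrieves on its own.
    """
    return example["knowledge"] if oracle else None


def normalize(text: str) -> str:
    """Lowercase, turn non-alphanumeric runs into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def contains_answer(prediction: str, answer: str) -> bool:
    """Assumed lenient metric: the normalized answer appears as whole words in the prediction."""
    norm_answer = normalize(answer)
    if not norm_answer:
        return False
    # Padding with spaces enforces word boundaries ("no" does not match "know").
    return f" {norm_answer} " in f" {normalize(prediction)} "


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that contain their reference answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(contains_answer(p, a) for p, a in zip(predictions, answers))
    return hits / len(answers)
```

Running a model once with `oracle=False` and once with `oracle=True` over the same examples, then comparing the two `accuracy` values, gives a rough view of how much of the error comes from retrieval rather than generation. How the audio in `example["audio"]` and the optional context are fed to a model depends on that model's own interface and is left out here.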

Provider:
kyutai