kyutai/HaluEvalAudio_1000

Name: kyutai/HaluEvalAudio_1000
Creator: kyutai
Published: 2026-04-15 14:22:09
License: No description provided

Source: Hugging Face. Updated 2026-04-15; indexed 2026-05-10.

Download link:

https://hf-mirror.com/datasets/kyutai/HaluEvalAudio_1000

Resource description:

---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: knowledge
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 799855759
    num_examples: 1000
  download_size: 672236428
  dataset_size: 799855759
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
license:
- cc-by-nc-4.0
- mit
task_categories:
- audio-text-to-text
language:
- en
size_categories:
- 1K<n<10K
pretty_name: 'HaluEvalAudio 1000'
---

# HaluEvalAudio 1000 Dataset

<p align="center">
  <img src="haluevalaudio_1000.png" width="500" alt="HaluEvalAudio 1000 Logo">
</p>

## Dataset Description

**HaluEvalAudio 1000** is a speech-based question-answering dataset designed to benchmark **general multimodal & audio-focused language models** as well as **retrieval-augmented audio language models**. Compared to common QA benchmarks such as Llama Questions, Web Questions, or TriviaQA, HaluEvalAudio 1000 introduces more challenging questions and topics and is specifically structured for Retrieval-Augmented Generation **(RAG) evaluation**.

The dataset is derived from the [HaluEval dataset](https://aclanthology.org/2023.emnlp-main.397/). A key feature is that it provides **ground-truth references** in text format. This enables two complementary evaluation setups for RAG models:

1. **End-to-End RAG**, which uses the model’s internal retrieval pipeline.
2. **Oracle-Aided Generation**, which supplies the ground-truth context as an ablation, allowing researchers to isolate retrieval quality from downstream generative performance.

---

## Dataset Summary

* **Source:** The `qa` subset of the HaluEval dataset.
* **Total instances:** 1,000 WAV audio files synthesized with Kyutai's TTS model, each paired with reference knowledge and a ground-truth answer in text format.

---

## Data Format

Each entry in the dataset contains the following fields:

- `audio`: The synthesized WAV file containing the spoken question.
- `text`: The text transcription of the `audio` question.
- `knowledge`: The ground-truth textual knowledge from the original HaluEval dataset.
- `answer`: The ground-truth textual answer from the original HaluEval dataset.

---

## Dataset Construction

We extract the first 1,000 instances from the `qa` subset of the HaluEval dataset. To convert the textual questions into audio, we use Kyutai's multistream TTS model with speaker voices randomly sampled from the [Common Voice](https://www.mozillafoundation.org/en/common-voice/) dataset. The same TTS model is used to generate MoshiRAG's training data, but with voices sampled from a different dataset. The original textual knowledge and ground-truth answers from HaluEval are preserved, while the `hallucinated_answer` field is removed for simplicity.

---

## Citations

If you use this dataset, please cite:

```bibtex
@misc{chien2026moshirag,
  title={MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models},
  author={Chung-Ming Chien and Manu Orsini and Eugene Kharitonov and Neil Zeghidour and Karen Livescu and Alexandre D{\'e}fossez},
  year={2026},
  eprint={2604.12928},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.12928},
}
```

## Acknowledgment & Licensing

This dataset is a derivative work. The audio files are licensed under CC BY-NC 4.0. The source text is derived from the HaluEval dataset, which is licensed under the MIT License (Copyright (c) 2020 RUCAIBox).
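As an illustration of the fields and the two evaluation setups described in this card, here is a minimal Python sketch. It assumes the Hugging Face `datasets` library for loading. The helper names (`load_test_split`, `build_text_context`, `contains_answer`) and the lenient containment metric are hypothetical and are not part of the dataset or of the authors' evaluation protocol, which the card does not specify.

```python
from typing import Optional


def load_test_split():
    """Load the single `test` split (1,000 examples) from the Hugging Face Hub."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("kyutai/HaluEvalAudio_1000", split="test")


def build_text_context(example: dict, oracle: bool) -> Optional[str]:
    """Pick the text context that accompanies the spoken question.

    Oracle-aided generation: pass the ground-truth `knowledge` field.
    End-to-end RAG: pass nothing, so the model must retrieve on its own.
    """
    return example["knowledge"] if oracle else None


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(s.lower().split())


def contains_answer(prediction: str, example: dict) -> bool:
    """Assumed lenient metric: the reference answer appears in the prediction."""
    return normalize(example["answer"]) in normalize(prediction)
```

A typical loop would call `load_test_split()`, feed each example's `audio` (plus the output of `build_text_context`) to the model under test, and score the model's text output with `contains_answer`. The exact decoded form of the `audio` field depends on the installed `datasets` version, so it is worth inspecting one example before wiring up a model.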

