kyutai/HaluEvalAudio_1000
Hugging Face | Updated 2026-04-15 | Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/kyutai/HaluEvalAudio_1000
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: knowledge
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: test
    num_bytes: 799855759
    num_examples: 1000
  download_size: 672236428
  dataset_size: 799855759
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
license:
- cc-by-nc-4.0
- mit
task_categories:
- audio-text-to-text
language:
- en
size_categories:
- 1K<n<10K
pretty_name: 'HaluEvalAudio 1000'
---
# HaluEvalAudio 1000 Dataset
<p align="center">
<img src="haluevalaudio_1000.png" width="500" alt="HaluEvalAudio 1000 Logo">
</p>
## Dataset Description
**HaluEvalAudio 1000** is a specialized speech-based question-answering dataset designed to benchmark the capabilities of **general multimodal & audio-focused language models** as well as **retrieval-augmented audio language models**.
Compared to common QA benchmarks such as Llama Questions, Web Questions, or TriviaQA, HaluEvalAudio 1000 introduces more challenging questions and topics and is specifically structured for **Retrieval-Augmented Generation (RAG) evaluation**.
The dataset is derived from the [HaluEval dataset](https://aclanthology.org/2023.emnlp-main.397/). A key feature is that it provides **ground-truth references** in text format.
This enables two complementary evaluation setups for RAG models: (1) End-to-End RAG, which uses the model's internal retrieval pipeline, and (2) Oracle-Aided Generation, which supplies the ground-truth context as an ablation so that researchers can isolate retrieval quality from downstream generative performance.
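
The two setups differ only in where the context comes from. The following is a minimal sketch at the text level, not the authors' evaluation code: `select_context` is an illustrative helper and `retrieve` is a hypothetical stand-in for whatever retrieval pipeline the evaluated system uses.

```python
from typing import Callable, Optional


def select_context(
    example: dict,
    mode: str,
    retrieve: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the context handed to the generator for one dataset entry.

    mode="oracle":     use the ground-truth `knowledge` field.
    mode="end_to_end": call the system's own retriever (hypothetical `retrieve`).
    """
    if mode == "oracle":
        return example["knowledge"]
    if mode == "end_to_end":
        if retrieve is None:
            raise ValueError("end_to_end mode needs a retrieve callable")
        # Assumption: the retriever is queried with the question transcript.
        # A speech model would typically derive its query from the audio instead.
        return retrieve(example["text"])
    raise ValueError(f"unknown mode: {mode!r}")
```

Running the same entries through both modes and comparing answer quality gives a rough view of how much of the error comes from retrieval rather than generation.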
---
## Dataset Summary
* **Source:** The `qa` subset of the HaluEval dataset.
* **Total instances:** 1,000 WAV audio files synthesized with Kyutai's TTS model, with paired reference (knowledge) and ground-truth answers provided in textual format.
---
## Data Format
Each entry in the dataset contains the following fields:
- `audio`: The synthesized WAV file containing the spoken question.
- `text`: The text transcription of the `audio` question.
- `knowledge`: The ground-truth textual knowledge from the original HaluEval dataset.
- `answer`: The ground-truth textual answer from the original HaluEval dataset.
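
As a rough sketch of how these fields can be accessed, the snippet below uses the Hugging Face `datasets` library. The helper names `load_test_split` and `text_fields` are illustrative, and how the `audio` column is decoded depends on the installed `datasets` version and audio backend, so the sketch only touches the string fields.

```python
TEXT_FIELDS = ("text", "knowledge", "answer")


def text_fields(entry: dict) -> dict:
    """Return only the string fields of one entry, leaving `audio` untouched."""
    return {name: entry[name] for name in TEXT_FIELDS}


def load_test_split():
    """Download the single `test` split (roughly 672 MB) from the Hugging Face Hub."""
    # Imported lazily so `text_fields` can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("kyutai/HaluEvalAudio_1000", split="test")
```

Calling `load_test_split()` triggers the download. To inspect the text without decoding any audio, one option is `text_fields(load_test_split().remove_columns("audio")[0])`.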
---
## Dataset Construction
We extract the first 1,000 instances from the `qa` subset of the HaluEval dataset.
We convert the textual questions into audio with Kyutai's multistream TTS model, using speaker voices randomly sampled from the [Common Voice](https://www.mozillafoundation.org/en/common-voice/) dataset. The same TTS model was used to generate MoshiRAG's training data, though with voices sampled from a different dataset.
The original textual knowledge and ground-truth answers from HaluEval are preserved, while the `hallucinated_answer` field is removed for simplicity.
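
The construction steps can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the source field names `question` and `right_answer`, the `synthesize` callable standing in for the TTS model, and the fixed `seed` are all assumptions.

```python
import random
from typing import Callable, Iterable


def build_entries(
    qa_records: Iterable[dict],
    voices: list,
    synthesize: Callable[[str, str], bytes],
    limit: int = 1000,
    seed: int = 0,
) -> list:
    """Turn the first `limit` QA records into audio/text/knowledge/answer entries."""
    rng = random.Random(seed)
    entries = []
    for index, record in enumerate(qa_records):
        if index >= limit:
            break
        # One randomly sampled speaker voice per question.
        voice = rng.choice(voices)
        entries.append(
            {
                # Hypothetical TTS call: (text, voice) -> audio bytes.
                "audio": synthesize(record["question"], voice),
                "text": record["question"],
                "knowledge": record["knowledge"],
                # Assumed source field name for the ground-truth answer.
                "answer": record["right_answer"],
                # `hallucinated_answer` is dropped simply by not copying it.
            }
        )
    return entries
```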
---
## Citations
If you use this dataset, please cite:
```bibtex
@misc{chien2026moshirag,
title={MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models},
author={Chung-Ming Chien and Manu Orsini and Eugene Kharitonov and Neil Zeghidour and Karen Livescu and Alexandre D{\'e}fossez},
year={2026},
eprint={2604.12928},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.12928},
}
```
## Acknowledgment & Licensing
This dataset is a derivative work.
Audio files are licensed under CC BY-NC 4.0.
Source text is derived from the HaluEval dataset, which is licensed under the MIT License (Copyright (c) 2020 RUCAIBox).
Provider:
kyutai



