soynade-research/kallaama-retrival-eval-qrels

Hugging Face, updated 2026-02-09, indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/soynade-research/kallaama-retrival-eval-qrels
Description:
---
dataset_info:
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  splits:
  - name: train
    num_bytes: 36000
    num_examples: 450
  download_size: 24391
  dataset_size: 36000
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

### Kallaama-Retrieval-Eval

**Kallaama-Retrieval-Eval** is a cross-lingual speech-to-text retrieval evaluation dataset designed to assess French document retrieval from Wolof speech queries in natural conversational settings. It is derived from the test split of the Kallaama corpus, a professionally transcribed dataset of spontaneous speech by native Wolof speakers.

Unlike Fleurs-Retrieval-Eval, which is based on read speech, Kallaama-Retrieval-Eval relies on naturally occurring spoken interactions. This makes it more representative of real-world usage scenarios, where users formulate queries spontaneously and with natural prosody.

From the Kallaama test split, the authors select the 150 longest speech segments that do not exceed 30 seconds, ensuring that the queries are information-rich while remaining compatible with many modern speech model input constraints. The corresponding Wolof transcriptions are translated into French, and this translation is used to generate synthetic French documents using a large language model. Three document types are produced for each query: dialogues, blog posts, and short stories.

The resulting dataset contains:

* **150 Wolof speech queries**,
* **445 French documents** forming the retrieval corpus,
* Multiple relevant documents per query, reflecting realistic retrieval scenarios where several documents may match the same information need.

Each example includes:

* A Wolof audio query,
* Its transcription,
* A French translation,
* Three associated synthetic French documents.

Kallaama-Retrieval-Eval is intended for benchmarking cross-lingual and cross-modal retrieval systems under natural speech conditions. Compared to Fleurs-Retrieval-Eval, it features higher speech quality, greater fluency, and higher volume, which leads to more reliable semantic representations and improved retrieval performance.

The dataset is primarily evaluated using ranking-based metrics such as nDCG@k and supports research on speech-driven information access for low-resource and primarily oral languages.

### Kallaama-Retrieval-Eval-Qrels

This repository contains the mapping between queries and the documents to retrieve for them.

The associated queries are here: [soynade-research/kallaama-retrival-eval-queries](https://huggingface.co/datasets/soynade-research/kallaama-retrival-eval-queries).

The documents are in this repository: [soynade-research/kallaama-retrival-eval-corpus](https://huggingface.co/datasets/soynade-research/kallaama-retrival-eval-corpus/)
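
As a rough illustration of how the qrels mapping can feed a ranking metric such as nDCG@k, here is a minimal sketch in Python. It assumes binary relevance (the qrels expose only `query-id` and `corpus-id`, with no score column, so every listed pair is treated as relevance 1) and that each ranked list contains no duplicate IDs. The IDs `q1`, `d1`, and so on, the helper names, and the `run` dictionary are hypothetical placeholders, not values or code from the dataset.

```python
import math
from collections import defaultdict


def build_qrels(rows):
    """Group rows with 'query-id' and 'corpus-id' keys into {query_id: set of relevant doc ids}."""
    qrels = defaultdict(set)
    for row in rows:
        qrels[row["query-id"]].add(row["corpus-id"])
    return dict(qrels)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query (assumes ranked_ids has no duplicates)."""
    # Discounted gain of the relevant documents found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents placed first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mean_ndcg(run, qrels, k=10):
    """Average nDCG@k over all queries in qrels; a query missing from run scores 0."""
    scores = [ndcg_at_k(run.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy qrels rows with hypothetical IDs, shaped like the two string columns of this repository.
rows = [
    {"query-id": "q1", "corpus-id": "d1"},
    {"query-id": "q1", "corpus-id": "d2"},
    {"query-id": "q1", "corpus-id": "d3"},
    {"query-id": "q2", "corpus-id": "d4"},
]
qrels = build_qrels(rows)

# Hypothetical system output: ranked document IDs per query.
run = {"q1": ["d1", "d2", "d3"], "q2": ["x", "d4"]}
print(mean_ndcg(run, qrels, k=10))
```

With the `datasets` library installed, the real rows could presumably be obtained with `load_dataset("soynade-research/kallaama-retrival-eval-qrels", split="train")` and passed to `build_qrels` in place of the toy `rows`; the ranked lists in `run` would come from whatever retrieval system is being evaluated.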

Provider:
soynade-research