soynade-research/kallaama-retrival-eval-queries

Name: soynade-research/kallaama-retrival-eval-queries
Creator: soynade-research
Published: 2026-02-09 10:45:50
License: No description available

Source: Hugging Face. Updated: 2026-02-09. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/soynade-research/kallaama-retrival-eval-queries

Resource description:

---
dataset_info:
  features:
  - name: _id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 101208309.0
    num_examples: 150
  download_size: 100802662
  dataset_size: 101208309.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

### Kallaama-Retrieval-Eval

**Kallaama-Retrieval-Eval** is a cross-lingual speech-to-text retrieval evaluation dataset designed to assess French document retrieval from Wolof speech queries in natural conversational settings. It is derived from the test split of the Kallaama corpus, a professionally transcribed dataset of spontaneous speech by native Wolof speakers.

Unlike Fleurs-Retrieval-Eval, which is based on read speech, Kallaama-Retrieval-Eval relies on naturally occurring spoken interactions. This makes it more representative of real-world usage scenarios, where users formulate queries spontaneously and with natural prosody.

From the Kallaama test split, the authors select the 150 longest speech segments that do not exceed 30 seconds, ensuring that the queries are information-rich while remaining compatible with many modern speech model input constraints. The corresponding Wolof transcriptions are translated into French, and this translation is used to generate synthetic French documents using a large language model. Three document types are produced for each query: dialogues, blog posts, and short stories.

The resulting dataset contains:

* **150 Wolof speech queries**,
* **445 French documents** forming the retrieval corpus,
* Multiple relevant documents per query, reflecting realistic retrieval scenarios where several documents may match the same information need.

Each example includes:

* A Wolof audio query,
* Its transcription,
* A French translation,
* Three associated synthetic French documents.

Kallaama-Retrieval-Eval is intended for benchmarking cross-lingual and cross-modal retrieval systems under natural speech conditions.
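The query selection rule described above (keep segments of at most 30 seconds, then take the 150 longest) can be sketched as follows. This is only an illustration under stated assumptions: the `Segment` class, its field names, and the toy durations are hypothetical and are not taken from the Kallaama corpus or from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical record; the real corpus schema may differ.
    segment_id: str
    duration_s: float


def select_longest_segments(segments, max_duration_s=30.0, n=150):
    """Drop segments longer than max_duration_s, then return the n longest."""
    eligible = [s for s in segments if s.duration_s <= max_duration_s]
    eligible.sort(key=lambda s: s.duration_s, reverse=True)
    return eligible[:n]


# Toy data (made up for illustration only).
toy = [
    Segment("a", 12.0),
    Segment("b", 31.5),  # exceeds the 30 s cap, so it is excluded
    Segment("c", 29.9),
    Segment("d", 4.2),
]
picked = select_longest_segments(toy, n=2)
print([s.segment_id for s in picked])  # ['c', 'a']
```

With the defaults (`max_duration_s=30.0`, `n=150`) the function mirrors the numbers quoted in the description; how ties between equally long segments were resolved is not stated in the source.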
Compared to Fleurs-Retrieval-Eval, it features higher speech quality, greater fluency, and higher volume, which leads to more reliable semantic representations and improved retrieval performance. The dataset is primarily evaluated using ranking-based metrics such as nDCG@k and supports research on speech-driven information access for low-resource and primarily oral languages.

### Kallaama-Retrieval-Eval-Queries

This repository contains the queries. The associated documents are here: [soynade-research/kallaama-retrival-eval-corpus](https://huggingface.co/datasets/soynade-research/kallaama-retrival-eval-corpus). The mapping between queries and documents is in this repository: [soynade-research/kallaama-retrival-eval-qrels](https://huggingface.co/datasets/soynade-research/kallaama-retrival-eval-qrels/)
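Because each query can have several relevant documents, a ranking metric such as nDCG@k is a natural fit. The snippet below is a minimal sketch of nDCG@k with linear gains, assuming binary relevance labels; the document identifiers and the layout of the `relevance` dictionary are hypothetical and do not reflect the actual qrels schema, which should be checked in the qrels repository.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2))."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order returned by the retriever.
    relevance: dict mapping relevant doc id -> gain (assumed to be 1 here).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with three relevant documents (ids are made up).
relevance = {"d1": 1, "d2": 1, "d3": 1}
ranking = ["d1", "x", "d2", "d3"]  # "x" is a non-relevant document
score = ndcg_at_k(ranking, relevance, k=3)
print(round(score, 4))  # 0.7039
```

Averaging this per-query score over all 150 queries gives a corpus-level figure. For reported results, an established evaluation library is preferable to a hand-rolled function, since gain and discount conventions vary between implementations.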

Provider:

soynade-research
