fleurs

Name: fleurs
Creator: maas
Published: 2026-05-23 23:15:45
License: No description available

ModelScope Community: updated 2026-05-23, indexed 2025-03-01

Download link:

https://modelscope.cn/datasets/pengzhendong/fleurs

Resource description:

# FLEURS

## Dataset Description

- **Fine-Tuning script:** [pytorch/speech-recognition](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition)
- **Paper:** [FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech](https://arxiv.org/abs/2205.12446)
- **Total amount of disk used:** ca. 350 GB

FLEURS is the speech version of the [FLoRes machine translation benchmark](https://arxiv.org/abs/2106.03193).
We use 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages.

Training sets have around 10 hours of supervision. Speakers in the train sets are different from the speakers in the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) is averaged over all languages. Languages and results are also grouped into seven geographical areas:

- **Western Europe**: *Asturian, Bosnian, Catalan, Croatian, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Irish, Italian, Kabuverdianu, Luxembourgish, Maltese, Norwegian, Occitan, Portuguese, Spanish, Swedish, Welsh*
- **Eastern Europe**: *Armenian, Belarusian, Bulgarian, Czech, Estonian, Georgian, Latvian, Lithuanian, Macedonian, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Ukrainian*
- **Central-Asia/Middle-East/North-Africa**: *Arabic, Azerbaijani, Hebrew, Kazakh, Kyrgyz, Mongolian, Pashto, Persian, Sorani-Kurdish, Tajik, Turkish, Uzbek*
- **Sub-Saharan Africa**: *Afrikaans, Amharic, Fula, Ganda, Hausa, Igbo, Kamba, Lingala, Luo, Northern-Sotho, Nyanja, Oromo, Shona, Somali, Swahili, Umbundu, Wolof, Xhosa, Yoruba, Zulu*
- **South-Asia**: *Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi, Tamil, Telugu, Urdu*
- **South-East Asia**: *Burmese, Cebuano, Filipino, Indonesian, Javanese, Khmer, Lao, Malay, Maori, Thai, Vietnamese*
- **CJK languages**: *Cantonese and Mandarin Chinese, Japanese, Korean*

## How to use & Supported Tasks

### How to use

The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared on your local drive in one call to the `load_dataset` function.

For example, to download the Hindi config, specify the corresponding language config name (i.e., "hi_in" for Hindi):

```python
from datasets import load_dataset

fleurs = load_dataset("google/fleurs", "hi_in", split="train")
```

Using the `datasets` library, you can also stream the dataset on the fly by adding a `streaming=True` argument to the `load_dataset` function call. Streaming mode loads individual samples one at a time, rather than downloading the entire dataset to disk.

```python
from datasets import load_dataset

fleurs = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)

print(next(iter(fleurs)))
```

*Bonus*: create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed).

Local:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

fleurs = load_dataset("google/fleurs", "hi_in", split="train")
# draw random batches of 32 indices and keep the final, smaller batch
batch_sampler = BatchSampler(RandomSampler(fleurs), batch_size=32, drop_last=False)
dataloader = DataLoader(fleurs, batch_sampler=batch_sampler)
```

Streaming:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset instead of downloading everything first
fleurs = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
dataloader = DataLoader(fleurs, batch_size=32)
```

To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets).

### Example scripts

Train your own CTC or Seq2Seq Automatic Speech Recognition models on FLEURS with `transformers` - [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition).
Fine-tune your own Language Identification models on FLEURS with `transformers` - [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification)

### 1. Speech Recognition (ASR)

```py
from datasets import load_dataset

fleurs_asr = load_dataset("google/fleurs", "af_za")  # for Afrikaans
# to download all data for multi-lingual fine-tuning uncomment following line
# fleurs_asr = load_dataset("google/fleurs", "all")

# see structure
print(fleurs_asr)

# load audio sample on the fly
audio_input = fleurs_asr["train"][0]["audio"]  # first decoded audio sample
transcription = fleurs_asr["train"][0]["transcription"]  # first transcription
# use `audio_input` and `transcription` to fine-tune your model for ASR

# for analyses see language groups
all_language_groups = fleurs_asr["train"].features["lang_group_id"].names
lang_group_id = fleurs_asr["train"][0]["lang_group_id"]

all_language_groups[lang_group_id]
```

### 2. Language Identification

LangID can often be a domain classification task, but in the case of FLEURS-LangID, recordings are made in a similar setting across languages and the utterances correspond to n-way parallel sentences in the exact same domain, making this task particularly relevant for evaluating LangID. The setup is simple: FLEURS-LangID is split into train/valid/test for each language, and we create a single train/valid/test split for LangID by merging all languages.

```py
from datasets import load_dataset

fleurs_langID = load_dataset("google/fleurs", "all")  # to download all data

# see structure
print(fleurs_langID)

# load audio sample on the fly
audio_input = fleurs_langID["train"][0]["audio"]  # first decoded audio sample
language_class = fleurs_langID["train"][0]["lang_id"]  # first id class
language = fleurs_langID["train"].features["lang_id"].names[language_class]

# use audio_input and language_class to fine-tune your model for audio classification
```

### 3. Retrieval

Retrieval provides n-way parallel speech and text data.
Similar to how XTREME for text leverages Tatoeba to evaluate bitext mining (a.k.a. sentence translation retrieval), we use Retrieval to evaluate the quality of fixed-size representations of speech utterances. Our goal is to incentivize the creation of fixed-size speech encoders for speech retrieval. The system has to retrieve the English "key" utterance corresponding to the speech translation of "queries" in 15 languages. Results have to be reported on the test sets of Retrieval, whose utterances are used as queries (and as keys for English). We augment the English keys with a large number of utterances to make the task more difficult.

```py
from datasets import load_dataset

fleurs_retrieval = load_dataset("google/fleurs", "af_za")  # for Afrikaans
# to download all data for multi-lingual fine-tuning uncomment following line
# fleurs_retrieval = load_dataset("google/fleurs", "all")

# see structure
print(fleurs_retrieval)

# load audio sample on the fly
audio_input = fleurs_retrieval["train"][0]["audio"]  # decoded audio sample
text_sample_pos = fleurs_retrieval["train"][0]["transcription"]  # positive text sample
text_sample_neg = fleurs_retrieval["train"][1:20]["transcription"]  # negative text samples

# use `audio_input`, `text_sample_pos`, and `text_sample_neg` to fine-tune your model for retrieval
```

Users can leverage the training (and dev) sets of FLEURS-Retrieval with a ranking loss to build better cross-lingual fixed-size representations of speech.

## Dataset Structure

We show detailed information for the example configuration `af_za` of the dataset. All other configurations have the same structure.
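Because every configuration shares one structure, a loaded record can be sanity-checked before it is fed to a training loop. The following is a minimal sketch, not part of the dataset tooling: the helper names `check_record` and `duration_seconds` are hypothetical, the field names and types are copied from this card, and `language` is included on the assumption that it is always present, since it appears in the example instance but not in the field list. Depending on the installed `datasets` version, the decoded `audio` value may not be a plain `dict`, in which case the type check would need adjusting.

```python
# Expected field names and Python types, copied from this dataset card.
# `language` is an assumption: it appears in the example instance only.
EXPECTED_FIELDS = {
    "id": int,
    "num_samples": int,
    "path": str,
    "audio": dict,
    "raw_transcription": str,
    "transcription": str,
    "gender": int,
    "lang_id": int,
    "language": str,
    "lang_group_id": int,
}


def check_record(record):
    """Return a list of problems found; an empty list means the record looks fine."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems


def duration_seconds(record):
    """Clip length in seconds: number of samples divided by the sampling rate."""
    return record["num_samples"] / record["audio"]["sampling_rate"]
```

With a record loaded as in the snippets above, `check_record(fleurs_asr["train"][0])` would be expected to return an empty list, and `duration_seconds` gives the clip length, which is useful for filtering overly long utterances.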
### Data Instances

**af_za**

- Size of downloaded dataset files: 1.47 GB
- Size of the generated dataset: 1 MB
- Total amount of disk used: 1.47 GB

An example of a data instance of the config `af_za` looks as follows:

```
{'id': 91,
 'num_samples': 385920,
 'path': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/310a663d52322700b3d3473cbc5af429bd92a23f9bc683594e70bc31232db39e/home/vaxelrod/FLEURS/oss2_obfuscated/af_za/audio/train/17797742076841560615.wav',
 'audio': {'path': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/310a663d52322700b3d3473cbc5af429bd92a23f9bc683594e70bc31232db39e/home/vaxelrod/FLEURS/oss2_obfuscated/af_za/audio/train/17797742076841560615.wav',
  'array': array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -1.1205673e-04, -8.4638596e-05, -1.2731552e-04], dtype=float32),
  'sampling_rate': 16000},
 'raw_transcription': 'Dit is nog nie huidiglik bekend watter aantygings gemaak sal word of wat owerhede na die seun gelei het nie maar jeugmisdaad-verrigtinge het in die federale hof begin',
 'transcription': 'dit is nog nie huidiglik bekend watter aantygings gemaak sal word of wat owerhede na die seun gelei het nie maar jeugmisdaad-verrigtinge het in die federale hof begin',
 'gender': 0,
 'lang_id': 0,
 'language': 'Afrikaans',
 'lang_group_id': 3}
```

### Data Fields

The data fields are the same among all splits.
- **id** (int): ID of audio sample
- **num_samples** (int): Number of float values
- **path** (str): Path to the audio file
- **audio** (dict): Audio object including loaded audio array, sampling rate and path to audio
- **raw_transcription** (str): The non-normalized transcription of the audio file
- **transcription** (str): Transcription of the audio file
- **gender** (int): Class id of gender
- **lang_id** (int): Class id of language
- **lang_group_id** (int): Class id of language group

### Data Splits

Every config has a `"train"` split of *ca.* 1000 examples, plus `"validation"` and `"test"` splits of *ca.* 400 examples each.

## Dataset Creation

We collect between one and three recordings for each sentence (2.3 on average), and build new train-dev-test splits with 1509, 150 and 350 sentences for train, dev and test respectively.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is meant to encourage the development of speech technology in many more of the world's languages. One of the goals is to give everyone equal access to technologies like speech recognition or speech translation, meaning better dubbing or better access to content from the internet (like podcasts, streaming or videos).

### Discussion of Biases

Most datasets have a fair distribution of utterances across genders (e.g. the newly introduced FLEURS dataset). While many languages are covered from various regions of the world, the benchmark misses many languages that are all equally important. We believe technology built through FLEURS should generalize to all languages.

### Other Known Limitations

The dataset has a particular focus on read speech because common evaluation benchmarks like CoVoST-2 or LibriSpeech evaluate on this type of speech. There is sometimes a known mismatch between performance obtained in a read-speech setting and in a noisier setting (in production, for instance).
Given the big progress that remains to be made on many languages, we believe better performance on FLEURS should still correlate well with actual progress made for speech understanding.

## Additional Information

All datasets are licensed under the [Creative Commons license (CC-BY)](https://creativecommons.org/licenses/).

### Citation Information

You can access the FLEURS paper at https://arxiv.org/abs/2205.12446.
Please cite the paper when referencing the FLEURS corpus as:

```
@article{fleurs2022arxiv,
  title = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal = {arXiv preprint arXiv:2205.12446},
  url = {https://arxiv.org/abs/2205.12446},
  year = {2022},
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten) and [@aconneau](https://github.com/aconneau) for adding this dataset.
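### Appendix: sketch of an averaged unit error rate

The Dataset Description says that a "unit error rate" over characters is averaged across languages. The snippet below is one plausible reading of that sentence, not the paper's scoring code: it assumes the unit is a single character, that the per-language rate is total edit distance divided by total reference length, and that languages are averaged with equal weight. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unit_error_rate(pairs):
    """Character edits divided by reference characters over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0


def average_over_languages(pairs_by_language):
    """Unweighted mean of per-language error rates (assumes at least one language)."""
    rates = [unit_error_rate(pairs) for pairs in pairs_by_language.values()]
    return sum(rates) / len(rates)
```

Here `pairs_by_language` would map a config name such as `"af_za"` to a list of `(transcription, model_output)` string pairs. Averaging per language rather than pooling all characters keeps languages with more or longer utterances from dominating the score.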

Provider:

maas

Created:

2025-02-25
