sib-fleurs

Hugging Face · Updated 2024-12-12 · Indexed 2024-12-13

Download link:

https://huggingface.co/datasets/WueNLP/sib-fleurs

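Overview:

SIB-Fleurs is a multilingual speech and text dataset that supports many languages and covers tasks such as audio classification, automatic speech recognition, text-to-speech, and question answering. It contains multiple configurations, each with a detailed feature description: sentence, URL, ID, domain, topic, whether the source has an image or hyperlink, transcription, audio samples, and more. Each configuration is divided into training, validation, and test splits, with a size and example count listed for each split.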

Created:

2024-12-04

Summary of original information

SIB-Fleurs dataset overview

Basic information

License: CC BY-SA 4.0
Languages: many, including ace, acm, acq, aeb, af, ajp, ak, als, am, apc, ar, ars, ary, arz, as, ast, awa, ayr, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fj, fi, fon, fr, fur, fuv, gaz, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kk, kbp, kea, khk, km, ki, rw, ky, kmb, kmr, knc, kg, ko, lo, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, lvs, mag, mai, ml, mar, min, mk, mt, mni, mos, mi, my, nl, nn, nb, npi, nqo, nso, nus, ny, oc, ory, pag, pa, pap, pbt, pes, plt, pl, pt, prs, quy, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, sc, sr, ss, su, sv, swh, szl, ta, taq, tt, te, tg, tl, th, ti, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uzn, vec, vi, war, wo, xh, ydd, yo, yue, zh, zsm, zu, multilingual.
Annotation creators: found
Language creators: expert-generated
Multilinguality: multilingual

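Task categories

Audio classification
Automatic speech recognition
Audio-text-to-text
Text-to-speech
Question answering
Document question answering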

Dataset configurations

Configuration name: afr_Latn

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 406 examples, 524232877 bytes
- validation: 86 examples, 76384271 bytes
- test: 95 examples, 84400076 bytes
Download size: 673661100 bytes
Dataset size: 685017224 bytes
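
As a quick sanity check on the figures above, the three split sizes listed for afr_Latn add up to the reported dataset size. The sketch below reproduces that arithmetic and the share of examples in each split. The dictionary layout and the helper name `split_shares` are illustrative assumptions, not part of the dataset.

```python
# Split statistics for the afr_Latn configuration, copied from the listing above.
AFR_LATN_SPLITS = {
    "train": {"examples": 406, "bytes": 524232877},
    "validation": {"examples": 86, "bytes": 76384271},
    "test": {"examples": 95, "bytes": 84400076},
}
REPORTED_DATASET_SIZE = 685017224  # bytes, as listed for afr_Latn


def split_shares(splits):
    """Return each split's fraction of the total example count (hypothetical helper)."""
    total = sum(info["examples"] for info in splits.values())
    return {name: info["examples"] / total for name, info in splits.items()}


total_bytes = sum(info["bytes"] for info in AFR_LATN_SPLITS.values())
total_examples = sum(info["examples"] for info in AFR_LATN_SPLITS.values())
shares = split_shares(AFR_LATN_SPLITS)

print(total_bytes == REPORTED_DATASET_SIZE)  # True
print(total_examples)  # 587
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The same check can be repeated for any other configuration by swapping in its listed numbers.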

Configuration name: amh_Ethi

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 752 examples, 1289823377 bytes
- validation: 54 examples, 65389982 bytes
- test: 149 examples, 185857834 bytes
Download size: 1525564166 bytes
Dataset size: 1541071193 bytes
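
The `num_samples` field and the 16000 Hz sampling rate of the `audio` feature together give the length of each recording. The sketch below assumes that `num_samples` holds one audio sample count per recording, which the listing does not state explicitly. The helper name and the input values are hypothetical.

```python
SAMPLING_RATE_HZ = 16000  # from the `audio` feature description


def durations_seconds(num_samples, sampling_rate=SAMPLING_RATE_HZ):
    """Convert per-recording sample counts to durations in seconds.

    Assumes each entry of `num_samples` is the number of audio samples
    in one recording (an assumption, not confirmed by the listing).
    """
    return [n / sampling_rate for n in num_samples]


# Made-up values for one example with two recordings.
example_num_samples = [160000, 200000]
print(durations_seconds(example_num_samples))  # [10.0, 12.5]
```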

Configuration name: arb_Arab

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 579 examples, 646819902 bytes
- validation: 64 examples, 95091075 bytes
- test: 133 examples, 144786307 bytes
Download size: 878867591 bytes
Dataset size: 886697284 bytes
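
The `category` feature is a class label with seven names, so it is typically stored as an integer. The sketch below maps between integers and names, assuming the integer IDs follow the order in which the names are listed above. The two helper functions are hypothetical.

```python
# Category names in the order given by the `category` feature description.
# Assumption: label ID i corresponds to the i-th name in this list.
CATEGORY_NAMES = [
    "science/technology",
    "travel",
    "politics",
    "sports",
    "health",
    "entertainment",
    "geography",
]


def category_name(label_id):
    """Map an integer label ID to its category name."""
    return CATEGORY_NAMES[label_id]


def category_id(name):
    """Map a category name back to its integer label ID."""
    return CATEGORY_NAMES.index(name)


print(category_name(0))          # science/technology
print(category_id("geography"))  # 6
```

When the data is loaded with the Hugging Face `datasets` library, the `ClassLabel` feature provides `int2str` and `str2int` for the same conversion.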

Configuration name: asm_Beng

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 730 examples, 1235366957 bytes
- validation: 71 examples, 158536549 bytes
- test: 176 examples, 400145792 bytes
Download size: 1782426273 bytes
Dataset size: 1794049298 bytes
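
Several features are sequences (`filename`, `speaker_id`, `gender`, the ASR outputs and their error rates, `audio`), which suggests that one sentence can have several recordings. The sketch below assumes these sequences are parallel, with one entry per recording at the same index, and picks the recording with the lowest Whisper character error rate. Both the alignment assumption and the helper `best_recording` are illustrative. The example values are made up.

```python
def best_recording(example, metric="whisper_asr_cer"):
    """Pick the recording with the lowest error rate under `metric`.

    Assumes the sequence fields of `example` are aligned by index,
    one entry per recording (not confirmed by the listing).
    """
    scores = example[metric]
    if not scores:
        return None  # no recordings for this sentence
    best = min(range(len(scores)), key=lambda i: scores[i])
    return {
        "filename": example["filename"][best],
        "speaker_id": example["speaker_id"][best],
        "gender": example["gender"][best],
        metric: scores[best],
    }


# Made-up example with two recordings of the same sentence.
example = {
    "filename": ["a.wav", "b.wav"],
    "speaker_id": [3, 7],
    "gender": ["FEMALE", "MALE"],
    "whisper_asr_cer": [0.12, 0.05],
}
print(best_recording(example)["filename"])  # b.wav
```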

Configuration name: ast_Latn

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 701 examples, 866679990 bytes
- validation: 69 examples, 102384453 bytes
- test: 177 examples, 282753773 bytes
Download size: 1245085728 bytes
Dataset size: 1251818216 bytes

Configuration name: azj_Latn

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 712 examples, 1090899299 bytes
- validation: 71 examples, 147617247 bytes
- test: 174 examples, 379234055 bytes
Download size: 1602247163 bytes
Dataset size: 1617750601 bytes

Configuration name: bel_Cyrl

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 690 examples, 1105817781 bytes
- validation: 71 examples, 186825266 bytes
- test: 177 examples, 486320479 bytes
Download size: 1753989008 bytes
Dataset size: 1778963526 bytes

Configuration name: ben_Beng

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64
- speaker_id: sequence of int64
- gender: sequence of string
- whisper_asr: sequence of string
- whisper_asr_cer: sequence of float64
- whisper_asr_wer: sequence of float64
- whisper_asr_translation: sequence of string
- seamlessm4t_asr: sequence of string
- seamlessm4t_asr_cer: sequence of float64
- seamlessm4t_asr_wer: sequence of float64
- seamlessm4t_asr_translation: sequence of string
- index_id: int64
- category: class_label (names: science/technology, travel, politics, sports, health, entertainment, geography)
- text: string
- audio: sequence of audio (sampling_rate: 16000)
Splits:
- train: 742 examples, 1232070743 bytes
- validation: 71 examples, 157285034 bytes
- test: 176 examples, 397951833 bytes
Download size: 1782546384 bytes
Dataset size: 1787307610 bytes

Configuration name: bos_Latn

Features:
- sentence: string
- URL: string
- id: int32
- domain: string
- topic: string
- has_image: int32
- has_hyperlink: int32
- fleurs_id: int32
- filename: sequence of string
- raw_transcription: string
- transcription: string
- num_samples: sequence of int64

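Collected summary

Dataset introduction

Construction

SIB-Fleurs is built on multilingual speech data and contains speech samples in many languages. The dataset metadata marks the language data as expert-generated, which this summary credits for the accuracy of the transcriptions. Each example includes audio files, transcriptions, speech recognition outputs, and the corresponding error rates, and is intended to provide training and evaluation data for tasks such as speech recognition and speech classification.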

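Characteristics

The most notable characteristic of SIB-Fleurs is its broad language coverage, with speech data in more than 100 languages. It also provides detailed annotations, including speech recognition outputs from Whisper and SeamlessM4T, character and word error rates, and translations, so researchers can analyze the performance of speech recognition systems in depth. The examples span seven topic categories, such as science/technology, travel, and politics, which adds to the dataset's diversity for practical use.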
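
For readers unfamiliar with the error-rate fields (`whisper_asr_wer`, `whisper_asr_cer`, and their SeamlessM4T counterparts), the sketch below shows the standard definitions of word error rate and character error rate as edit distance divided by reference length. How the dataset's own values were computed, including any text normalization, is not stated in this listing, so this is an illustration of the general metric only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the cat sat"))  # 0.0
print(edit_distance("kitten", "sitting"))  # 3
```

Both rates can exceed 1.0 when the hypothesis contains many insertions relative to the reference.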

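Usage

SIB-Fleurs is suited to a range of speech processing tasks, including speech recognition, speech classification, and text-to-speech. Users load the configuration for a particular language to obtain its speech data, then use the provided features for model training and evaluation. The structure of the dataset makes multilingual speech research straightforward, and it is particularly promising for cross-lingual speech recognition and translation.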
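
A minimal sketch of loading one language configuration with the Hugging Face `datasets` library, assuming the repository name from the download link and the configuration and split names listed earlier. `load_config` and `text_label_pairs` are hypothetical helpers. The first needs network access and the `datasets` package, while the second is shown on made-up records so it runs offline.

```python
def load_config(config_name="afr_Latn", split="test"):
    """Load one language configuration from the Hugging Face Hub.

    Requires network access and the third-party `datasets` package.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset("WueNLP/sib-fleurs", config_name, split=split)


def text_label_pairs(examples):
    """Collect (text, category) pairs, e.g. for topic classification."""
    return [(ex["text"], ex["category"]) for ex in examples]


# Offline illustration with made-up records that mimic two of the listed fields.
fake_examples = [
    {"text": "A sentence about football.", "category": 3},
    {"text": "A sentence about vaccines.", "category": 4},
]
print(text_label_pairs(fake_examples))
```

With a real split, `text_label_pairs(load_config("afr_Latn", "test"))` would iterate over the loaded examples in the same way. Since iterating also decodes the audio, it may be worth restricting the columns first, for example with `Dataset.select_columns`.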

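Background and challenges

Background overview

SIB-Fleurs is a multilingual speech dataset with speech samples in many languages, intended to support tasks such as speech recognition, speech classification, and text-to-speech. Its language data is marked as expert-generated, and it includes rich speech features and metadata such as transcriptions, audio files, and speaker information. The listing gives a creation date of 2024-12-04. Its language coverage and range of tasks make it a valuable research resource for speech processing. The listing does not name the researchers or institutions behind the dataset beyond the WueNLP account that hosts it on Hugging Face.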

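Current challenges

The main challenges for SIB-Fleurs stem from the complexity of multilingual speech recognition: differences in phonetic features and pronunciation between languages can lead to inconsistent model performance on cross-lingual tasks. Ensuring the quality and diversity of the speech samples during construction, and handling annotation and transcription across different languages, are further technical difficulties. Finally, the scale and diversity of the dataset require models with strong generalization ability to cope with the range of languages and tasks.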

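Common scenarios

Classic use cases

SIB-Fleurs shows strong potential for multilingual speech recognition. It contains speech data in many languages and supports automatic speech recognition (ASR), speech classification, text-to-speech (TTS), and related tasks. A classic use case is building a multilingual speech recognition model that is trained to transcribe speech in different languages, providing technical support for cross-lingual communication and information processing.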

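Practical applications

In practice, SIB-Fleurs is relevant to multilingual voice assistants, cross-lingual translation systems, and spoken document processing. For example, speech recognition models trained or evaluated on data of this kind could transcribe speakers of different languages in a multinational company's meetings in real time, improving meeting efficiency and the accuracy of information transfer. In education, the dataset can likewise support multilingual learning tools.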

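Derived and related work

According to this summary, SIB-Fleurs has served as a basis for related research, although no specific studies are named. Researchers have used it to develop multilingual speech recognition models and improve their performance across language settings. It has also encouraged research on multilingual speech synthesis and its use in multilingual environments. Such work broadens speech processing research and provides a technical basis for practical applications.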

The content above was collected and summarized by 遇见数据集.
