common_language

Name: common_language
Creator: maas
Published: 2025-12-03 17:21:52
License: No description provided

ModelScope Community. Updated: 2025-12-03. Indexed: 2025-02-01.

Download link:

https://modelscope.cn/datasets/speechbrain/common_language


Resource description:

# Dataset Card for common_language

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://zenodo.org/record/5036977
- **Repository:** https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonLanguage
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

This dataset is composed of speech recordings from languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., about 1 hour of material for each language). The dataset has been extracted from CommonVoice to train language-id systems.
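The duration figures in this card can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the dataset tooling: it copies the per-split hours from the Data Splits table and the count of 45 entries from the language list, and it assumes audio is spread evenly across languages. The names `SPLIT_HOURS`, `total_hours` and `minutes_per_language` are made up for this illustration.

```python
# Figures copied from this card (Dataset Summary and Data Splits).
NUM_LANGUAGES = 45           # entries in the card's language list
TOTAL_HOURS_REPORTED = 45.1  # total duration stated in the summary
SPLIT_HOURS = {"train": 30.04, "dev": 7.53, "test": 7.53}


def total_hours(split_hours):
    """Sum the per-split durations, in hours."""
    return sum(split_hours.values())


def minutes_per_language(hours, num_languages=NUM_LANGUAGES):
    """Average minutes of audio per language, assuming an even split."""
    return hours * 60.0 / num_languages


if __name__ == "__main__":
    print(f"sum of splits: {total_hours(SPLIT_HOURS):.2f} h")
    for split, hours in SPLIT_HOURS.items():
        print(f"{split}: {minutes_per_language(hours):.1f} min per language")
```

Under these assumptions the three splits add up to the reported total, and the per-language averages land close to the "~40 / ~10 / ~10" minutes quoted in the splits table.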
### Supported Tasks and Leaderboards

The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

### Languages

List of included languages:

```
Arabic, Basque, Breton, Catalan, Chinese_China, Chinese_Hongkong, Chinese_Taiwan, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Hakha_Chin, Indonesian, Interlingua, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Maltese, Mongolian, Persian, Polish, Portuguese, Romanian, Romansh_Sursilvan, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Ukranian, Welsh
```

## Dataset Structure

### Data Instances

A typical data point comprises the `path` to the audio file and its label `language`. Additional fields include `age`, `client_id`, `gender` and `sentence`.

```python
{
  'client_id': 'itln_trn_sp_175',
  'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
  'audio': {'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
            'sampling_rate': 48000},
  'sentence': 'Con gli studenti è leggermente simile.',
  'age': 'not_defined',
  'gender': 'not_defined',
  'language': 22
}
```

### Data Fields

- `client_id` (`string`): An ID identifying the client (voice) that made the recording.
- `path` (`string`): The path to the audio file.
- `audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column, `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time.
  Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- `language` (`ClassLabel`): The language of the recording (see the `Languages` section above).
- `sentence` (`string`): The sentence the user was prompted to speak.
- `age` (`string`): The age of the speaker.
- `gender` (`string`): The gender of the speaker.

### Data Splits

The dataset is already balanced and split into train, dev (validation) and test sets.

| Name                           | Train  | Dev    | Test  |
|:------------------------------:|:------:|:------:|:-----:|
| **# of utterances**            | 177552 | 47104  | 47704 |
| **# unique speakers**          | 11189  | 1297   | 1322  |
| **Total duration, hr**         | 30.04  | 7.53   | 7.53  |
| **Min duration, sec**          | 0.86   | 0.98   | 0.89  |
| **Mean duration, sec**         | 4.87   | 4.61   | 4.55  |
| **Max duration, sec**          | 21.72  | 105.67 | 29.83 |
| **Duration per language, min** | ~40    | ~10    | ~10   |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

The Mongolian and Ukrainian languages are spelled as "Mangolian" and "Ukranian" in this version of the dataset.

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[Ganesh Sinisetty; Pavlo Ruban; Oleksandr Dymov; Mirco Ravanelli](https://zenodo.org/record/5036977#.YdTZ5hPMJ70)

### Licensing Information

[Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode)

### Citation Information

```
@dataset{ganesh_sinisetty_2021_5036977,
  author    = {Ganesh Sinisetty and Pavlo Ruban and Oleksandr Dymov and Mirco Ravanelli},
  title     = {CommonLanguage},
  month     = jun,
  year      = 2021,
  publisher = {Zenodo},
  version   = {0.1},
  doi       = {10.5281/zenodo.5036977},
  url       = {https://doi.org/10.5281/zenodo.5036977}
}
```

### Contributions

Thanks to [@anton-l](https://github.com/anton-l) for adding this dataset.
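Two points from the Data Instances and Data Fields sections can be illustrated without downloading anything. The sketch below is pure Python and makes several assumptions: the language names are copied from this card's list, class ids are assumed to be zero-based and to follow that order (which matches the Italian sample with `'language': 22`), and `ToyAudioDataset` is a hypothetical stand-in that only mimics lazy decoding. It is not the real `datasets` API.

```python
# Language names copied from the card's Languages list, in the order given
# there. Assumption: class ids are zero-based and follow this order. The
# Other Known Limitations section notes that the dataset itself spells
# some names differently.
LANGUAGES = (
    "Arabic, Basque, Breton, Catalan, Chinese_China, Chinese_Hongkong, "
    "Chinese_Taiwan, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, "
    "Estonian, French, Frisian, Georgian, German, Greek, Hakha_Chin, "
    "Indonesian, Interlingua, Italian, Japanese, Kabyle, Kinyarwanda, "
    "Kyrgyz, Latvian, Maltese, Mongolian, Persian, Polish, Portuguese, "
    "Romanian, Romansh_Sursilvan, Russian, Sakha, Slovenian, Spanish, "
    "Swedish, Tamil, Tatar, Turkish, Ukranian, Welsh"
).split(", ")


def language_name(label_id):
    """Map an integer `language` label to its name under the assumption above."""
    return LANGUAGES[label_id]


class ToyAudioDataset:
    """Hypothetical stand-in that decodes audio lazily and counts decodes."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0

    def _decode(self, path):
        # Placeholder for the expensive decode-and-resample step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested file is decoded.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every file is decoded before indexing.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


if __name__ == "__main__":
    print(language_name(22))

    toy = ToyAudioDataset(f"clip_{i}.wav" for i in range(100))
    toy[0]["audio"]
    print("decodes after row-first access:", toy.decode_calls)
    toy["audio"][0]
    print("decodes after column-first access:", toy.decode_calls)
```

In this toy model, row-first access decodes a single file while column-first access decodes all of them, which is the cost the card warns about when it recommends `dataset[0]["audio"]` over `dataset["audio"][0]`.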


Provider:

maas

Created:

2025-01-29
