masuidrive/cv-corpus-1.0-en-client_id-grouped
Source: Hugging Face. Updated 2024-04-20; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/masuidrive/cv-corpus-1.0-en-client_id-grouped
Resource description:
---
language:
- en
license: cc0-1.0
tags:
- audio
- speaker diarization
source_datasets:
- commonvoice
task_categories:
- automatic-speech-recognition
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
---
# cv-corpus-1.0-en-client_id-grouped
This dataset is a subset of the Common Voice dataset, filtered and grouped based on the client ID (treated as speaker ID).
## Dataset Details
- The dataset is derived from the Common Voice dataset.
- The original dataset is available at [Common Voice Dataset](https://commonvoice.mozilla.org/en/datasets).
- The dataset is grouped by client ID, which is treated as the speaker ID for this dataset.
- Only client IDs with at least 60 and at most 300 samples are kept.
- The dataset is split into train and validation sets for each client ID group, with a ratio of 8:2.
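- Because the split is made within each client ID group, the same client IDs appear in both the train and validation sets, so the validation set contains no unseen speakers.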
- The dataset is split into batches of 1000 samples and saved as Parquet files.
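
The steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script that produced the dataset: the function names, the `path` field in the usage note, the shuffling with a fixed seed, and the reading of the 60 to 300 rule as "drop client IDs outside the range" (rather than capping them) are all assumptions. Writing the batches to Parquet is left out because it needs a third-party library.

```python
import random
from collections import defaultdict

# Thresholds taken from the dataset card.
MIN_SAMPLES = 60
MAX_SAMPLES = 300
BATCH_SIZE = 1000


def group_by_client(rows):
    """Group sample dicts by their client_id (treated as the speaker ID)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["client_id"]].append(row)
    return groups


def filter_groups(groups, lo=MIN_SAMPLES, hi=MAX_SAMPLES):
    """Keep only client IDs whose sample count lies in [lo, hi] (assumed reading)."""
    return {cid: g for cid, g in groups.items() if lo <= len(g) <= hi}


def split_groups(groups, seed=0):
    """Split each client's samples 8:2, so every client is in both splits."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    train, validation = [], []
    for cid in sorted(groups):
        samples = list(groups[cid])
        rng.shuffle(samples)
        cut = len(samples) * 8 // 10  # integer arithmetic avoids float rounding
        train.extend(samples[:cut])
        validation.extend(samples[cut:])
    return train, validation


def batches(rows, size=BATCH_SIZE):
    """Yield consecutive chunks of at most `size` rows, one per output file."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

With rows such as `{"client_id": "abc", "path": "clip.mp3"}` (the `path` field is hypothetical), `split_groups(filter_groups(group_by_client(rows)))` returns the train and validation lists, and each chunk from `batches(train)` would correspond to one Parquet file.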
## Dataset Statistics
- Filtered client_id count: 1,505
- Filtered total entry count: 203,264
- Original total entry count: 490,483
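
As a quick sanity check on these figures, the short snippet below derives two numbers the card does not state directly: the share of original entries that survive filtering and the mean number of samples per retained client ID. The variable names are illustrative only.

```python
# Counts copied from the statistics above.
FILTERED_CLIENTS = 1_505
FILTERED_ENTRIES = 203_264
ORIGINAL_ENTRIES = 490_483

# Share of the original entries kept after filtering.
retained_fraction = FILTERED_ENTRIES / ORIGINAL_ENTRIES

# Average number of samples per retained client ID.
mean_per_client = FILTERED_ENTRIES / FILTERED_CLIENTS

print(f"retained: {retained_fraction:.1%}")               # retained: 41.4%
print(f"mean samples per client: {mean_per_client:.1f}")  # mean samples per client: 135.1
```

The mean of about 135 samples per client falls inside the 60 to 300 range used for filtering, as expected.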
## Sample Duration Distribution

The histogram shows the distribution of sample durations in the dataset.
## License
The Common Voice dataset is licensed under the Creative Commons Zero (CC0) license.
Provider:
masuidrive
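Original information summary
Dataset overview
Source and processing
- Source: this dataset is a subset of the Common Voice dataset.
- Processing: the dataset is filtered and grouped by client ID (treated as the speaker ID).
Dataset structure
- Grouping criterion: each client ID group contains at least 60 and at most 300 samples.
- Data split: the dataset is split into training and validation sets per client ID group, at a ratio of 8:2.
- Sample storage: the dataset is saved as Parquet files in batches of 1,000 samples.
Dataset statistics
- Filtered client ID count: 1,505
- Filtered total sample count: 203,264
- Original total sample count: 490,483
Sample duration distribution
- Distribution plot: a histogram of sample durations is provided.
License
- License type: Creative Commons Zero (CC0)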



