CrowdSpeech

Name: CrowdSpeech
Creator: maas
Published: 2025-12-05 16:50:34
License: No description available

ModelScope community; updated 2025-12-05, indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/toloka/CrowdSpeech

Description:

# Dataset Card for CrowdSpeech

## Dataset Description

- **Repository:** [GitHub](https://github.com/Toloka/CrowdSpeech)
- **Paper:** [Paper](https://openreview.net/forum?id=3_hgF1NAXU7)
- **Point of Contact:** research@toloka.ai

### Dataset Summary

CrowdSpeech is the first publicly available large-scale dataset of crowdsourced audio transcriptions. The dataset was constructed by annotating [LibriSpeech](https://www.openslr.org/12) on the [Toloka crowdsourcing platform](https://toloka.ai). CrowdSpeech consists of 22K instances with around 155K annotations obtained from crowd workers.

### Supported Tasks and Leaderboards

Aggregation of crowd transcriptions.

### Languages

English

## Dataset Structure

### Data Instances

A data instance contains a URL to the audio recording, a list of transcriptions along with the corresponding performers' identifiers, and the ground truth. For each data instance, seven crowdsourced transcriptions are provided.

```
{'task': 'https://tlk.s3.yandex.net/annotation_tasks/librispeech/train-clean/0.mp3',
 'transcriptions': "had laid before her a pair of alternatives now of course you're completely your own mistress and are as free as the bird on the bough i don't mean you were not so before but you're at present on a different footing | had laid before her a pair of alternatives now of course you are completely your own mistress and are as free as the bird on the bowl i don't mean you were not so before but you were present on a different footing | had laid before her a pair of alternatives now of course you're completely your own mistress and are as free as the bird on the bow i don't mean you are not so before but you're at present on a different footing | had laid before her a pair of alternatives now of course you're completely your own mistress and are as free as the bird on the bow i don't mean you are not so before but you're at present on a different footing | laid before her a pair of alternativesnow of course you're completely your own mistress and are as free as the bird on the bow i don't mean you're not so before but you're at present on a different footing | had laid before her a peril alternatives now of course your completely your own mistress and as free as a bird as the back bowl i don't mean you were not so before but you are present on a different footing | a lady before her a pair of alternatives now of course you're completely your own mistress and rs free as the bird on the ball i don't need you or not so before but you're at present on a different footing",
 'performers': '1154 | 3449 | 3097 | 461 | 3519 | 920 | 3660',
 'gt': "had laid before her a pair of alternatives now of course you're completely your own mistress and are as free as the bird on the bough i don't mean you were not so before but you're at present on a different footing"}
```

### Data Fields

* task: a string containing the URL of the audio recording
* transcriptions: the crowdsourced transcriptions, stored as a single string and separated by '|'
* performers: the corresponding performers' identifiers, in the same order as the transcriptions and separated by '|'
* gt: the ground truth transcription

### Data Splits

There are five splits in the data: train, test, test.other, dev.clean and dev.other. Splits train, test and dev.clean correspond to the *clean* part of LibriSpeech, which contains higher-quality audio recordings with speaker accents closer to US English. Splits dev.other and test.other correspond to the *other* part of LibriSpeech, whose recordings are more challenging for recognition. The audio recordings are gender-balanced.

## Dataset Creation

### Source Data

[LibriSpeech](https://www.openslr.org/12) is a corpus of approximately 1000 hours of 16kHz read English speech.

### Annotations

Annotation was done on the [Toloka crowdsourcing platform](https://toloka.ai) with an overlap of 7 (that is, each task was performed by 7 annotators). Only annotators who self-reported knowledge of English had access to the annotation task. Additionally, annotators had to pass an *Entrance Exam*. For this, we ask all incoming eligible workers to annotate ten audio recordings. We then compute our target metric, Word Error Rate (WER), on these recordings and accept to the main task all workers who achieve a WER of 40% or less (the smaller the value of the metric, the higher the quality of annotation).

The Toloka crowdsourcing platform associates workers with unique identifiers and returns these identifiers to the requester. To further protect the data, we additionally encode each identifier with an integer that is eventually reported in our released datasets.

See more details in the [paper](https://arxiv.org/pdf/2107.01091.pdf).

### Citation Information

```
@inproceedings{CrowdSpeech,
  author = {Pavlichenko, Nikita and Stelmakh, Ivan and Ustalov, Dmitry},
  title = {{CrowdSpeech and Vox~DIY: Benchmark Dataset for Crowdsourced Audio Transcription}},
  year = {2021},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  eprint = {2107.01091},
  eprinttype = {arxiv},
  eprintclass = {cs.SD},
  url = {https://openreview.net/forum?id=3_hgF1NAXU7},
  language = {english},
  pubstate = {forthcoming},
}
```
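
The data fields and the WER metric described in this card can be illustrated with a short Python sketch. It is not taken from the CrowdSpeech repository: the helper names (`split_field`, `word_error_rate`, `score_instance`) and the toy instance are hypothetical, and WER is assumed here to be the plain word-level edit distance divided by the number of reference words, with no text normalization. The paper may compute or aggregate the metric differently.

```python
# Illustrative sketch only: helper names and the toy instance are hypothetical.


def split_field(field: str) -> list[str]:
    """Split a '|'-separated field into a list of stripped items."""
    return [item.strip() for item in field.split("|")]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and no further normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of
    # ref and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def score_instance(instance: dict) -> dict[str, float]:
    """Map each performer identifier to the WER of their transcription."""
    transcriptions = split_field(instance["transcriptions"])
    performers = split_field(instance["performers"])
    if len(transcriptions) != len(performers):
        raise ValueError("transcriptions and performers differ in length")
    return {
        performer: word_error_rate(instance["gt"], text)
        for performer, text in zip(performers, transcriptions)
    }


# Toy instance with the same fields as a CrowdSpeech record (made-up values).
toy = {
    "task": "https://example.com/audio/0.mp3",
    "transcriptions": "the bird on the bough | the bird on the bow | a bird on the bowl",
    "performers": "1 | 2 | 3",
    "gt": "the bird on the bough",
}
scores = score_instance(toy)
print(scores)
```

Splitting on '|' and stripping whitespace keeps each transcription aligned with its performer identifier by position, which is how the card describes the two fields as corresponding to each other.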


Provider:

maas

Created:

2025-09-15
