
JDaniel423/running-records-errors-dataset

Hugging Face | Updated: 2023-05-02 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/JDaniel423/running-records-errors-dataset
Resource description:
---
license: cc-by-4.0
task_categories:
- token-classification
language:
- en
size_categories:
- 100K<n<1M
tags:
- education
dataset_info:
  features:
  - name: audio_path
    dtype: string
  - name: asr_transcript
    dtype: string
  - name: original_text
    dtype: string
  - name: mutated_text
    dtype: string
  - name: index_tags
    dtype: string
  - name: mutated_tags
    dtype: string
  splits:
  - name: DEL
    num_bytes: 208676326
    num_examples: 351867
  - name: SUB
    num_bytes: 243003228
    num_examples: 351867
  - name: REP
    num_bytes: 303304320
    num_examples: 351867
  download_size: 0
  dataset_size: 754983874
---

# Dataset Card for Running Records Errors Dataset

## Dataset Description

- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The Running Records Errors dataset is an English-language dataset containing 1,055,601 sentences based on the Europarl corpus. As described in our paper, we take sentences from the English version of the Europarl corpus and randomly inject three types of errors into them: *repetitions*, where certain words or phrases are repeated; *substitutions*, where certain words are replaced with a different word; and *deletions*, where certain words are omitted entirely. The mutated sentences are then passed through a TTS pipeline consisting of a TacoTron2 model and a HifiGAN model to produce audio recordings. Lastly, the audio is passed to a Quartznet 15x5 model, which produces a transcript of the spoken audio.

### Supported Tasks and Leaderboards

The original purpose of this dataset was to build a model pipeline that could score running records assessments given a transcript of a child's speech along with the true text for that assessment. However, we provide this dataset to support other tasks involving error detection in text.

### Languages

All of the data in the dataset is in English.

## Dataset Structure

### Data Instances

Each instance contains a string for the audio transcript, a string for the original text before any errors were added, and a string for the sentence with the errors we generated. In addition, we provide two lists: one denotes the original position of each word in the mutated text, and the other denotes the error applied to that word.

### Data Fields

- `asr_transcript`: The transcript of the audio, produced by our Quartznet 15x5 model.
- `original_text`: The original text from the Europarl corpus. This text contains no artificial errors.
- `mutated_text`: The text containing the errors we injected.
- `index_tags`: A list denoting the original position of each word in `mutated_text`.
- `mutated_tags`: A list denoting the error applied to each word in `mutated_text`.

### Data Splits

- DEL: Sentences that have had random words removed.
- REP: Sentences that have had repetitions inserted.
- SUB: Sentences that have had words randomly substituted.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset was generated with the guidance of Carl Ehrett.
Provider:
JDaniel423
Summary of original information

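Dataset Card for Running Records Errors Dataset

Dataset description

Dataset overview

The Running Records Errors dataset is an English-language dataset containing 1,055,601 sentences based on the Europarl corpus. As described in the paper, sentences are taken from the English version of the Europarl corpus and three types of errors are randomly injected: repetitions (certain words or phrases are repeated), substitutions (certain words are replaced with a different word), and deletions (words are omitted entirely). The sentences are then passed through a TTS pipeline (consisting of TacoTron2 and HifiGAN models) to produce audio recordings. Finally, a Quartznet 15x5 model produces a transcript of the audio.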
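The three error types described above can be illustrated with a minimal sketch. This is not the authors' generation code: the function names, the example sentence, the substitution vocabulary, and the `max_span` limit on repeated phrases are all assumptions made for illustration.

```python
import random

def inject_deletion(words, rng):
    """Drop one randomly chosen word (hypothetical helper)."""
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]

def inject_substitution(words, rng, vocabulary):
    """Replace one randomly chosen word with a different word from `vocabulary`."""
    i = rng.randrange(len(words))
    candidates = [w for w in vocabulary if w != words[i]]
    return words[:i] + [rng.choice(candidates)] + words[i + 1:]

def inject_repetition(words, rng, max_span=2):
    """Repeat a randomly chosen word or short phrase immediately after itself."""
    start = rng.randrange(len(words))
    end = min(len(words), start + rng.randint(1, max_span))
    return words[:end] + words[start:end] + words[end:]

# Illustrative inputs only; these are not taken from the dataset.
rng = random.Random(0)
words = "the committee will vote on the proposal tomorrow".split()
vocabulary = ["parliament", "report", "council", "debate"]

deleted = inject_deletion(words, rng)
substituted = inject_substitution(words, rng, vocabulary)
repeated = inject_repetition(words, rng)

print(" ".join(deleted))
print(" ".join(substituted))
print(" ".join(repeated))
```

Each helper applies a single error; the actual dataset may inject several errors per sentence, which the card does not specify.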

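Supported tasks and leaderboards

The dataset was originally intended for building a model pipeline that scores running records assessments given a transcript of a child's speech and the true text. It is also provided to support other tasks involving error detection in text.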
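To make the scoring task concrete, here is a rough baseline sketch that compares a transcript against the true text word by word. It uses Python's standard `difflib.SequenceMatcher` rather than a trained model, so it is a swapped-in technique and not the pipeline the authors describe; the function name `word_level_errors` and the example sentences are assumptions.

```python
from difflib import SequenceMatcher

def word_level_errors(reference, transcript):
    """List the word-level differences between the true text and a transcript.

    Returns tuples of (operation, reference_words, transcript_words), where
    operation is one of 'delete', 'insert', or 'replace'.
    """
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # matching stretches are not errors
        errors.append((op, ref[i1:i2], hyp[j1:j2]))
    return errors

# Illustrative sentences only; these are not taken from the dataset.
omitted = word_level_errors("the cat sat on the mat", "the cat on the mat")
swapped = word_level_errors("the cat sat", "the dog sat")
print(omitted)
print(swapped)
```

In this sketch a repetition in the transcript shows up as an `insert` operation, so mapping the operations onto the dataset's DEL, SUB, and REP labels would need extra logic.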

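Languages

All of the data in the dataset is in English.

Dataset structure

Data instances

Each instance contains a string for the audio transcript, a string for the original text before any errors were added, and a string for the generated sentence with errors. Two lists are also provided: one denotes the original position of each word in the mutated text, and the other denotes the error applied to that word.

Data fields

  • asr_transcript: The transcript of the audio, produced by the Quartznet 15x5 model.
  • original_text: The original text from the Europarl corpus. This text contains no artificial errors.
  • mutated_text: The text containing the injected errors.
  • index_tags: A list denoting the original position of each word in mutated_text.
  • mutated_tags: A list denoting the error applied to each word in mutated_text.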
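The sketch below shows one way the two tag lists might be lined up with the words of `mutated_text`. The metadata stores `index_tags` and `mutated_tags` as strings, and the card does not document how the lists are serialized, so the Python-style list literals, the tag labels (`"O"`, `"REP"`), and the whole sample record here are assumptions, not real dataset rows.

```python
import ast

def align_tags(record):
    """Pair each word of mutated_text with its original position and error tag.

    Assumes (not confirmed by the card) that both tag fields are strings
    holding Python-style list literals with one entry per word.
    """
    words = record["mutated_text"].split()
    positions = ast.literal_eval(record["index_tags"])
    errors = ast.literal_eval(record["mutated_tags"])
    if not (len(words) == len(positions) == len(errors)):
        raise ValueError("tag lists must have one entry per word in mutated_text")
    return list(zip(words, positions, errors))

# Hypothetical record, invented for illustration only.
sample = {
    "original_text": "the cat sat",
    "mutated_text": "the the cat sat",
    "index_tags": "[0, 0, 1, 2]",
    "mutated_tags": "['O', 'REP', 'O', 'O']",
}

aligned = align_tags(sample)
for word, position, error in aligned:
    print(word, position, error)
```

If the real fields use a different serialization (for example, space-separated tokens), only the two parsing lines would need to change.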

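Data splits

  • DEL: Sentences that have had random words removed.
  • REP: Sentences that have had repetitions inserted.
  • SUB: Sentences that have had words randomly substituted.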
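As a quick sanity check, the split sizes listed in the card's metadata can be added up and compared with the stated totals. The `load_split` helper is a hedged sketch that assumes the Hugging Face `datasets` library is installed and that the repository loads by split name; the metadata reports `download_size: 0`, so loading may not work as written.

```python
# Split sizes copied from the dataset card's metadata.
SPLITS = {
    "DEL": {"num_examples": 351867, "num_bytes": 208676326},
    "SUB": {"num_examples": 351867, "num_bytes": 243003228},
    "REP": {"num_examples": 351867, "num_bytes": 303304320},
}

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

print(total_examples)  # compare with the 1,055,601 sentences in the summary
print(total_bytes)     # compare with dataset_size in the metadata

def load_split(name):
    """Load one split by name (assumes `datasets` is installed and the repo is loadable)."""
    from datasets import load_dataset  # imported lazily so the check above runs without it
    return load_dataset("JDaniel423/running-records-errors-dataset", split=name)
```

Calling `load_split("DEL")` would then return the deletion split, provided the assumptions above hold.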

Dataset creation

Dataset curators

This dataset was generated with the guidance of Carl Ehrett.
