JDaniel423/running-records-errors-dataset
Source: Hugging Face. Updated 2023-05-02; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/JDaniel423/running-records-errors-dataset
Resource description:
---
license: cc-by-4.0
task_categories:
- token-classification
language:
- en
size_categories:
- 100K<n<1M
tags:
- education
dataset_info:
  features:
  - name: audio_path
    dtype: string
  - name: asr_transcript
    dtype: string
  - name: original_text
    dtype: string
  - name: mutated_text
    dtype: string
  - name: index_tags
    dtype: string
  - name: mutated_tags
    dtype: string
  splits:
  - name: DEL
    num_bytes: 208676326
    num_examples: 351867
  - name: SUB
    num_bytes: 243003228
    num_examples: 351867
  - name: REP
    num_bytes: 303304320
    num_examples: 351867
  download_size: 0
  dataset_size: 754983874
---
# Dataset Card for Running Records Errors Dataset
## Dataset Description
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The Running Records Errors dataset is an English-language dataset containing 1,055,601 sentences based on the Europarl corpus. As described in our paper,
we take sentences from the English version of the Europarl corpus and randomly inject three types of errors into them: *repetitions*, where
certain words or phrases are repeated; *substitutions*, where certain words are replaced with different words; and *deletions*, where a word is
omitted entirely. The mutated sentences are then passed through a TTS pipeline consisting of Tacotron 2 and HiFi-GAN models to produce audio recordings. Lastly,
the audio is passed to a QuartzNet 15x5 model, which produces a transcript of the spoken audio.
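The three error types can be illustrated with a small sketch. This is not the generation code used for the dataset: the function name `inject_errors`, the `rate` parameter, the replacement vocabulary, and the tag labels (`OK`, `SUB`, `REP`) are all assumptions made for illustration, and the TTS and ASR stages are not covered.

```python
import random

def inject_errors(sentence, kind, rate=0.2, vocab=("the", "a", "of"), seed=0):
    """Return (mutated_tokens, index_tags, mutated_tags) for one sentence.

    kind is "DEL", "SUB", or "REP". The tag labels and the way deletions are
    represented (as gaps in index_tags) are assumptions, not the dataset's format.
    vocab must hold at least two distinct words so a substitute always exists.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    out_tokens, index_tags, mutated_tags = [], [], []
    for i, tok in enumerate(tokens):
        hit = rng.random() < rate  # decide whether this word is mutated
        if hit and kind == "DEL":
            continue  # deletion: the word is omitted entirely
        if hit and kind == "SUB":
            # substitution: swap in a different word from the vocabulary
            out_tokens.append(rng.choice([w for w in vocab if w != tok]))
            index_tags.append(i)
            mutated_tags.append("SUB")
            continue
        out_tokens.append(tok)
        index_tags.append(i)
        mutated_tags.append("OK")
        if hit and kind == "REP":
            # repetition: emit the same word a second time
            out_tokens.append(tok)
            index_tags.append(i)
            mutated_tags.append("REP")
    return out_tokens, index_tags, mutated_tags

if __name__ == "__main__":
    for kind in ("DEL", "SUB", "REP"):
        print(kind, inject_errors("we resume the session after the break", kind, rate=0.3))
```

In this sketch each output word carries the position of the source word it came from, which mirrors the role the card gives to `index_tags`.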
### Supported Tasks and Leaderboards
The original purpose of this dataset was to build a model pipeline that could score running records assessments given a transcript of a child's speech along with
the true text for that assessment. However, we also provide this dataset to support other tasks involving error detection in text.
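As a point of comparison for such error detection tasks, a rule-based baseline can align a transcript against the true text and label the differences. The sketch below uses Python's standard `difflib.SequenceMatcher`; it is not the model pipeline described above, and the function names and label strings are assumptions.

```python
from difflib import SequenceMatcher

def _is_repeat(words, start, end):
    """True if words[start:end] duplicates the chunk right before or after it."""
    n = end - start
    chunk = words[start:end]
    before = words[max(0, start - n):start]
    after = words[end:end + n]
    return chunk == before or chunk == after

def label_reading_errors(reference, transcript):
    """Align two word sequences; return (label, reference_words, transcript_words) tuples."""
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        if op == "delete":
            label = "deletion"
        elif op == "replace":
            label = "substitution"
        else:  # op == "insert": extra words appear in the transcript
            label = "repetition" if _is_repeat(hyp, j1, j2) else "insertion"
        errors.append((label, ref[i1:i2], hyp[j1:j2]))
    return errors

if __name__ == "__main__":
    print(label_reading_errors("the cat sat on the mat", "the cat cat sat on mat"))
```

A word-level alignment like this cannot tell a reader's error from an ASR error, which is one reason a learned model trained on `asr_transcript` may be preferable.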
### Languages
All of the data in the dataset is in English.
## Dataset Structure
### Data Instances
Each instance contains a string for the ASR transcript of the audio, a string for the original text before any errors were added, and a string for the sentence with the errors we generated.
In addition, we provide two lists: one denotes the original position of each word in the mutated text, and the other denotes the error applied to that word. Both lists are typed as strings in the feature schema.
### Data Fields
- asr_transcript: The transcript of the audio produced by our QuartzNet 15x5 model.
- original_text: The original text from the Europarl corpus. This text contains no artificial errors.
- mutated_text: This text contains the errors we injected.
- index_tags: This list denotes the original position of each word in `mutated_text`.
- mutated_tags: This list denotes the error applied to each word in `mutated_text`.
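
Because `index_tags` and `mutated_tags` are typed as strings in the feature schema, they likely need to be parsed before use. The card does not document the serialization, so the sketch below is deliberately tolerant: it assumes either a Python-style list literal or a whitespace- or comma-separated string. The helper names and the sample record are hypothetical.

```python
import ast

def parse_tag_string(raw):
    """Turn a serialized tag list into a Python list.

    Tries a list/tuple literal first (e.g. "[0, 1, 3]"), then falls back to
    splitting on whitespace and commas (e.g. "0 1 3"). Both formats are guesses.
    """
    raw = raw.strip()
    try:
        value = ast.literal_eval(raw)
        if isinstance(value, (list, tuple)):
            return list(value)
    except (ValueError, SyntaxError):
        pass
    return [part for part in raw.replace(",", " ").split() if part]

def align_record(record):
    """Pair each word of mutated_text with its index tag and error tag."""
    words = record["mutated_text"].split()
    index_tags = parse_tag_string(record["index_tags"])
    mutated_tags = parse_tag_string(record["mutated_tags"])
    if not (len(words) == len(index_tags) == len(mutated_tags)):
        raise ValueError("tag lists do not line up with mutated_text")
    return list(zip(words, index_tags, mutated_tags))

if __name__ == "__main__":
    # Hypothetical record; real values and tag labels may differ.
    sample = {
        "mutated_text": "the cat cat sat",
        "index_tags": "[0, 1, 1, 2]",
        "mutated_tags": "['OK', 'OK', 'REP', 'OK']",
    }
    print(align_record(sample))
```

The length check reflects the card's statement that both lists describe each word in `mutated_text`; if real records fail it, the assumed serialization is wrong and the parser should be adjusted.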
### Data Splits
- DEL: Sentences that have had random words removed.
- REP: Sentences that have had repetitions inserted.
- SUB: Sentences that have had words randomly substituted.
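
The split sizes in the card metadata can be checked against the totals quoted elsewhere in the card. The short sketch below copies the numbers from the metadata; the variable names are illustrative only.

```python
# Per-split figures copied from the dataset_info metadata.
splits = {
    "DEL": {"num_bytes": 208_676_326, "num_examples": 351_867},
    "SUB": {"num_bytes": 243_003_228, "num_examples": 351_867},
    "REP": {"num_bytes": 303_304_320, "num_examples": 351_867},
}

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# The summary quotes 1,055,601 sentences; the metadata quotes dataset_size 754983874.
print(total_examples, total_bytes)
```

Each split has the same number of examples, and the three together account for the sentence count given in the summary.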
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
## Additional Information
### Dataset Curators
This dataset was generated with the guidance of Carl Ehrett.
Provider:
JDaniel423



