leslyarun/c4_200m_gec_train100k_test25k
Source: Hugging Face. Updated: 2022-10-26. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/leslyarun/c4_200m_gec_train100k_test25k
Resource summary:
---
language:
- en
source_datasets:
- allenai/c4
task_categories:
- text-generation
pretty_name: C4 200M Grammatical Error Correction Dataset
tags:
- grammatical-error-correction
---
# C4 200M
# Dataset Summary
A sample of the C4 200M dataset, adapted from https://huggingface.co/datasets/liweili/c4_200m
C4_200m is a collection of 185 million sentence pairs generated from the cleaned English dataset from C4. This dataset can be used in grammatical error correction (GEC) tasks.
The corruption edits and scripts used to synthesize this dataset are taken from: [C4_200M Synthetic Dataset](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction)
# Description
As noted above, C4_200m contains 185 million sentence pairs. Each record has two attributes: `input` (the corrupted sentence) and `output` (the corrected sentence). Here is a sample record:
```
{
"input": "Bitcoin is for $7,094 this morning, which CoinDesk says.",
"output": "Bitcoin goes for $7,094 this morning, according to CoinDesk."
}
```
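To make the `input`/`output` structure concrete, here is a minimal sketch that lists the word-level differences between the two fields of the sample record. It uses only the Python standard library. The helper name `word_edits` and the whitespace tokenization are assumptions made for illustration and are not part of the dataset or its tooling.

```python
import difflib
import json

# Sample record copied from the dataset card above.
record = json.loads("""
{
  "input": "Bitcoin is for $7,094 this morning, which CoinDesk says.",
  "output": "Bitcoin goes for $7,094 this morning, according to CoinDesk."
}
""")


def word_edits(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return (operation, source_words, target_words) for each differing span.

    Hypothetical helper: it splits on whitespace, so punctuation stays
    attached to the neighbouring word.
    """
    src, tgt = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, " ".join(src[i1:i2]), " ".join(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


edits = word_edits(record["input"], record["output"])
for op, before, after in edits:
    print(f"{op}: {before!r} -> {after!r}")
```

A diff like this is only a rough view of the corrections. A GEC model would typically be trained on the full `input` to `output` mapping rather than on extracted edits.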
Provider:
leslyarun
Summary of Original Information
Dataset Overview
Dataset Name
C4 200M Grammatical Error Correction Dataset
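Data Sources
- Original dataset: allenai/c4
- Sample dataset: https://huggingface.co/datasets/liweili/c4_200m
Data Content
- Type: 185 million sentence pairs
- Purpose: grammatical error correction (GEC) tasks
Data Structure
- Each record contains two attributes: `input` and `output`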



