
leslyarun/c4_200m_gec_train100k_test25k

Source: Hugging Face. Updated 2022-10-26. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/leslyarun/c4_200m_gec_train100k_test25k
Resource description:
---
language:
- en
source_datasets:
- allenai/c4
task_categories:
- text-generation
pretty_name: C4 200M Grammatical Error Correction Dataset
tags:
- grammatical-error-correction
---

# C4 200M

# Dataset Summary

C4 200M Sample Dataset adapted from https://huggingface.co/datasets/liweili/c4_200m

C4_200m is a collection of 185 million sentence pairs generated from the cleaned English dataset from C4. This dataset can be used in grammatical error correction (GEC) tasks.

The corruption edits and scripts used to synthesize this dataset are referenced from: [C4_200M Synthetic Dataset](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction)

# Description

As noted above, C4_200m contains 185 million sentence pairs. Each record has two attributes: `input` (the corrupted sentence) and `output` (the corrected sentence). Here is a sample from the dataset:

```
{
  "input": "Bitcoin is for $7,094 this morning, which CoinDesk says.",
  "output": "Bitcoin goes for $7,094 this morning, according to CoinDesk."
}
```
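To make the relationship between `input` and `output` concrete, the sketch below lists the word-level edits that turn the sample's `input` into its `output`. It is only an illustration using Python's standard `difflib`; the helper name `word_edits` and the whitespace tokenization are assumptions made here, not part of the dataset or of the scripts that generated it.

```python
import difflib


def word_edits(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return (operation, source_span, target_span) for each differing word span.

    Hypothetical helper: tokenization is plain whitespace splitting, which is
    an assumption for illustration, not the dataset's own tokenization.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    matcher = difflib.SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replace / insert / delete spans
            edits.append((op, " ".join(src_tokens[i1:i2]), " ".join(tgt_tokens[j1:j2])))
    return edits


# The sample record shown in the dataset card.
record = {
    "input": "Bitcoin is for $7,094 this morning, which CoinDesk says.",
    "output": "Bitcoin goes for $7,094 this morning, according to CoinDesk.",
}

for edit in word_edits(record["input"], record["output"]):
    print(edit)
```

A diff like this is a quick way to inspect which kinds of corruption a pair contains before using it for GEC training or evaluation.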
Provider:
leslyarun
Summary of original information

Dataset overview

Dataset name

C4 200M Grammatical Error Correction Dataset

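Data sources

  • Original dataset: allenai/c4
  • Sample dataset: https://huggingface.co/datasets/liweili/c4_200m

Data content

  • Type: 185 million sentence pairs
  • Use: grammatical error correction (GEC) tasks

Data structure

  • Each record has two attributes: input and output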
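The two-attribute record layout can be written down as a small type with a validity check, as sketched below. The names `GecPair` and `is_valid_pair` are hypothetical, and the rule that both fields must be non-empty strings is an assumption for illustration rather than something the dataset card states.

```python
from typing import TypedDict


class GecPair(TypedDict):
    """One record: a corrupted sentence and its corrected form."""
    input: str
    output: str


def is_valid_pair(record: object) -> bool:
    """Check that a record has exactly the two expected string fields.

    Assumption: empty or whitespace-only fields are treated as invalid.
    """
    if not isinstance(record, dict):
        return False
    if set(record) != {"input", "output"}:
        return False
    return all(
        isinstance(record[key], str) and bool(record[key].strip())
        for key in ("input", "output")
    )


sample: GecPair = {
    "input": "Bitcoin is for $7,094 this morning, which CoinDesk says.",
    "output": "Bitcoin goes for $7,094 this morning, according to CoinDesk.",
}
print(is_valid_pair(sample))
```

A check of this kind can be run over a downloaded split to drop malformed rows before training.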

Sample data

{
  "input": "Bitcoin is for $7,094 this morning, which CoinDesk says.",
  "output": "Bitcoin goes for $7,094 this morning, according to CoinDesk."
}

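Data generation method

  • Corruption edits and scripts: referenced from the C4_200M Synthetic Dataset repository (https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction)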
