BEE-spoke-data/coedit-reworded-deduped

Name: BEE-spoke-data/coedit-reworded-deduped
Creator: BEE-spoke-data
Published: 2024-02-25 08:56:17
License: apache-2.0

Source: Hugging Face. Updated 2024-02-25; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/BEE-spoke-data/coedit-reworded-deduped

Description:

---
license: apache-2.0
dataset_info:
- config_name: dedup-by-target
  features:
  - name: task
    dtype: string
  - name: id
    dtype: string
  - name: original_instruction
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23629242
    num_examples: 79943
  download_size: 11836738
  dataset_size: 23629242
- config_name: dedup-input
  features:
  - name: task
    dtype: string
  - name: id
    dtype: string
  - name: original_instruction
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23457166
    num_examples: 79293
  download_size: 11795306
  dataset_size: 23457166
- config_name: default
  features:
  - name: task
    dtype: string
  - name: id
    dtype: string
  - name: original_instruction
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: update_type
    dtype: string
  splits:
  - name: train
    num_bytes: 25021311
    num_examples: 79943
  download_size: 11862526
  dataset_size: 25021311
configs:
- config_name: dedup-by-target
  data_files:
  - split: train
    path: dedup-by-target/train-*
- config_name: dedup-input
  data_files:
  - split: train
    path: dedup-input/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
source_datasets: chargoddard/coedit-reworded
---

# BEE-spoke-data/coedit-reworded-deduped

MinHash deduplication on the `target` column.

Source data from [coedit-reworded](https://hf.co/chargoddard/coedit-reworded)

## load

```python
from datasets import load_dataset

dataset = load_dataset("BEE-spoke-data/coedit-reworded-deduped", revision="refs/convert/parquet")
dataset
```

output:

```python
DatasetDict({
    train: Dataset({
        features: ['task', 'id', 'original_instruction', 'instruction', 'input', 'output'],
        num_rows: 79943
    })
})
```

## Citation

Original dataset courtesy of Grammarly:

```
@article{raheja2023coedit,
      title={CoEdIT: Text Editing by Task-Specific Instruction Tuning},
      author={Vipul Raheja and Dhruv Kumar and Ryan Koo and Dongyeop Kang},
      year={2023},
      eprint={2305.09857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
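
The card states that the rows were deduplicated with MinHash but does not describe the pipeline. The following is only a rough, standard-library sketch of the general idea, not the method used to build this dataset. The signature length, the word 3-gram shingles, the 0.7 threshold, and the choice of the `output` field as the text to compare are all assumptions made for illustration.

```python
import hashlib

NUM_PERM = 64  # assumed signature length, not taken from the dataset card


def shingles(text, n=3):
    """Return the set of word n-grams in text (the whole text if it is shorter)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """For each salted hash function, keep the smallest hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup_by_text(rows, key="output", threshold=0.7):
    """Keep a row only if its text is not too similar to any row kept earlier."""
    kept, kept_signatures = [], []
    for row in rows:
        sig = minhash_signature(row[key])
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(row)
            kept_signatures.append(sig)
    return kept


# Hypothetical rows, written for this example only.
rows = [
    {"id": "1", "output": "She went to the market yesterday."},
    {"id": "2", "output": "She went to the market yesterday."},
    {"id": "3", "output": "The results were published in a journal."},
]
kept = dedup_by_text(rows)
print([row["id"] for row in kept])
```

This sketch compares every row against every kept row, which is quadratic in the number of rows. Pipelines that work at the scale of roughly 80,000 examples usually add locality-sensitive hashing on top of the signatures so that only likely matches are compared.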

Provider:

BEE-spoke-data

Summary of original information

Dataset overview

License

Apache 2.0

Dataset configurations

dedup-by-target

Features:
- task: string
- id: string
- original_instruction: string
- instruction: string
- input: string
- output: string
Splits:
- train:
  - Size (bytes): 23629242
  - Number of examples: 79943
Download size (bytes): 11836738
Dataset size (bytes): 23629242

dedup-input

Features:
- task: string
- id: string
- original_instruction: string
- instruction: string
- input: string
- output: string
Splits:
- train:
  - Size (bytes): 23457166
  - Number of examples: 79293
Download size (bytes): 11795306
Dataset size (bytes): 23457166

default

Features:
- task: string
- id: string
- original_instruction: string
- instruction: string
- input: string
- output: string
- update_type: string
Splits:
- train:
  - Size (bytes): 25021311
  - Number of examples: 79943
Download size (bytes): 11862526
Dataset size (bytes): 25021311

Data files

dedup-by-target:
- train: dedup-by-target/train-*
dedup-input:
- train: dedup-input/train-*
default:
- train: data/train-*
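
Each configuration listed above can be requested by name when loading. The sketch below assumes the `datasets` package is installed and that the Hub is reachable when `load_config` is called. The helper names `CONFIG_ROWS`, `load_config`, and `row_difference` are hypothetical and exist only for this example; the row counts are copied from the configuration summary.

```python
# Row counts of the train split, copied from the configuration summary.
CONFIG_ROWS = {
    "default": 79943,
    "dedup-by-target": 79943,
    "dedup-input": 79293,
}

REPO_ID = "BEE-spoke-data/coedit-reworded-deduped"


def load_config(name="default", split="train"):
    """Load one configuration by name (needs `datasets` and network access)."""
    if name not in CONFIG_ROWS:
        raise ValueError(f"unknown config {name!r}; expected one of {sorted(CONFIG_ROWS)}")
    # Imported lazily so the lookup table above stays usable offline.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split)


def row_difference(config_a, config_b):
    """Difference in listed train rows between two configurations."""
    return CONFIG_ROWS[config_a] - CONFIG_ROWS[config_b]


# According to the listed counts, dedup-input has fewer rows than default.
print(row_difference("default", "dedup-input"))
```

Calling `load_config("dedup-input")` would then return the train split of that configuration, and its length can be checked against the listed count.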

Source dataset

chargoddard/coedit-reworded