BEE-spoke-data/coedit-reworded-deduped
Source: Hugging Face. Updated 2024-02-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/coedit-reworded-deduped
Resource description:
---
license: apache-2.0
dataset_info:
- config_name: dedup-by-target
  features:
  - name: task
    dtype: string
  - name: id
    dtype: string
  - name: original_instruction
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23629242
    num_examples: 79943
  download_size: 11836738
  dataset_size: 23629242
- config_name: dedup-input
  features:
  - name: task
    dtype: string
  - name: id
    dtype: string
  - name: original_instruction
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23457166
    num_examples: 79293
  download_size: 11795306
  dataset_size: 23457166
- config_name: default
  features:
  - name: task
    dtype: string
  - name: id
    dtype: string
  - name: original_instruction
    dtype: string
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: update_type
    dtype: string
  splits:
  - name: train
    num_bytes: 25021311
    num_examples: 79943
  download_size: 11862526
  dataset_size: 25021311
configs:
- config_name: dedup-by-target
  data_files:
  - split: train
    path: dedup-by-target/train-*
- config_name: dedup-input
  data_files:
  - split: train
    path: dedup-input/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
source_datasets: chargoddard/coedit-reworded
---
# BEE-spoke-data/coedit-reworded-deduped
This dataset applies MinHash deduplication to the `target` column of the source data, [coedit-reworded](https://hf.co/chargoddard/coedit-reworded).
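The card does not say which MinHash implementation, shingle size, or similarity threshold was used. As a rough illustration of the idea only, here is a minimal pure-Python sketch. The constants `NUM_PERM`, `SHINGLE`, and `THRESHOLD`, the helper names, and the toy rows are all assumptions made for this example, not the settings behind this dataset.

```python
import hashlib

NUM_PERM = 64    # assumed number of seeded hash functions
SHINGLE = 3      # assumed word n-gram size
THRESHOLD = 0.8  # assumed cutoff on the estimated Jaccard similarity


def shingles(text: str, n: int = SHINGLE) -> set[str]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text: str, num_perm: int = NUM_PERM) -> tuple[int, ...]:
    """For each seeded hash, keep the minimum hash value over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(rows: list[dict], column: str = "target") -> list[dict]:
    """Keep a row only if its signature is not too close to any kept row."""
    kept, signatures = [], []
    for row in rows:
        sig = minhash(row[column])
        if all(similarity(sig, s) < THRESHOLD for s in signatures):
            kept.append(row)
            signatures.append(sig)
    return kept


# Toy rows (hypothetical); the second is an exact duplicate of the first.
rows = [
    {"target": "The cat sat on the mat near the door today."},
    {"target": "The cat sat on the mat near the door today."},
    {"target": "Quarterly revenue grew faster than analysts expected."},
]
deduped = dedup(rows)
print(len(deduped))
```

The pairwise comparison here is quadratic in the number of rows, which is fine for a toy example. At the scale of roughly 80,000 rows, a pipeline would more likely bucket signatures with locality-sensitive hashing so that only candidate pairs are compared.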
## Load
```python
from datasets import load_dataset

# Load from the Hub's auto-converted Parquet branch
dataset = load_dataset("BEE-spoke-data/coedit-reworded-deduped", revision="refs/convert/parquet")
dataset
```
output:
```python
DatasetDict({
    train: Dataset({
        features: ['task', 'id', 'original_instruction', 'instruction', 'input', 'output'],
        num_rows: 79943
    })
})
```
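The metadata lists three configs (`default`, `dedup-by-target`, `dedup-input`), while the snippet above loads without naming one. A minimal sketch of selecting a config by name follows. `CONFIG_ROWS`, `REPO_ID`, and `load_config` are hypothetical helper names introduced here; the row counts are copied from the card's YAML metadata, and the actual download needs network access and the `datasets` library.

```python
# Train-split row counts, copied from the card's YAML metadata.
CONFIG_ROWS = {
    "default": 79943,
    "dedup-by-target": 79943,
    "dedup-input": 79293,
}

REPO_ID = "BEE-spoke-data/coedit-reworded-deduped"


def load_config(name: str = "default"):
    """Load the train split of one named config from the Hub."""
    if name not in CONFIG_ROWS:
        raise ValueError(
            f"unknown config {name!r}; expected one of {sorted(CONFIG_ROWS)}"
        )
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, name, split="train")


# Difference in row counts between the two deduplicated configs, per the metadata.
extra_rows = CONFIG_ROWS["dedup-by-target"] - CONFIG_ROWS["dedup-input"]
print(extra_rows)  # 650
```

For example, `load_config("dedup-input")` should return a `Dataset` whose `num_rows` can be checked against `CONFIG_ROWS["dedup-input"]`, assuming the hosted files still match the metadata.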
## Citation
Original dataset courtesy of Grammarly:
```
@article{raheja2023coedit,
  title={CoEdIT: Text Editing by Task-Specific Instruction Tuning},
  author={Vipul Raheja and Dhruv Kumar and Ryan Koo and Dongyeop Kang},
  year={2023},
  eprint={2305.09857},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Provider:
BEE-spoke-data
Summary of original information
Dataset overview
License
- Apache 2.0
Dataset configurations
dedup-by-target
- Features:
  - task: string
  - id: string
  - original_instruction: string
  - instruction: string
  - input: string
  - output: string
- Splits:
  - train:
    - Bytes: 23629242
    - Examples: 79943
- Download size: 11836738
- Dataset size: 23629242
dedup-input
- Features:
  - task: string
  - id: string
  - original_instruction: string
  - instruction: string
  - input: string
  - output: string
- Splits:
  - train:
    - Bytes: 23457166
    - Examples: 79293
- Download size: 11795306
- Dataset size: 23457166
default
- Features:
  - task: string
  - id: string
  - original_instruction: string
  - instruction: string
  - input: string
  - output: string
  - update_type: string
- Splits:
  - train:
    - Bytes: 25021311
    - Examples: 79943
- Download size: 11862526
- Dataset size: 25021311
Data files
- dedup-by-target:
  - train: dedup-by-target/train-*
- dedup-input:
  - train: dedup-input/train-*
- default:
  - train: data/train-*
Source dataset
- chargoddard/coedit-reworded



