msakota/edisum_dataset
Hugging Face | Updated 2024-04-05 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/msakota/edisum_dataset
Resource description:
---
license: mit
arXiv: https://arxiv.org/pdf/2404.03428.pdf
language:
- en
pretty_name: Edisum
configs:
- config_name: wikipedia_processed_data
  data_files:
  - split: train
    path: "filtered-min30-enwiki-08-2023-data/train.csv"
  - split: validation
    path: "filtered-min30-enwiki-08-2023-data/val.csv"
  - split: test
    path: "filtered-min30-enwiki-08-2023-data/test.csv"
- config_name: synthetic_data_100
  data_files:
  - split: train
    path: "100_perc_synth_data/train.csv"
  - split: validation
    path: "100_perc_synth_data/val.csv"
  - split: test
    path: "100_perc_synth_data/test.csv"
- config_name: synthetic_data_75
  data_files:
  - split: train
    path: "75_perc_synth_data/train.csv"
  - split: validation
    path: "75_perc_synth_data/val.csv"
  - split: test
    path: "75_perc_synth_data/test.csv"
- config_name: synthetic_data_50
  data_files:
  - split: train
    path: "50_perc_synth_data/train.csv"
  - split: validation
    path: "50_perc_synth_data/val.csv"
  - split: test
    path: "50_perc_synth_data/test.csv"
- config_name: synthetic_data_25
  data_files:
  - split: train
    path: "25_perc_synth_data/train.csv"
  - split: validation
    path: "25_perc_synth_data/val.csv"
  - split: test
    path: "25_perc_synth_data/test.csv"
---
# Dataset Card for Edisum
## Dataset Description
For more details:
- **Github repository**: https://github.com/epfl-dlab/edisum
- **Paper**: https://arxiv.org/pdf/2404.03428.pdf
### Languages
Edisum contains only data collected from English Wikipedia. Consequently, the synthetic data is also generated only in English.
### Dataset Structure
The Edisum meta-dataset comprises five datasets, each exposed as a separate config:
- **wikipedia_processed_data** (Filtered existing Wikipedia data)
- **synthetic_data_100** (Fully synthetic data)
- **synthetic_data_75** (Mixed dataset with 75% synthetic data)
- **synthetic_data_50** (Mixed dataset with 50% synthetic data)
- **synthetic_data_25** (Mixed dataset with 25% synthetic data)
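The config names and file paths declared in the card's YAML header can be sketched as a small lookup. The helper names `split_path` and `load_edisum` are hypothetical and introduced here only for illustration. The loader assumes the `datasets` library is installed and the Hub is reachable, so it is defined but not called.

```python
# Config names and directory prefixes copied from the card's YAML header.
CONFIG_DIRS = {
    "wikipedia_processed_data": "filtered-min30-enwiki-08-2023-data",
    "synthetic_data_100": "100_perc_synth_data",
    "synthetic_data_75": "75_perc_synth_data",
    "synthetic_data_50": "50_perc_synth_data",
    "synthetic_data_25": "25_perc_synth_data",
}

# Split names as declared in the card, mapped to their CSV file names.
SPLIT_FILES = {"train": "train.csv", "validation": "val.csv", "test": "test.csv"}


def split_path(config: str, split: str) -> str:
    """Return the repository-relative CSV path for a config and split."""
    return f"{CONFIG_DIRS[config]}/{SPLIT_FILES[split]}"


def load_edisum(config: str = "wikipedia_processed_data"):
    """Load one config from the Hub (hypothetical helper; needs network access)."""
    # Imported lazily so the path helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("msakota/edisum_dataset", config)
```

Calling `load_edisum("synthetic_data_50")` would be expected to return the train, validation, and test splits of the 50% synthetic mix, provided the repository layout still matches the card.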
### Data Fields
Each field is listed below with its description. Fields marked *(only existing data)* appear only in the existing Wikipedia data.
- `page_id`: A unique identifier of the Wikipedia page on which the edit was performed
- `revision_id`: A unique identifier of the revision of the Wikipedia page tied to the edit that was performed
- `summary`: Edit summary associated with the given edit
- `prev_texts`: List of sentences that were removed from the revision immediately before the edit
- `cur_texts`: List of sentences that were added to the revision immediately after the edit
- *(only existing data)* `edit_types`: Types of changes performed during the edit (e.g. sentence-level change or node-level change); used to choose candidates for synthetic data generation
- *(only existing data)* `user_count`: Number of edits performed by the editor who made the edit
- *(only existing data)* `edit_count`: Number of edits performed on that Wikipedia page so far
- *(only existing data)* `summary_length`: Length of the edit summary in characters
- *(only existing data)* `summary_count`: Number of matching summaries in the dataset
- *(only existing data)* `likely_canned`: Whether the edit summary was likely generated with the canned edit summary tool
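As a rough illustration of how the five shared fields might be read from one of the CSV files, the sketch below parses rows into a small record type. It rests on two assumptions that the card does not confirm: that `page_id` and `revision_id` are integers, and that the list-valued columns are serialized as Python-literal lists. The `EditRecord` type, the helper names, and the sample row are all made up for illustration.

```python
import ast
import csv
import io
from dataclasses import dataclass


@dataclass
class EditRecord:
    """The five fields shared by all configs (hypothetical container type)."""
    page_id: int
    revision_id: int
    summary: str
    prev_texts: list
    cur_texts: list


def parse_sentence_list(cell: str) -> list:
    """Decode a list-valued cell.

    ASSUMPTION: lists are stored as Python literals such as "['a', 'b']";
    adjust this if the files use another serialization (e.g. JSON).
    """
    value = ast.literal_eval(cell) if cell.strip() else []
    return [str(sentence) for sentence in value]


def read_edits(csv_text: str) -> list:
    """Parse CSV text with a header row into EditRecord objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        EditRecord(
            page_id=int(row["page_id"]),
            revision_id=int(row["revision_id"]),
            summary=row["summary"],
            prev_texts=parse_sentence_list(row["prev_texts"]),
            cur_texts=parse_sentence_list(row["cur_texts"]),
        )
        for row in reader
    ]


# Made-up row for illustration only; it is not taken from the dataset.
SAMPLE = '''page_id,revision_id,summary,prev_texts,cur_texts
1,2,fix typo,"['Teh cat sat.']","['The cat sat.']"
'''

edits = read_edits(SAMPLE)
```

Pairing `prev_texts` with `cur_texts` in this way gives the removed and added sentences that an edit summary is meant to describe.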
### Licensing Information
The dataset is licensed under the terms of the MIT license.
### Citation Information
```
@article{šakota2024edisum,
title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
journal={arXiv preprint arXiv:2404.03428},
year={2024}
}
```
Provider:
msakota