msakota/edisum_dataset

Name: msakota/edisum_dataset
Creator: msakota
Published: 2024-04-05 09:35:09
License: MIT

Source: Hugging Face. Updated 2024-04-05; indexed 2024-06-11.

Download link:

https://hf-mirror.com/datasets/msakota/edisum_dataset

Resource description:

---
license: mit
arXiv: https://arxiv.org/pdf/2404.03428.pdf
language:
- en
pretty_name: Edisum
configs:
- config_name: wikipedia_processed_data
  data_files:
  - split: train
    path: "filtered-min30-enwiki-08-2023-data/train.csv"
  - split: validation
    path: "filtered-min30-enwiki-08-2023-data/val.csv"
  - split: test
    path: "filtered-min30-enwiki-08-2023-data/test.csv"
- config_name: synthetic_data_100
  data_files:
  - split: train
    path: "100_perc_synth_data/train.csv"
  - split: validation
    path: "100_perc_synth_data/val.csv"
  - split: test
    path: "100_perc_synth_data/test.csv"
- config_name: synthetic_data_75
  data_files:
  - split: train
    path: "75_perc_synth_data/train.csv"
  - split: validation
    path: "75_perc_synth_data/val.csv"
  - split: test
    path: "75_perc_synth_data/test.csv"
- config_name: synthetic_data_50
  data_files:
  - split: train
    path: "50_perc_synth_data/train.csv"
  - split: validation
    path: "50_perc_synth_data/val.csv"
  - split: test
    path: "50_perc_synth_data/test.csv"
- config_name: synthetic_data_25
  data_files:
  - split: train
    path: "25_perc_synth_data/train.csv"
  - split: validation
    path: "25_perc_synth_data/val.csv"
  - split: test
    path: "25_perc_synth_data/test.csv"
---

# Dataset Card for Edisum

## Dataset Description

For more details:
- **GitHub repository**: https://github.com/epfl-dlab/edisum
- **Paper**: https://arxiv.org/pdf/2404.03428.pdf

### Languages

Edisum contains only data collected from English Wikipedia. Consequently, the synthetic data is also generated only in English.

### Dataset Structure

The Edisum meta-dataset comprises 5 datasets:
- **wikipedia_processed_data** (Filtered existing Wikipedia data)
- **synthetic_data_100** (Fully synthetic data)
- **synthetic_data_75** (Mixed dataset with 75% synthetic data)
- **synthetic_data_50** (Mixed dataset with 50% synthetic data)
- **synthetic_data_25** (Mixed dataset with 25% synthetic data)

### Data Fields

Here is a list of the fields, each paired with a description.

- `page_id`: A unique identifier of the Wikipedia page on which the edit was performed
- `revision_id`: A unique identifier of the revision of the Wikipedia page tied to the edit that was performed
- `summary`: Edit summary associated with the given edit
- `prev_texts`: List of sentences that were removed from the revision immediately before the edit
- `cur_texts`: List of sentences that were added to the revision immediately after the edit
- *(only existing data)* `edit_types`: Types of changes performed during the edit (e.g. sentence-level change or node-level change); used to choose candidates for synthetic data generation
- *(only existing data)* `user_count`: Number of edits performed by the editor who made the edit
- *(only existing data)* `edit_count`: Number of edits performed on that Wikipedia page so far
- *(only existing data)* `summary_length`: Length of the edit summary in characters
- *(only existing data)* `summary_count`: Number of matching summaries in the dataset
- *(only existing data)* `likely_canned`: Whether the edit summary was likely generated with the canned edit summary tool

### Licensing Information

The dataset is licensed under the terms of the MIT license.

### Citation Information

```
@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}
```

Provider:

msakota

Summary of original information

Dataset overview

Dataset name

Edisum

Dataset structure

wikipedia_processed_data (Filtered existing Wikipedia data)
synthetic_data_100 (Fully synthetic data)
synthetic_data_75 (Mixed dataset with 75% synthetic data)
synthetic_data_50 (Mixed dataset with 50% synthetic data)
synthetic_data_25 (Mixed dataset with 25% synthetic data)
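
The configurations listed above can be selected by name when loading the dataset. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID `msakota/edisum_dataset` and the config names from the card's `configs` metadata are still current. The helper names `check_request` and `load_edisum` are hypothetical and not part of the dataset or the library.

```python
# Config and split names as declared in the dataset card's `configs` metadata.
EDISUM_CONFIGS = (
    "wikipedia_processed_data",
    "synthetic_data_100",
    "synthetic_data_75",
    "synthetic_data_50",
    "synthetic_data_25",
)
EDISUM_SPLITS = ("train", "validation", "test")


def check_request(config: str, split: str) -> None:
    """Fail early on a config or split that the card does not declare."""
    if config not in EDISUM_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {EDISUM_CONFIGS}")
    if split not in EDISUM_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {EDISUM_SPLITS}")


def load_edisum(config: str = "synthetic_data_100", split: str = "train"):
    """Hypothetical helper: load one Edisum config and split from the Hub."""
    check_request(config, split)
    # Imported lazily so the validation above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("msakota/edisum_dataset", config, split=split)
```

Calling `load_edisum("wikipedia_processed_data", "validation")` needs network access to the Hub, so it is not run here.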

Data files

wikipedia_processed_data
- train: "filtered-min30-enwiki-08-2023-data/train.csv"
- validation: "filtered-min30-enwiki-08-2023-data/val.csv"
- test: "filtered-min30-enwiki-08-2023-data/test.csv"
synthetic_data_100
- train: "100_perc_synth_data/train.csv"
- validation: "100_perc_synth_data/val.csv"
- test: "100_perc_synth_data/test.csv"
synthetic_data_75
- train: "75_perc_synth_data/train.csv"
- validation: "75_perc_synth_data/val.csv"
- test: "75_perc_synth_data/test.csv"
synthetic_data_50
- train: "50_perc_synth_data/train.csv"
- validation: "50_perc_synth_data/val.csv"
- test: "50_perc_synth_data/test.csv"
synthetic_data_25
- train: "25_perc_synth_data/train.csv"
- validation: "25_perc_synth_data/val.csv"
- test: "25_perc_synth_data/test.csv"
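
The file listing above follows a regular pattern: one directory per config, and the `validation` split is stored as `val.csv`. The sketch below encodes that mapping so a single CSV can be addressed directly. The names `CONFIG_DIRS`, `SPLIT_FILES` and `csv_path` are hypothetical; only the directory and file names come from the listing.

```python
# Directory holding each config's CSV files, copied from the file listing.
CONFIG_DIRS = {
    "wikipedia_processed_data": "filtered-min30-enwiki-08-2023-data",
    "synthetic_data_100": "100_perc_synth_data",
    "synthetic_data_75": "75_perc_synth_data",
    "synthetic_data_50": "50_perc_synth_data",
    "synthetic_data_25": "25_perc_synth_data",
}

# The "validation" split is stored as val.csv; the other splits keep their names.
SPLIT_FILES = {"train": "train.csv", "validation": "val.csv", "test": "test.csv"}


def csv_path(config: str, split: str) -> str:
    """Return the repository-relative path of one split's CSV file."""
    if config not in CONFIG_DIRS:
        raise KeyError(f"unknown config: {config!r}")
    if split not in SPLIT_FILES:
        raise KeyError(f"unknown split: {split!r}")
    return f"{CONFIG_DIRS[config]}/{SPLIT_FILES[split]}"


print(csv_path("synthetic_data_75", "validation"))
```

A path produced this way could, for instance, be passed as `filename` to `huggingface_hub.hf_hub_download` together with `repo_id="msakota/edisum_dataset"` and `repo_type="dataset"`, assuming the repository layout has not changed since the listing was captured.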

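Data fields

page_id: Unique identifier of the Wikipedia page on which the edit was performed.
revision_id: Unique identifier of the revision of the Wikipedia page tied to the edit.
summary: Edit summary associated with the edit.
prev_texts: List of sentences removed from the revision immediately before the edit.
cur_texts: List of sentences added to the revision immediately after the edit.
edit_types: Types of changes performed during the edit (existing data only).
user_count: Number of edits performed by the editor who made the edit (existing data only).
edit_count: Number of edits performed on that Wikipedia page so far (existing data only).
summary_length: Length of the edit summary in characters (existing data only).
summary_count: Number of matching summaries in the dataset (existing data only).
likely_canned: Whether the edit summary was likely generated with the canned edit summary tool (existing data only).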
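
Because six of the fields are documented only for the existing Wikipedia data, it can be useful to check which columns a given CSV actually carries before relying on them. The sketch below does this with the standard library only. The helper names are hypothetical, the header string in the example is illustrative, and the real column order of the files is not stated in the card.

```python
import csv
import io

# Fields documented for every row.
COMMON_FIELDS = ["page_id", "revision_id", "summary", "prev_texts", "cur_texts"]
# Fields documented only for existing (non-synthetic) Wikipedia data.
EXISTING_ONLY_FIELDS = [
    "edit_types", "user_count", "edit_count",
    "summary_length", "summary_count", "likely_canned",
]


def read_header(csv_text: str) -> list:
    """Return the column names from the first line of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def describe_header(header: list) -> dict:
    """Report missing common fields and which existing-only fields are present."""
    return {
        "missing_common": [f for f in COMMON_FIELDS if f not in header],
        "existing_only_present": [f for f in EXISTING_ONLY_FIELDS if f in header],
    }


# Illustrative header containing only the common fields.
header = read_header("page_id,revision_id,summary,prev_texts,cur_texts\n")
print(describe_header(header))
# → {'missing_common': [], 'existing_only_present': []}
```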

Language

English

License

MIT License

Citation information

@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}
