msakota/edisum_dataset

Source: Hugging Face. Updated 2024-04-05; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/msakota/edisum_dataset
Resource description:
---
license: mit
arXiv: https://arxiv.org/pdf/2404.03428.pdf
language:
- en
pretty_name: Edisum
configs:
- config_name: wikipedia_processed_data
  data_files:
  - split: train
    path: "filtered-min30-enwiki-08-2023-data/train.csv"
  - split: validation
    path: "filtered-min30-enwiki-08-2023-data/val.csv"
  - split: test
    path: "filtered-min30-enwiki-08-2023-data/test.csv"
- config_name: synthetic_data_100
  data_files:
  - split: train
    path: "100_perc_synth_data/train.csv"
  - split: validation
    path: "100_perc_synth_data/val.csv"
  - split: test
    path: "100_perc_synth_data/test.csv"
- config_name: synthetic_data_75
  data_files:
  - split: train
    path: "75_perc_synth_data/train.csv"
  - split: validation
    path: "75_perc_synth_data/val.csv"
  - split: test
    path: "75_perc_synth_data/test.csv"
- config_name: synthetic_data_50
  data_files:
  - split: train
    path: "50_perc_synth_data/train.csv"
  - split: validation
    path: "50_perc_synth_data/val.csv"
  - split: test
    path: "50_perc_synth_data/test.csv"
- config_name: synthetic_data_25
  data_files:
  - split: train
    path: "25_perc_synth_data/train.csv"
  - split: validation
    path: "25_perc_synth_data/val.csv"
  - split: test
    path: "25_perc_synth_data/test.csv"
---

# Dataset Card for Edisum

## Dataset Description

For more details:

- **GitHub repository**: https://github.com/epfl-dlab/edisum
- **Paper**: https://arxiv.org/pdf/2404.03428.pdf

### Languages

Edisum contains only data collected from English Wikipedia. Consequently, the synthetic data is also generated only in English.

### Dataset Structure

The Edisum meta-dataset comprises five datasets:

- **wikipedia_processed_data** (Filtered existing Wikipedia data)
- **synthetic_data_100** (Fully synthetic data)
- **synthetic_data_75** (Mixed dataset with 75% synthetic data)
- **synthetic_data_50** (Mixed dataset with 50% synthetic data)
- **synthetic_data_25** (Mixed dataset with 25% synthetic data)

### Data Fields

Each field is listed below with its description. Fields marked *(only existing data)* appear only in the existing Wikipedia data.

- `page_id`: A unique identifier of the Wikipedia page on which the edit was performed
- `revision_id`: A unique identifier of the revision of the Wikipedia page tied to the edit that was performed
- `summary`: Edit summary associated with the given edit
- `prev_texts`: List of sentences that were removed from the revision immediately before the edit
- `cur_texts`: List of sentences that were added to the revision immediately after the edit
- *(only existing data)* `edit_types`: Types of changes performed during the edit (e.g. sentence-level change or node-level change); used to choose candidates for synthetic data generation
- *(only existing data)* `user_count`: Number of edits performed by the editor who made the edit
- *(only existing data)* `edit_count`: Number of edits performed on that Wikipedia page so far
- *(only existing data)* `summary_length`: Length of the edit summary in characters
- *(only existing data)* `summary_count`: Number of matching summaries in the dataset
- *(only existing data)* `likely_canned`: Whether the edit summary was likely generated with the canned edit summary tool

### Licensing Information

The dataset is licensed under the terms of the MIT license.

### Citation Information

```
@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}
```
Provided by:
msakota
Summary of original information

Dataset overview

Dataset name

  • Edisum

Dataset structure

  • wikipedia_processed_data (Filtered existing Wikipedia data)
  • synthetic_data_100 (Fully synthetic data)
  • synthetic_data_75 (Mixed dataset with 75% synthetic data)
  • synthetic_data_50 (Mixed dataset with 50% synthetic data)
  • synthetic_data_25 (Mixed dataset with 25% synthetic data)

Data files

  • wikipedia_processed_data
    • train: "filtered-min30-enwiki-08-2023-data/train.csv"
    • validation: "filtered-min30-enwiki-08-2023-data/val.csv"
    • test: "filtered-min30-enwiki-08-2023-data/test.csv"
  • synthetic_data_100
    • train: "100_perc_synth_data/train.csv"
    • validation: "100_perc_synth_data/val.csv"
    • test: "100_perc_synth_data/test.csv"
  • synthetic_data_75
    • train: "75_perc_synth_data/train.csv"
    • validation: "75_perc_synth_data/val.csv"
    • test: "75_perc_synth_data/test.csv"
  • synthetic_data_50
    • train: "50_perc_synth_data/train.csv"
    • validation: "50_perc_synth_data/val.csv"
    • test: "50_perc_synth_data/test.csv"
  • synthetic_data_25
    • train: "25_perc_synth_data/train.csv"
    • validation: "25_perc_synth_data/val.csv"
    • test: "25_perc_synth_data/test.csv"
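
The file listing above maps each configuration to a directory of three CSV splits. The following is a minimal sketch of how that mapping might be used, assuming the Hugging Face `datasets` package is installed and the Hub is reachable. The configuration name `wikipedia_processed_data` follows the card's `configs` metadata; the helper names `CONFIG_DIRS`, `SPLIT_FILES`, `split_path`, and `load_edisum` are illustrative and not part of the dataset.

```python
# Directory of each configuration, copied from the file listing above.
CONFIG_DIRS = {
    "wikipedia_processed_data": "filtered-min30-enwiki-08-2023-data",
    "synthetic_data_100": "100_perc_synth_data",
    "synthetic_data_75": "75_perc_synth_data",
    "synthetic_data_50": "50_perc_synth_data",
    "synthetic_data_25": "25_perc_synth_data",
}

# Split name -> CSV file name inside a configuration directory.
SPLIT_FILES = {"train": "train.csv", "validation": "val.csv", "test": "test.csv"}


def split_path(config: str, split: str) -> str:
    """Return the repository-relative CSV path for a configuration and split."""
    return f"{CONFIG_DIRS[config]}/{SPLIT_FILES[split]}"


def load_edisum(config: str, split: str = "train"):
    """Load one split from the Hub (needs the `datasets` package and network access)."""
    if config not in CONFIG_DIRS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the path helpers above also work offline.
    from datasets import load_dataset

    return load_dataset("msakota/edisum_dataset", config, split=split)
```

For example, `load_edisum("synthetic_data_50", "validation")` would request the validation split of the half-synthetic mix, while `split_path` only resolves the file name and never touches the network.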

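Data fields

  • page_id: Unique identifier of the Wikipedia page on which the edit was performed.
  • revision_id: Unique identifier of the revision of the Wikipedia page tied to the edit.
  • summary: Edit summary associated with the edit.
  • prev_texts: List of sentences removed from the revision immediately before the edit.
  • cur_texts: List of sentences added to the revision immediately after the edit.
  • edit_types: Types of changes performed during the edit (existing data only).
  • user_count: Number of edits performed by the editor who made the edit (existing data only).
  • edit_count: Number of edits performed on that Wikipedia page so far (existing data only).
  • summary_length: Length of the edit summary in characters (existing data only).
  • summary_count: Number of matching summaries in the dataset (existing data only).
  • likely_canned: Whether the edit summary was likely generated with the canned edit summary tool (existing data only).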
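
The field list above distinguishes five fields shared by every configuration from six that appear only in the existing Wikipedia data. A small sketch of how a loaded row might be checked against that split is given below. The helper names `parse_text_list` and `missing_fields` are hypothetical, and the parsing step rests on an assumption the card does not state: that `prev_texts` and `cur_texts` are stored in the CSV files as Python-style list literals.

```python
import ast

# Fields present in every configuration, per the list above.
CORE_FIELDS = ["page_id", "revision_id", "summary", "prev_texts", "cur_texts"]

# Fields marked as present only in the existing Wikipedia data.
EXISTING_ONLY_FIELDS = [
    "edit_types", "user_count", "edit_count",
    "summary_length", "summary_count", "likely_canned",
]


def parse_text_list(cell):
    """Turn a CSV cell into a list of sentences.

    ASSUMPTION: list columns are serialized as Python-style list literals,
    e.g. "['First sentence.', 'Second sentence.']". Cells that do not parse
    as a list are returned as a one-element list.
    """
    if isinstance(cell, list):
        return cell
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]
    return list(value) if isinstance(value, (list, tuple)) else [cell]


def missing_fields(row, existing_data):
    """List the expected field names that are absent from a row (a dict)."""
    expected = CORE_FIELDS + (EXISTING_ONLY_FIELDS if existing_data else [])
    return [name for name in expected if name not in row]
```

Calling `missing_fields(row, existing_data=False)` on a row from a synthetic configuration checks only the five shared fields; passing `existing_data=True` also requires the six metadata fields.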

Language

  • English

License

  • MIT license

Citation information

@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}
