roemmele/ablit

Name: roemmele/ablit
Creator: roemmele
Published: 2023-05-08 16:26:23
License: cc-by-sa-4.0

Hugging Face, updated 2023-05-08, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/roemmele/ablit


Resource description:

---
license: cc-by-sa-4.0
task_categories:
- text-generation
- text2text-generation
- summarization
language:
- en
---

# Dataset Card for AbLit

## Dataset Description

- **Homepage:** https://github.com/roemmele/AbLit
- **Repository:** https://github.com/roemmele/AbLit
- **Paper:** https://arxiv.org/pdf/2302.06579.pdf
- **Point of Contact:** melissa@roemmele.io

### Dataset Summary

The AbLit dataset contains **ab**ridged versions of 10 classic English **lit**erature books, aligned with their original versions on various passage levels. The abridgements were written and made publicly available by Emma Laybourn [here](http://www.englishliteratureebooks.com/classicnovelsabridged.html). This is the first known dataset for NLP research that focuses on the abridgement task. See the paper for a detailed description of the dataset, as well as the results of several modeling experiments. The GitHub repo also provides more extensive ways to interact with the data beyond what is provided here.

### Languages

English

## Dataset Structure

Each passage in the original version of a book chapter is aligned with its corresponding passage in the abridged version. These aligned pairs are available for various passage sizes: sentences, paragraphs, and multi-paragraph "chunks". The passage size is specified when loading the dataset. There are train/dev/test splits for items of each size.

| Passage Size | Description | # Train | # Dev | # Test |
| --------------------- | ------------- | ------- | ------- | ------- |
| chapters | Each passage is a single chapter | 808 | 10 | 50 |
| sentences | Each passage is a sentence delimited by the NLTK sentence tokenizer | 122,219 | 1,143 | 10,431 |
| paragraphs | Each passage is a paragraph delimited by a line break | 37,227 | 313 | 3,125 |
| chunks-10-sentences | Each passage consists of up to X=10 sentences, which may span more than one paragraph. To derive chunks with other lengths X, see the GitHub repo above | 14,857 | 141 | 1,264 |

#### Example Usage

To load aligned paragraphs:

```
from datasets import load_dataset
data = load_dataset("roemmele/ablit", "paragraphs")
```

### Data Fields

- original: passage text in the original version
- abridged: passage text in the abridged version
- book: title of book containing passage
- chapter: title of chapter containing passage

## Dataset Creation

### Curation Rationale

Abridgement is the task of making a text easier to understand while preserving its linguistic qualities. Abridgements are different from typical summaries: whereas summaries abstractively describe the original text, abridgements simplify the original primarily through a process of extraction. We present this dataset to promote further research on modeling the abridgement process.

### Source Data

The author Emma Laybourn wrote abridged versions of classic English literature books available through Project Gutenberg. She has also provided her abridgements for free on her [website](http://www.englishliteratureebooks.com/classicnovelsabridged.html). This is how she describes her work: “This is a collection of famous novels which have been shortened and slightly simplified for the general reader. These are not summaries; each is half to two-thirds of the original length. I’ve selected works that people often find daunting because of their density or complexity: the aim is to make them easier to read, while keeping the style intact.”

#### Initial Data Collection and Normalization

We obtained the original and abridged versions of the books from the respective websites.

#### Who are the source language producers?

Emma Laybourn

### Annotations

#### Annotation process

We designed a procedure for automatically aligning passages between the original and abridged version of each chapter. We conducted a human evaluation to verify these alignments had high accuracy. The training split of the dataset has ~99% accuracy. The dev and test splits of the dataset were fully human-validated to ensure 100% accuracy. See the paper for further explanation.

#### Who are the annotators?

The alignment accuracy evaluation was conducted by the authors of the paper, who have expertise in linguistics and NLP.

### Personal and Sensitive Information

None

## Considerations for Using the Data

### Social Impact of Dataset

We hope this dataset will promote more research on the authoring process for producing abridgements, including models for automatically generating abridgements. Because it is a labor-intensive writing task, there are relatively few abridged versions of books. Systems that automatically produce abridgements could vastly expand the number of abridged versions of books and thus increase their readership.

### Discussion of Biases

We present this dataset to introduce abridgement as an NLP task, but these abridgements are scoped to one small set of texts associated with a specific domain and author. There are significant practical reasons for this limited scope. In particular, in contrast to the books in AbLit, most recently published books are not included in publicly accessible datasets due to copyright restrictions, and the same restrictions typically apply to any abridgements of these books. For this reason, AbLit consists of British English literature from the 18th and 19th centuries. Some of the linguistic properties of these original books do not generalize to other types of English texts that would be beneficial to abridge. Moreover, the narrow cultural perspective reflected in these books is certainly not representative of the diverse modern population. Readers may find some content offensive.

### Dataset Curators

The curators are the authors of the paper.

### Licensing Information

cc-by-sa-4.0

### Citation Information

Roemmele, Melissa, Kyle Shaffer, Katrina Olsen, Yiyi Wang, and Steve DeNeefe. "AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature." Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2023).

Provider:

roemmele

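Original information summary

Dataset overview: AbLit

Dataset description

Dataset name: AbLit
Dataset summary: AbLit contains abridged versions of 10 classic English literature books, aligned with their original versions at several passage levels. The abridgements were written and made publicly available by Emma Laybourn. It is the first known dataset for NLP research that focuses on the abridgement task.
Language: English

Dataset structure

Data organization: Each passage in the original version of a book chapter is aligned with its corresponding passage in the abridged version. Aligned pairs are available as sentences, paragraphs, and multi-paragraph "chunks". The passage size is chosen when loading the dataset, and train/dev/test splits are provided for each size.

Passage size	Description	# Train	# Dev	# Test
chapters	Each passage is a single chapter	808	10	50
sentences	Each passage is a sentence delimited by the NLTK sentence tokenizer	122,219	1,143	10,431
paragraphs	Each passage is a paragraph delimited by a line break	37,227	313	3,125
chunks-10-sentences	Each passage consists of up to 10 sentences	14,857	141	1,264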
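The "up to X sentences" chunk size in the table can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the procedure used by the AbLit authors: the function name `chunk_sentences` is hypothetical, the sentences are placeholders, and the real pipeline in the GitHub repo also has to keep the original and abridged sides aligned, which this sketch does not attempt.

```python
from typing import List


def chunk_sentences(sentences: List[str], max_sentences: int = 10) -> List[List[str]]:
    """Greedily group consecutive sentences into chunks of at most max_sentences."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    return [
        sentences[i:i + max_sentences]
        for i in range(0, len(sentences), max_sentences)
    ]


# Hypothetical chapter of 23 placeholder sentences, assumed to be already
# split by a sentence tokenizer.
sentences = [f"Sentence {n}." for n in range(1, 24)]
chunks = chunk_sentences(sentences, max_sentences=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Because the grouping ignores paragraph breaks, a chunk can span more than one paragraph, which matches the description of the chunk passage size.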

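Data fields

original: passage text in the original version
abridged: passage text in the abridged version
book: title of the book containing the passage
chapter: title of the chapter containing the passage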
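As a rough illustration of how the four fields can be used together, the sketch below computes the abridged-to-original word-count ratio per book. The helper name `length_ratio_by_book` and the sample rows are hypothetical and invented for illustration; only the field names (`original`, `abridged`, `book`, `chapter`) come from the dataset card.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def length_ratio_by_book(rows: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the abridged/original word-count ratio for each book."""
    original_words = defaultdict(int)
    abridged_words = defaultdict(int)
    for row in rows:
        original_words[row["book"]] += len(row["original"].split())
        abridged_words[row["book"]] += len(row["abridged"].split())
    return {
        book: abridged_words[book] / original_words[book]
        for book in original_words
        if original_words[book] > 0
    }


# Invented rows that only mimic the dataset's field layout.
sample_rows = [
    {"original": "the quick brown fox jumps over the lazy sleeping dog",
     "abridged": "the fox jumps over the dog",
     "book": "Book A", "chapter": "Chapter 1"},
    {"original": "it rained all day and all night",
     "abridged": "it rained all day",
     "book": "Book A", "chapter": "Chapter 2"},
    {"original": "she walked slowly home",
     "abridged": "she walked home",
     "book": "Book B", "chapter": "Chapter 1"},
]

ratios = length_ratio_by_book(sample_rows)
print(ratios)
```

Whitespace splitting is a deliberately crude word count; a proper tokenizer would give slightly different ratios.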

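Dataset creation

Curation rationale: Abridgement is the task of making a text easier to understand while preserving its linguistic qualities. Unlike summaries, abridgements simplify the original primarily through extraction. The dataset is intended to promote research on modeling the abridgement process.
Source data: Emma Laybourn's abridged versions of classic English literature books. The original books are available through Project Gutenberg, and the abridgements are provided for free on her website.
Annotation process: The authors designed a procedure for automatically aligning passages between the original and abridged version of each chapter, and verified through human evaluation that the alignments are highly accurate.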
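The claim that abridgements work mainly by extraction can be probed with a simple token-overlap measure: the share of abridged tokens that also occur in the original passage. The sketch below is an illustrative metric under stated assumptions, not the analysis from the paper; `tokenize`, `extractive_fraction`, and the example sentences are all hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extractive_fraction(original: str, abridged: str) -> float:
    """Fraction of abridged tokens that also appear in the original.

    Counts are clipped per token, so a word used twice in the abridgement
    but once in the original is only credited once.
    """
    original_counts = Counter(tokenize(original))
    abridged_counts = Counter(tokenize(abridged))
    total = sum(abridged_counts.values())
    if total == 0:
        return 0.0
    overlap = sum((original_counts & abridged_counts).values())
    return overlap / total


# Invented sentence pair for illustration only.
original = "The old house stood silent upon the hill, dark and forbidding."
abridged = "The old house stood quietly on the hill."
fraction = extractive_fraction(original, abridged)
print(fraction)
```

A value near 1.0 means the abridged passage reuses the original wording almost entirely; lower values indicate more rewriting. The measure ignores word order, so it is only a coarse proxy for extraction.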

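Considerations for using the data

Social impact: The authors hope the dataset will promote research on automatically generating abridgements, which could increase the number of abridged books and widen their readership.
Discussion of biases: The dataset is limited to one small set of texts and a single abridging author. It reflects British English literature from the 18th and 19th centuries and may not be broadly representative.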

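Collected summary

Dataset introduction

Construction method

AbLit is built from Emma Laybourn's abridged versions of ten classic English literature books, whose originals are in the public domain and available through Project Gutenberg. The researchers automatically aligned the original and abridged texts at several granularities (sentences, paragraphs, chapters, and chunks of configurable size) and verified the alignments through human evaluation. Alignment accuracy is about 99% on the training split, while the dev and test splits were fully human-validated to ensure 100% accuracy, giving NLP research a high-quality parallel corpus.

Features

As the first resource dedicated to the abridgement task, AbLit is distinguished by its multi-level alignment structure. Besides sentence-level and paragraph-level alignments, it provides chunks that can span paragraphs, for example chunks of up to ten sentences, which makes it possible to study abridgement patterns over different linguistic units. All data comes from British literature of the 18th and 19th centuries and therefore reflects a specific historical language style and cultural perspective. This scope limits the domain, but it also offers a useful sample for analyzing how the language of that period is simplified and preserved.

Usage

For research on natural language generation and summarization, AbLit can be used directly to model the abridgement process. Data at each granularity can be loaded with the Hugging Face datasets library; for example, passing the "paragraphs" configuration returns paragraph-aligned text pairs. The fields are the original text, the abridged text, the book title, and the chapter title, which supports supervised training of sequence-to-sequence models. Researchers can use the dataset to build automatic abridgement systems or to analyze the simplification strategies used in abridgement. Given the historical and cultural background of the texts, users should keep the period limitations in mind and evaluate generalization on a broader range of text types.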
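One possible way to turn aligned rows into supervised sequence-to-sequence examples is sketched below. It assumes rows are dictionaries with the `original` and `abridged` fields from the dataset card; the task prefix `"abridge: "`, the helper names, and the sample row are hypothetical choices for illustration, not something the dataset or paper prescribes.

```python
from typing import Dict, Iterable, List, Mapping

# Hypothetical task prefix, in the style used by some text-to-text models.
TASK_PREFIX = "abridge: "


def to_seq2seq_pair(row: Mapping[str, str], prefix: str = TASK_PREFIX) -> Dict[str, str]:
    """Map one aligned row to a source/target pair for seq2seq training."""
    return {
        "source": prefix + row["original"].strip(),
        "target": row["abridged"].strip(),
    }


def build_pairs(rows: Iterable[Mapping[str, str]]) -> List[Dict[str, str]]:
    """Convert an iterable of aligned rows into a list of training pairs."""
    return [to_seq2seq_pair(row) for row in rows]


# Invented row that only mimics the dataset's field layout.
sample_rows = [
    {"original": "  She walked slowly home through the rain. ",
     "abridged": "She walked home through the rain.",
     "book": "Book A", "chapter": "Chapter 1"},
]

pairs = build_pairs(sample_rows)
print(pairs[0]["source"])
print(pairs[0]["target"])
```

The same function can be applied to the rows of any split returned by `load_dataset("roemmele/ablit", "paragraphs")`, since each row exposes the same field names.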

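Background and challenges

Background overview

AbLit was created by Melissa Roemmele and colleagues in 2023 as the first NLP research resource focused on the abridgement task. It is derived from Emma Laybourn's abridged versions of ten classic English literature books, aligned with the original texts at the sentence, paragraph, and chapter levels, among others. Its central research question is how to simplify a text mainly through extraction while preserving the literary style and linguistic qualities of the original, laying groundwork for models that generate abridgements automatically and supporting research on readability and wider access to literature.

Current challenges

The challenges facing AbLit concern both the task and the construction of the dataset. On the task side, abridgement has to balance simplification against style preservation, which requires deep semantic understanding and generation ability, and current techniques remain limited when handling literary language. On the construction side, copyright restrictions confine the dataset to British literature of the 18th and 19th centuries, so its linguistic features and cultural perspective lack diversity and do not readily generalize to modern text. In addition, verifying the accuracy of the automatic alignment depends on human evaluation; although the existing data was carefully validated, scaling up raises efficiency and consistency problems.

Common scenarios

Typical use cases

Text simplification and summarization are core NLP tasks for improving readability. By providing aligned original and abridged versions of classic English literature, AbLit is the first dedicated resource for studying the abridgement process. The aligned text is organized at several granularities, including sentences, paragraphs, and chapters, so researchers can examine the balance between compressing text and preserving its style, and can train and evaluate automatic abridgement models.

Derived and related work

Several lines of work build on AbLit. The original paper, for example, ran experiments with a number of sequence-to-sequence models and explored Transformer-based methods for generating abridgements. Follow-up research may combine style transfer and controllable generation to improve the coherence and stylistic consistency of abridged text. The dataset has also prompted extensions of the abridgement task to other domains and encouraged new models for text simplification and readability assessment.

Recent research on the dataset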