coref-data/litbank_raw

Name: coref-data/litbank_raw
Creator: coref-data
Published: 2024-01-21 03:21:59
License: No description available

Source: Hugging Face; updated 2024-01-21; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/coref-data/litbank_raw

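Overview:

LitBank is a dataset of literary texts with 10 configurations, each divided into training, validation, and test sets. Its features are coreference chains, document name, entities, events, meta information, original text, quotes, and sentences. The dataset supports research on natural language processing tasks such as literary entity recognition, event detection, and coreference resolution.

Provider: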

coref-data

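Summary of original information

LitBank dataset overview

Dataset structure

The LitBank dataset has ten configurations, each named split_X, where X ranges from 0 to 9. Each configuration contains the following data files:

train file path: split_X/train-*
validation file path: split_X/validation-*
test file path: split_X/test-*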
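
The naming scheme above can be enumerated programmatically. The following is a minimal sketch that only rebuilds the configuration names and glob patterns listed here; the helper name `data_file_patterns` is hypothetical and the code does not touch the network or the dataset itself.

```python
from typing import Dict

# Configuration and split names as listed above.
CONFIG_NAMES = [f"split_{i}" for i in range(10)]
SPLITS = ("train", "validation", "test")


def data_file_patterns(config_name: str) -> Dict[str, str]:
    """Return the glob pattern of each split for one configuration."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown configuration: {config_name}")
    return {split: f"{config_name}/{split}-*" for split in SPLITS}


print(data_file_patterns("split_3"))
```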

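Data features

The dataset contains the following features:

coref_chains: a list of coreference chains; each chain contains several mentions, and each mention has a sentence index, a start position, and an end position.
doc_name: the document name.
entities: a list of entities; each entity holds BIO tags and the corresponding token.
events: a list of events; each event holds a flag indicating whether the token is an event, together with the corresponding token.
meta_info: meta information, including author, date, Project Gutenberg ID, and title.
original_text: the original text.
quotes: a list of quotes; each quote holds its attribution, start and end positions, the quoted text, and a quote ID.
sentences: a list of sentences; each sentence contains a list of tokens.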
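
To make the relationship between sentences and coref_chains concrete, here is a hedged sketch that resolves mentions back to token text. It assumes each sentence is a plain list of token strings, each mention is a (sentence index, start, end) triple, and the end position is inclusive; the actual nested field layout and the end convention are not stated in this page and should be checked against a loaded record. The toy sentences are invented for illustration and are not taken from LitBank.

```python
from typing import List, Tuple

# Assumed mention layout: (sentence index, start token, end token).
Mention = Tuple[int, int, int]


def mention_text(sentences: List[List[str]], mention: Mention) -> str:
    """Join the tokens covered by one mention."""
    sent_idx, start, end = mention
    # Assumption: `end` is inclusive. Drop the +1 if the data uses exclusive ends.
    return " ".join(sentences[sent_idx][start:end + 1])


def chain_texts(sentences: List[List[str]], chain: List[Mention]) -> List[str]:
    """Return the surface text of every mention in one coreference chain."""
    return [mention_text(sentences, m) for m in chain]


# Invented toy data, not real LitBank annotations.
sentences = [["Emma", "smiled", "."], ["She", "saw", "the", "old", "house", "."]]
coref_chains = [[(0, 0, 0), (1, 0, 0)], [(1, 2, 4)]]

for chain in coref_chains:
    print(chain_texts(sentences, chain))
```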

Citation information

Dataset citation

@inproceedings{bamman-etal-2019-annotated,
    title = "An annotated dataset of literary entities",
    author = "Bamman, David and Popat, Sejal and Shen, Sheng",
    editor = "Burstein, Jill and Doran, Christy and Solorio, Thamar",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1220",
    doi = "10.18653/v1/N19-1220",
    pages = "2138--2144",
    abstract = "We present a new dataset comprised of 210,532 tokens evenly drawn from 100 different English-language literary texts annotated for ACE entity categories (person, location, geo-political entity, facility, organization, and vehicle). These categories include non-named entities (such as {``}the boy{''}, {``}the kitchen{''}) and nested structure (such as [[the cook]{'}s sister]). In contrast to existing datasets built primarily on news (focused on geo-political entities and organizations), literary texts offer strikingly different distributions of entity categories, with much stronger emphasis on people and description of settings. We present empirical results demonstrating the performance of nested entity recognition models in this domain; training natively on in-domain literary data yields an improvement of over 20 absolute points in F-score (from 45.7 to 68.3), and mitigates a disparate impact in performance for male and female entities present in models trained on news data.",
}

Event detection citation

@inproceedings{sims-etal-2019-literary,
    title = "Literary Event Detection",
    author = "Sims, Matthew and Park, Jong Ho and Bamman, David",
    editor = "Korhonen, Anna and Traum, David and M{\`a}rquez, Llu{\'\i}s",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1353",
    doi = "10.18653/v1/P19-1353",
    pages = "3623--3634",
    abstract = "In this work we present a new dataset of literary events{---}events that are depicted as taking place within the imagined space of a novel. While previous work has focused on event detection in the domain of contemporary news, literature poses a number of complications for existing systems, including complex narration, the depiction of a broad array of mental states, and a strong emphasis on figurative language. We outline the annotation decisions of this new dataset and compare several models for predicting events; the best performing model, a bidirectional LSTM with BERT token representations, achieves an F1 score of 73.9. We then apply this model to a corpus of novels split across two dimensions{---}prestige and popularity{---}and demonstrate that there are statistically significant differences in the distribution of events for prestige.",
}

Coreference resolution citation

@inproceedings{bamman-etal-2020-annotated,
    title = "An Annotated Dataset of Coreference in {E}nglish Literature",
    author = "Bamman, David and Lewke, Olivia and Mansoor, Anya",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.6",
    pages = "44--54",
    abstract = "We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction published between 1719 and 1922. This dataset differs from previous coreference corpora in containing documents whose average length (2,105.3 words) is four times longer than other benchmark datasets (463.7 for OntoNotes), and contains examples of difficult coreference problems common in literature. This dataset allows for an evaluation of cross-domain performance for the task of coreference resolution, and analysis into the characteristics of long-distance within-document coreference.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

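Collected summary

Dataset introduction

Construction

The coref-data/litbank_raw dataset sits at the intersection of natural language processing and literary text analysis. It derives from the LitBank project, which selected 100 English-language literary works published between 1719 and 1922, totaling 210,532 tokens. Its annotation scheme covers entity recognition, event detection, and coreference resolution. The raw data was obtained from the project's GitHub repository and then divided into 10 cross-validation subsets (split_0 through split_9), each with training, validation, and test partitions, so that the splits are standardized and reproducible. The annotations are stored as structured features, including coreference chains, entity tags, event flags, and quote structure, which supports detailed analysis of complex linguistic phenomena in literary text.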

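Usage

Researchers can load the dataset with the Hugging Face Datasets library, passing a configuration name ('split_0' through 'split_9') to choose a cross-validation subset and 'train', 'validation', or 'test' to choose the split. Each sample contains fields such as 'doc_name', 'sentences', 'coref_chains', 'entities', 'events', and 'quotes', and can be used directly to train and evaluate sequence labeling, coreference resolution, or event detection models. To reproduce the experiments of the original papers, follow the model settings of Bamman et al. (2019, 2020) and fine-tune a pretrained language model such as BERT. The dataset's CC-BY-4.0 license permits broad academic use, provided that publications cite the relevant papers to credit the original work.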
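
The loading step described above might look like the following sketch. It assumes the `datasets` package is installed and that the repository id matches the name given on this page; `config_name` and `load_fold` are hypothetical helper names introduced only for illustration. The import is deferred so the name helper can be used without the library or network access.

```python
def config_name(fold: int) -> str:
    """Map a fold number (0 to 9) to its configuration name."""
    if not 0 <= fold <= 9:
        raise ValueError("fold must be between 0 and 9")
    return f"split_{fold}"


def load_fold(fold: int, split: str = "train"):
    """Load one split of one cross-validation fold from the Hugging Face Hub."""
    # Imported lazily; requires `pip install datasets` and network access.
    from datasets import load_dataset

    return load_dataset("coref-data/litbank_raw", config_name(fold), split=split)


# Example call (downloads data, so it is left commented out):
# train = load_fold(0, "train")
# print(train[0]["doc_name"])
print(config_name(0))
```

Passing `split=` returns a single split; omitting it in `load_dataset` returns a dictionary keyed by split name instead.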

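Background and challenges

Background

LitBank was created between 2019 and 2020 by David Bamman of the University of California, Berkeley, and his collaborators, and focuses on fine-grained semantic annotation of English literary text. It draws 210,532 tokens evenly from 100 classic works of fiction from the 18th to the early 20th century and covers three core tasks: entity recognition (people, locations, organizations, and so on), event detection, and coreference resolution. Unlike traditional datasets built mainly on news text, such as OntoNotes, LitBank targets phenomena characteristic of literature, including non-named entities, nested entity structure, long-distance coreference in long narratives, and frequent depiction of mental states and figurative language. The work aims to move natural language processing models into the literary domain. It shows that models trained on news data perform unevenly on male and female entities in literary text, and that in-domain training improves F-score by more than 20 absolute points. The dataset has become an important benchmark for computational literary analysis and the digital humanities.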

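Current challenges

The domain challenges LitBank addresses center on semantic understanding of literary text: (1) literary language is highly complex, with nested entities (such as "[[the cook]'s sister]"), long-distance coreference across paragraphs, and non-literal figurative expressions, all of which are hard for sequence labeling models trained on news text; (2) event detection must separate events that actually take place in the narrative from characters' mental states, indirect speech, and other abstract statements, and traditional event extraction methods adapt poorly to the narrative variety of literary genres. Building the dataset posed further challenges: (3) texts from 100 works of different periods and styles had to follow one annotation scheme, yet differences in authors' language made annotation consistency difficult; (4) annotating coreference chains in long documents (more than 2,000 words on average) requires annotators to trace mentions sentence by sentence, which is slow and error-prone. In the end 29,103 mentions were annotated, which illustrates the high cost of fine-grained annotation of literary text.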

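Common scenarios

Classic use cases

The coref-data/litbank_raw dataset is a standard benchmark for entity recognition and coreference resolution on literary text. It covers 100 English works of fiction published between 1719 and 1922, with fine-grained annotation of more than 210,000 tokens. The annotation includes the ACE entity categories (person, location, geo-political entity, facility, organization, and vehicle) as well as nested structure and non-named entities. Its classic use is training and evaluating nested entity recognition models for literary text. Compared with news text, literary text places far more emphasis on people and descriptions of settings, and the dataset provides gold-standard annotation for capturing this difference in entity distribution and for studying cross-domain entity recognition.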
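
Because the entities feature stores BIO tags per token, a model's input or output for one annotation layer can be converted to spans with a small decoder. The sketch below is a generic BIO decoder, not code from the LitBank authors. It assumes tags of the form `B-PER`, `I-PER`, and `O`, and it represents nesting as separate tag layers, which is an assumption about the storage format. The tokenization of the nested example and the two tag layers are invented for illustration.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode one layer of BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag with no open span is treated as a span start.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Invented tokenization and tag layers for the nested example "[[the cook]'s sister]".
tokens = ["the", "cook", "'s", "sister", "left", "."]
outer = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "O"]
inner = ["B-PER", "I-PER", "O", "O", "O", "O"]

for layer in (outer, inner):
    for start, end, label in bio_to_spans(layer):
        print(label, " ".join(tokens[start:end]))
```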

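Academic problems addressed

The dataset addresses three core research problems in natural language processing for literary text: disambiguating entities in narrative context, the marked drop in performance when models are applied across domains, and modeling long-distance coreference. By providing annotated literary text rich in metaphor, mental states, and complex narration, it exposes how poorly models trained on news text transfer to literature (an F-score of 45.7, compared with 68.3 when training on in-domain literary data) and confirms that in-domain training data mitigates the performance gap between male and female entities. In addition, its long documents, averaging more than 2,000 words, open new directions for research on tracking coreference chains and relating events across long stretches of text, with influence on computational narratology and stylistics.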

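Practical applications

In practice, the dataset supports several lines of digital humanities research. Literary scholars can use models trained on it to automatically extract character networks, event sequences, and changes of setting from novels, and so analyze quantitatively how narrative structure changes over time. In publishing and recommender systems, coreference resolution can improve the precision of semantic search over literary text, for example by tracing a character's quotes across chapters. In educational technology, entity recognition tools built on the dataset can help students work through the complex character relationships and metaphorical references of classic literature, lowering the barrier to understanding the text. In cultural heritage digitization projects, automated annotation pipelines driven by this data have been used to build entity indexes for large collections of historical documents, improving retrieval efficiency in humanities databases.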

Recent research on the dataset