coref-data/litbank_raw
LitBank Dataset Overview
Dataset Structure
The LitBank dataset provides ten configurations named split_X, where X ranges from 0 to 9. Each configuration contains the following data files:
- train: file path split_X/train-*
- validation: file path split_X/validation-*
- test: file path split_X/test-*
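The naming scheme above can be sketched in a few lines of Python. The helper names `config_names`, `data_files`, and `load_split` are hypothetical and introduced only for illustration. `load_split` assumes the Hugging Face `datasets` package is installed and that the repository id on the first line of this card is the one to load; it is not called here because it needs network access.

```python
def config_names(n_splits: int = 10) -> list[str]:
    """Return the configuration names split_0 .. split_9 described above."""
    return [f"split_{i}" for i in range(n_splits)]


def data_files(config: str) -> dict[str, str]:
    """Map each partition to its file pattern inside one configuration."""
    return {part: f"{config}/{part}-*" for part in ("train", "validation", "test")}


def load_split(config: str = "split_0"):
    """Load one configuration from the Hub (assumption: `datasets` is installed)."""
    from datasets import load_dataset  # imported lazily so the helpers above run without it

    return load_dataset("coref-data/litbank_raw", config)


if __name__ == "__main__":
    for name in config_names():
        print(name, data_files(name))
```

Calling `load_split("split_3")` would, under these assumptions, return an object with `train`, `validation`, and `test` partitions for that configuration.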
Data Features
The dataset contains the following features:
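- coref_chains: a list of coreference chains; each chain contains multiple mentions, and each mention consists of a sentence index, a start position, and an end position.
- doc_name: the document name.
- entities: a list of entities; each entry contains BIO tags and the corresponding token.
- events: a list of events; each entry contains a flag indicating whether the token is an event, together with the corresponding token.
- meta_info: metadata, including author, date, Project Gutenberg ID, and title.
- original_text: the original text.
- quotes: a list of quotations; each entry contains the attribution, start and end positions, the quoted text, and a quote ID.
- sentences: a list of sentences; each sentence contains a list of tokens.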
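To make the relationship between `coref_chains` and `sentences` concrete, here is a minimal sketch that turns each mention back into text. It runs on a hand-made toy record, not on real LitBank data. Several details are assumptions that should be checked against an actual record: that a mention is a `[sentence_index, start, end]` triple, that `end` is inclusive, and that each sentence is a dict with a `tokens` key. The function names are hypothetical.

```python
def mention_text(record, mention, inclusive_end=True):
    """Return the surface string of one mention.

    Assumption: a mention is a [sentence_index, start, end] triple of token offsets.
    """
    sent_idx, start, end = mention
    tokens = record["sentences"][sent_idx]["tokens"]  # assumed key name
    stop = end + 1 if inclusive_end else end
    return " ".join(tokens[start:stop])


def chains_as_text(record, inclusive_end=True):
    """Resolve every mention of every coreference chain to its tokens."""
    return [
        [mention_text(record, m, inclusive_end) for m in chain]
        for chain in record["coref_chains"]
    ]


# Toy record built by hand for illustration only.
toy = {
    "sentences": [
        {"tokens": ["The", "cook", "smiled", "."]},
        {"tokens": ["She", "left", "the", "kitchen", "."]},
    ],
    "coref_chains": [
        [[0, 0, 1], [1, 0, 0]],  # "The cook" ... "She"
        [[1, 2, 3]],             # "the kitchen"
    ],
}

print(chains_as_text(toy))  # [['The cook', 'She'], ['the kitchen']]
```

If the real data uses exclusive end offsets, passing `inclusive_end=False` adjusts the slice without other changes.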
Citation Information
Dataset citation
@inproceedings{bamman-etal-2019-annotated,
title = "An annotated dataset of literary entities",
author = "Bamman, David and
Popat, Sejal and
Shen, Sheng",
editor = "Burstein, Jill and
Doran, Christy and
Solorio, Thamar",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N19-1220",
doi = "10.18653/v1/N19-1220",
pages = "2138--2144",
abstract = "We present a new dataset comprised of 210,532 tokens evenly drawn from 100 different English-language literary texts annotated for ACE entity categories (person, location, geo-political entity, facility, organization, and vehicle). These categories include non-named entities (such as {``}the boy{''}, {``}the kitchen{''}) and nested structure (such as [[the cook]{'}s sister]). In contrast to existing datasets built primarily on news (focused on geo-political entities and organizations), literary texts offer strikingly different distributions of entity categories, with much stronger emphasis on people and description of settings. We present empirical results demonstrating the performance of nested entity recognition models in this domain; training natively on in-domain literary data yields an improvement of over 20 absolute points in F-score (from 45.7 to 68.3), and mitigates a disparate impact in performance for male and female entities present in models trained on news data.",
}
Event detection citation
@inproceedings{sims-etal-2019-literary,
title = "Literary Event Detection",
author = "Sims, Matthew and
Park, Jong Ho and
Bamman, David",
editor = "Korhonen, Anna and
Traum, David and
M{\`a}rquez, Llu{\'\i}s",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1353",
doi = "10.18653/v1/P19-1353",
pages = "3623--3634",
abstract = "In this work we present a new dataset of literary events{---}events that are depicted as taking place within the imagined space of a novel. While previous work has focused on event detection in the domain of contemporary news, literature poses a number of complications for existing systems, including complex narration, the depiction of a broad array of mental states, and a strong emphasis on figurative language. We outline the annotation decisions of this new dataset and compare several models for predicting events; the best performing model, a bidirectional LSTM with BERT token representations, achieves an F1 score of 73.9. We then apply this model to a corpus of novels split across two dimensions{---}prestige and popularity{---}and demonstrate that there are statistically significant differences in the distribution of events for prestige.",
}
Coreference resolution citation
@inproceedings{bamman-etal-2020-annotated,
title = "An Annotated Dataset of Coreference in {E}nglish Literature",
author = "Bamman, David and
Lewke, Olivia and
Mansoor, Anya",
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.6",
pages = "44--54",
abstract = "We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction published between 1719 and 1922. This dataset differs from previous coreference corpora in containing documents whose average length (2,105.3 words) is four times longer than other benchmark datasets (463.7 for OntoNotes), and contains examples of difficult coreference problems common in literature. This dataset allows for an evaluation of cross-domain performance for the task of coreference resolution, and analysis into the characteristics of long-distance within-document coreference.",
language = "English",
ISBN = "979-10-95546-34-4",
}




