Novelties

Name: Novelties
Creator: Laboratoire Informatique d'Avignon (LIA)
Published: 2024-10-03 16:03:40
License: no description available

arXiv; updated 2024-10-03; indexed 2024-10-06

Download link:

https://github.com/CompNet/Novelties


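Overview:

The Novelties dataset is a corpus of literary works for named entity recognition (NER), created by the Laboratoire Informatique d'Avignon (LIA). It contains annotated novels and excerpts of novels. It is intended for training and testing NER methods that can handle long texts, and for developing a pipeline that extracts character networks from literary works. The annotations were produced manually and cover several entity types, such as persons, locations, and organizations. Application areas include assessing literary theories, analyzing the historicity of narratives, detecting character roles, and classifying novels.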

Provider:

Laboratoire Informatique d'Avignon (LIA)

Created:

2024-10-03

Summary

Dataset introduction

Construction

The Novelties corpus was built by manually annotating named entities in a collection of novels and excerpts. The annotation guidelines, detailed in the associated document, give annotators rules and examples that specify which expressions should be marked as entities and which should not. The corpus is designed for training and testing NER methods that can handle long texts, and it supports the development of the Renard pipeline, which aims to extract character networks from literary fiction.

Features

The Novelties corpus relies on detailed annotation guidelines, which are intended to keep the identification of named entities consistent and accurate. It covers not only traditional entity types such as persons, locations, and organizations but also broader categories such as characters, groups, and miscellaneous entities. This broader coverage allows for a richer analysis of literary texts, including the extraction of character networks, and supports a range of literary analysis tasks. The corpus is also structured to accommodate the long texts and complex narrative structures typical of novels.

Usage

The Novelties corpus is intended for computational linguistics tasks, particularly Named Entity Recognition (NER) and the extraction of character networks from literary texts. Researchers and practitioners can use it to train and evaluate NER models on long texts with complex narrative structures. The corpus also supports the development of the Renard pipeline, which takes NER outputs, resolves coreferences, and unifies character mentions in order to build character networks. These networks can be used for tasks such as assessing literary theories, analyzing narrative historicity, detecting roles in stories, classifying novels, identifying subplots, segmenting storylines, summarizing narratives, designing recommendation systems, and aligning narratives.
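The flow described above (NER output, then unification of character mentions, then network construction) can be sketched as follows. This is a minimal illustration and not Renard's actual implementation: the alias table, the function names `unify` and `cooccurrence_network`, and the 25-token window are all assumptions made for the example, and a plain lookup table stands in for real coreference resolution.

```python
from collections import Counter

# Hypothetical alias table: surface form -> canonical character name.
# A real pipeline would derive this from coreference resolution.
ALIASES = {"Lizzy": "Elizabeth", "Mr. Darcy": "Darcy"}


def unify(mentions, aliases=ALIASES):
    """Map each (token_position, surface_form) to a canonical name."""
    return [(pos, aliases.get(name, name)) for pos, name in mentions]


def cooccurrence_network(mentions, window=25):
    """Count pairs of distinct characters mentioned within `window` tokens.

    `mentions` is a list of (token_position, character_name) pairs.
    Returns a Counter mapping a sorted (name_a, name_b) tuple to the
    number of co-occurring mention pairs, i.e. weighted edges.
    """
    edges = Counter()
    mentions = sorted(mentions)
    for i, (pos_a, name_a) in enumerate(mentions):
        for pos_b, name_b in mentions[i + 1:]:
            if pos_b - pos_a > window:
                break  # later mentions are even farther away
            if name_a != name_b:
                edges[tuple(sorted((name_a, name_b)))] += 1
    return edges


# Illustrative mentions (invented for this example, not taken from the corpus).
mentions = [(0, "Lizzy"), (5, "Mr. Darcy"), (40, "Elizabeth"),
            (42, "Jane"), (100, "Darcy")]
edges = cooccurrence_network(unify(mentions))
print(edges)
```

Co-occurrence within a fixed token window is only one possible way to define an edge; it is used here because it needs nothing beyond mention positions.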

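Background and challenges

Background overview

The Novelties dataset was created in 2024 by the Avignon laboratory (LIA UPR 4128), in work led by Arthur Amalvy and Vincent Labatut, and focuses on named entity recognition (NER) in novels. In the short term, the dataset is meant for training and testing NER methods on long texts. In the long term, it is meant to support the development of the Renard pipeline, which aims to extract character networks from literary fiction. A distinctive trait of Novelties is its broad definition of character entities: it covers not only conventional person names but also non-human characters such as animals, robots, and magical creatures. This extended definition makes Novelties potentially useful for assessing literary theories, analyzing narrative historicity, classifying characters, and similar tasks.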

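Current challenges

The Novelties dataset faces several challenges. First, recognizing named entities in novels is particularly complex, because novels often use metaphor, symbolism, and non-standard language, all of which make entities harder to identify. Second, building the dataset requires handling nested entities, where one entity contains another, such as "United States" and "President" inside "President of the United States". In addition, a character may be referred to by different names or descriptions, such as nicknames or social roles, so annotators need a thorough understanding of the text and a global view of it. Finally, because the worlds of novels are fictional, annotation has to rely on external resources, such as wikis dedicated to specific novels, to keep the annotations accurate and consistent.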
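The nested-entity case can be made concrete with token spans. The sketch below is illustrative only: the `Span` type, the `nested_pairs` helper, and the label strings `CHR` and `LOC` are assumptions for this example and do not reflect the corpus's actual file format or tag names.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)
    label: str   # hypothetical entity label


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies inside outer.

    Spans with identical boundaries are not treated as nested.
    """
    pairs = []
    for outer in spans:
        for inner in spans:
            contained = outer.start <= inner.start and inner.end <= outer.end
            same_bounds = (outer.start, outer.end) == (inner.start, inner.end)
            if contained and not same_bounds:
                pairs.append((outer, inner))
    return pairs


tokens = ["the", "President", "of", "the", "United", "States"]
spans = [
    Span(1, 6, "CHR"),  # "President of the United States"
    Span(4, 6, "LOC"),  # "United States"
]
for outer, inner in nested_pairs(spans):
    print(" ".join(tokens[outer.start:outer.end]), "contains",
          " ".join(tokens[inner.start:inner.end]))
```

A flat tagging scheme with one label per token cannot encode both spans at once, which is why nesting has to be addressed explicitly during annotation.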

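Common use cases

Typical use cases

The typical use case for the Novelties dataset is named entity recognition (NER). Its manually annotated novel texts provide material for training and testing NER methods, especially on long texts. The dataset is also used to develop the Renard pipeline, a tool that aims to extract character networks from literary works. Through the processing that follows the NER step, including coreference resolution and character unification, the dataset provides a basis for building and analyzing character networks in literary works.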
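One common way to run an NER model with a limited input length over a long text is to split the token sequence into overlapping windows. The sketch below shows that generic technique under stated assumptions: the function name `sliding_windows` and the default sizes are hypothetical, and the source does not say that this is how the corpus authors process long texts.

```python
def sliding_windows(tokens, size=512, stride=384):
    """Split `tokens` into overlapping windows.

    Returns a list of (start_offset, window_tokens) pairs. Consecutive
    windows overlap by `size - stride` tokens, so an entity cut at one
    window boundary appears whole in a neighboring window whenever it is
    no longer than the overlap.
    """
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("need 0 < stride <= size")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window reached the end of the text
        start += stride
    return windows


# Small sizes so the result is easy to inspect.
demo = sliding_windows(list(range(10)), size=4, stride=3)
print(demo)
```

The start offsets make it possible to map window-level predictions back to positions in the full text; how overlapping predictions are merged is a separate design decision not covered here.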

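Derived and related work

The release and use of the Novelties dataset are associated with several lines of related work. First, the Renard pipeline, developed with Novelties, serves as a tool for extracting and analyzing character networks in literary works and promotes the use of character networks in literary studies. Second, the dataset's detailed annotation guidelines and range of entity types encourage further research on named entity recognition in literary texts. In addition, Novelties provides a benchmark for comparing how different NER datasets perform on literary texts, which supports exchange between NER work in different domains. Through this work, the dataset adds to the digital resources available for literary texts and offers new approaches for research in related fields.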

Recent research on the dataset