Synthetic Context Retrieval Training Dataset
Source: arXiv. Updated: 2024-04-08. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2310.10118v3
Description:
The Synthetic Context Retrieval Training Dataset was created by the Information Laboratory of Avignon University to address named entity recognition (NER) in long documents. It contains 2716 samples generated with the Alpaca large language model (LLM). Positive and negative samples are produced from specific prompt templates and are used to train neural network models for context retrieval, so that document-level context can be supplied to support entity recognition. The dataset is aimed in particular at the analysis of literary works, where the goal is to improve entity recognition accuracy on long texts.
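The description says that positive and negative samples are generated from prompt templates. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the dataset authors' pipeline: the template wording, the `RetrievalSample` fields, the `build_samples` helper, and the `fake_llm` stand-in for the Alpaca model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical templates: the dataset's actual prompts are not given on this page.
# Key 1 marks a positive (helpful) context, key 0 a negative (unhelpful) one.
TEMPLATES = {
    1: "Write a sentence that gives useful context about the entities in: {sentence}",
    0: "Write a sentence unrelated to the entities in: {sentence}",
}


@dataclass
class RetrievalSample:
    sentence: str  # sentence in which entities must be recognized
    context: str   # generated candidate context sentence
    label: int     # 1 = positive sample, 0 = negative sample


def build_samples(
    sentences: List[str], generate: Callable[[str], str]
) -> List[RetrievalSample]:
    """Create one positive and one negative sample per input sentence."""
    samples = []
    for sentence in sentences:
        for label, template in TEMPLATES.items():
            prompt = template.format(sentence=sentence)
            samples.append(RetrievalSample(sentence, generate(prompt), label))
    return samples


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption): it only echoes the prompt.
    return f"[generated for: {prompt}]"


samples = build_samples(["Gandalf rode to Minas Tirith."], fake_llm)
for s in samples:
    print(s.label, s.context)
```

Passing the generator in as a callable keeps the sketch independent of any particular model API; a real text-generation function could be swapped in for `fake_llm`.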
Provider:
Information Laboratory, Avignon University
Created:
2023-10-16



