HistNERo
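HistNERo Dataset Overview
Dataset Description
HistNERo is a dataset for Historical Romanian Named Entity Recognition. It contains 10,026 sentences, split into a training set (8,020 sentences), a validation set (1,003 sentences), and a test set (1,003 sentences). The sentences are annotated with five named entity types: PERSON, ORGANIZATION, LOCATION, PRODUCT, and DATE.
Data Format
The dataset is stored as JSON files in the data directory, one each for the training, validation, and test sets. Each sample has the following format: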
json
{
"id": "528",
"ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 3, 4, 4, 4, 4, 0],
"tokens": ["maĭ", "incóce", "vidu", "locuitoriĭ", "suburbiilorŭ", ",", "iar", "la", "riul", "Sabiĭului", "de", "càtrâ", "bisericâ", "romanéscâ", "gr.", "unitá", "Mai", "."],
"doc_id": "Brasov_20-_20Gazeta_20Transilvaniei_201852.ann",
"region": "Transylvania"
}
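To make the parallel layout of `tokens` and `ner_tags` concrete, here is a minimal sketch that pairs each token with its tag id and keeps only the tagged ones. The helper name `tagged_tokens` is hypothetical, and the code assumes that tag id 0 marks tokens outside any entity; the mapping from the other ids to entity labels is not given in this card, so the ids are left as plain integers.

```python
# Sample record copied from the example above (doc_id and region omitted).
record = {
    "id": "528",
    "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 3, 4, 4, 4, 4, 0],
    "tokens": ["maĭ", "incóce", "vidu", "locuitoriĭ", "suburbiilorŭ", ",",
               "iar", "la", "riul", "Sabiĭului", "de", "càtrâ", "bisericâ",
               "romanéscâ", "gr.", "unitá", "Mai", "."],
}


def tagged_tokens(record, outside_tag=0):
    """Return (token, tag_id) pairs for tokens whose tag is not `outside_tag`.

    ASSUMPTION: tag id 0 means "outside any entity"; this card does not
    document the id-to-label mapping.
    """
    return [
        (token, tag)
        for token, tag in zip(record["tokens"], record["ner_tags"])
        if tag != outside_tag
    ]


pairs = tagged_tokens(record)
print(pairs)
```

Each position in `ner_tags` lines up with the token at the same index, so both lists always have the same length within a record.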
Loading the Dataset
Install the datasets library, then load the dataset by running the following code:
python
from datasets import load_dataset
dataset = load_dataset("avramandrei/histnero")
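After loading, a quick sanity check is to compare the size of each split against the counts stated in the description. The sketch below is one way to do that. The helper `split_sizes` is hypothetical and works on any mapping from split name to a sized collection, which is what `load_dataset` returns when no split is requested. The split names themselves are not listed in this card, so the code does not assume them.

```python
def split_sizes(splits):
    """Return {split_name: number_of_examples} for a mapping of splits."""
    return {name: len(split) for name, split in splits.items()}


# With the dataset loaded as shown above (requires network access):
#   from datasets import load_dataset
#   dataset = load_dataset("avramandrei/histnero")
#   print(split_sizes(dataset))
```

If the download matches this card, the three reported sizes should be 8,020, 1,003, and 1,003.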
Citation
@article{avram2024histnero,
  title={HistNERo: Historical Named Entity Recognition for the Romanian Language},
  author={Andrei-Marius Avram and Andreea Iuga and George-Vlad Manolache and Vlad-Cristian Matei and Răzvan-Gabriel Micliuş and Vlad-Andrei Muntean and Manuel-Petru Sorlescu and Dragoş-Andrei Şerban and Adrian-Dinu Urse and Vasile Păiş and Dumitru-Clementin Cercel},
  journal={arXiv preprint arXiv:2405.00155},
  year={2024}
}




