Romanian micro-blogging named entity recognition (MicroBloggingNERo)

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/6905234

Resource description:

MicroBloggingNERo is a manually annotated corpus for named entity recognition in Romanian micro-blogging texts. It provides gold annotations for organizations, locations, persons, time expressions, legal references, medical devices, chemicals, anatomical parts, and disorders.

The text was anonymized by replacing all URLs and user references with placeholder tokens, and by replacing person names, specific locations, and organizations with new randomized names. Anonymization was performed in the same way regardless of the platform-specific format. Because names were replaced with randomly generated ones, any resemblance to real individuals is purely coincidental. No real person is depicted in the included messages.

DATA

The MicroBloggingNERo corpus is available in three formats: text, span-based, and token-based.

- Text files are in the folder "text", with the .txt extension and UTF-8 encoding.
- Span-based annotations are given in the BRAT (https://brat.nlplab.org/) ann format. These annotations are in the folders whose names start with "ann_".
- Token-based annotations are given in CONLLUP files, following the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field.

Automatic processing was performed through the RELATE platform (https://relate.racai.ro).

The archive contains:

- ann_EVERYTHING: all files are in .ann format and contain annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts, and disorders. Overlapping annotations of organizations and time entities inside legal references were allowed.
- ann_EVERYTHING_LARGEST_SPAN: all files are in .ann format and contain annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts, and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated. This primarily affects the legal references class.
- ann_LEGAL_PER_LOC_ORG_TIME: all files are in .ann format and contain annotations of legal references, persons, locations, organizations, and time. There are no overlapping annotations.
- ann_PER_LOC_ORG_TIME: all files are in .ann format and contain annotations of persons, locations, organizations, and time. There are no overlapping annotations.
- ann_BIOMEDICAL: all files are in .ann format and contain annotations of medical devices, chemicals, anatomical parts, and disorders. There are no overlapping annotations.
- conllup_EVERYTHING_LARGEST_SPAN: all files are in .conllup format and contain annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts, and disorders. There are no overlapping annotations.
- conllup_LEGAL_PER_LOC_ORG_TIME: all files are in .conllup format and contain annotations of legal references, persons, locations, organizations, and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
- conllup_PER_LOC_ORG_TIME: all files are in .conllup format and contain annotations of persons, locations, organizations, and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
- conllup_BIOMEDICAL: all files are in .conllup format and contain annotations of medical devices, chemicals, anatomical parts, and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated.
- text: folder containing the raw texts.
- splits.tsv: proposed train, test, and validation splits following a 70/15/15% distribution for each entity class, based on the ann_EVERYTHING_LARGEST_SPAN folder.

LICENSING

This work is provided under the CC BY-NC-ND 4.0 license (Attribution-NonCommercial-NoDerivatives 4.0 International). A summary of the license is available at https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text at https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .

CONTACT

Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy
Web: http://www.racai.ro
Contact emails: vasile@racai.ro , maria@racai.ro , vergi@racai.ro , elena@racai.ro
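As an illustration of the two annotation formats described above, the following is a minimal Python sketch for reading them. It is not part of the corpus distribution. The function names `parse_brat_ann` and `read_conllup_ne` are hypothetical, the BRAT parser relies on the standard text-bound annotation layout (`ID<TAB>TYPE START END<TAB>TEXT`), and the CoNLL-U Plus reader assumes each file starts with a `# global.columns` line that lists both FORM and RELATE:NE. The tag values found in the RELATE:NE column are returned as-is, since the description does not state the tagging scheme.

```python
from typing import Iterator, List, Tuple

# (annotation id, entity label, [(start, end), ...], surface text)
BratEntity = Tuple[str, str, List[Tuple[int, int]], str]


def parse_brat_ann(ann_text: str) -> List[BratEntity]:
    """Collect text-bound (T) annotations from the contents of a BRAT .ann file."""
    entities: List[BratEntity] = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and blank lines
        parts = line.split("\t", 2)
        if len(parts) < 2:
            continue  # malformed line, ignore it
        label, offsets = parts[1].split(" ", 1)
        # Discontinuous entities list several "start end" fragments separated by ';'.
        spans = [
            (int(frag.split()[0]), int(frag.split()[1]))
            for frag in offsets.split(";")
        ]
        surface = parts[2] if len(parts) > 2 else ""
        entities.append((parts[0], label, spans, surface))
    return entities


def read_conllup_ne(
    conllup_text: str, column: str = "RELATE:NE"
) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (FORM, NE tag) pairs per sentence of a CoNLL-U Plus file."""
    columns: List[str] = []
    sentence: List[Tuple[str, str]] = []
    for line in conllup_text.splitlines():
        if line.startswith("# global.columns"):
            # Column names are whitespace-separated after the '=' sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other sentence-level comments
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            fields = line.split("\t")
            form = fields[columns.index("FORM")]
            tag = fields[columns.index(column)]
            sentence.append((form, tag))
    if sentence:
        yield sentence
```

Both functions take the file contents as a string, so a file would be read first with something like `open(path, encoding="utf-8").read()`. Looking the column up by name in `global.columns`, rather than hard-coding the 11th position, keeps the reader working if a file declares its columns in a different order.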

Created:

2022-07-27
