Romanian micro-blogging named entity recognition (MicroBloggingNERo)
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6905234
Description:
MicroBloggingNERo is a manually annotated corpus for named entity recognition in Romanian micro-blogging texts.
It provides gold annotations for organizations, locations, persons, time expressions, legal references, medical devices,
chemicals, anatomical parts and disorders found in micro-blogging texts. The text was anonymized by replacing all
URLs and user references with placeholder tokens, and person names, specific locations and organizations with new randomized names.
Anonymization was performed in the same way, regardless of the platform-specific format of each micro-blogging service.
Because names were replaced with randomly generated ones, any resemblance to real individuals is purely coincidental.
No real person is depicted in the included messages.
DATA
The MicroBloggingNERo corpus is available in different formats: text, span-based, and token-based.
Text files are in the folder "text" with .txt extension, in UTF-8 encoding.
Span-based annotations are given in BRAT (https://brat.nlplab.org/) ann format. These annotations can be found in folders starting with "ann_".
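As an illustration of the span-based format, the following is a minimal sketch of reading text-bound entities from the contents of a BRAT .ann file. It relies only on the general BRAT standoff layout (tab-separated ID, "TYPE START END", surface text). The label names PER and LOC and the sample sentence are hypothetical; the actual label inventory should be checked in the corpus files.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Parse text-bound ("T") annotations from the contents of a BRAT .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and other record types
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line: ignore rather than fail
        ann_id, type_and_offsets, surface = parts[0], parts[1], parts[2]
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer bounds only.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, surface))
    return entities


# Hypothetical example data, not taken from the corpus.
sample_text = "Ana merge la Cluj"
sample_ann = "T1\tPER 0 3\tAna\nT2\tLOC 13 17\tCluj\n"

for entity in parse_brat_entities(sample_ann):
    # Offsets index into the matching .txt file from the "text" folder.
    print(entity.label, sample_text[entity.start:entity.end])
```

The character offsets refer to the corresponding file in the "text" folder, so the .txt file should be read as UTF-8 before slicing.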
Token-based annotations are given in CONLLUP files, following the CoNLL-U Plus format https://universaldependencies.org/ext-format.html .
Part-of-speech tagging was performed using UDPipe.
Named entity annotations are placed in the column "RELATE:NE" (the 11th column) as defined in the "global.columns" metadata field.
Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
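A minimal sketch of reading the token-based files follows. It locates the "RELATE:NE" column through the "global.columns" header rather than hard-coding its position. The sample rows, the ten column names preceding RELATE:NE, and the tag values (a BIO-style scheme such as "B-PER" and "O") are assumptions made for illustration; the tag scheme actually used should be checked in the CONLLUP files.

```python
def read_conllup_ne(conllup_text, column="RELATE:NE"):
    """Return sentences as lists of (FORM, tag) pairs from CoNLL-U Plus text."""
    columns = None
    sentences, current = [], []
    for line in conllup_text.splitlines():
        if line.startswith("# global.columns"):
            # Header looks like "# global.columns = ID FORM ... RELATE:NE"
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            continue  # other comment / metadata lines
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        fields = line.split("\t")
        current.append((fields[columns.index("FORM")], fields[columns.index(column)]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; real files come from the conllup_* folders.
header = "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC RELATE:NE"
rows = [
    ["1", "Ana", "Ana", "PROPN", "_", "_", "2", "nsubj", "_", "_", "B-PER"],
    ["2", "merge", "merge", "VERB", "_", "_", "0", "root", "_", "_", "O"],
]
sample = header + "\n" + "\n".join("\t".join(row) for row in rows) + "\n\n"

for sentence in read_conllup_ne(sample):
    print(sentence)
```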
The archive contains:
- ann_EVERYTHING
Folder of .ann files containing annotations of: legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders.
Overlapping annotations of organizations and time entities inside legal references were allowed.
- ann_EVERYTHING_LARGEST_SPAN
Folder of .ann files containing annotations of: legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders.
Overlapping annotations were not allowed and only the longest named entities were annotated. This affects primarily the legal references class.
- ann_LEGAL_PER_LOC_ORG_TIME
Folder of .ann files containing annotations of: legal references, persons, locations, organizations and time.
There are no overlapping annotations.
- ann_PER_LOC_ORG_TIME
Folder of .ann files containing annotations of: persons, locations, organizations and time.
There are no overlapping annotations.
- ann_BIOMEDICAL
Folder of .ann files containing annotations of: medical devices, chemicals, anatomical parts and disorders.
There are no overlapping annotations.
- conllup_EVERYTHING_LARGEST_SPAN
Folder of .conllup files containing annotations of: legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders.
There are no overlapping annotations.
- conllup_LEGAL_PER_LOC_ORG_TIME
Folder of .conllup files containing annotations of: legal references, persons, locations, organizations and time.
Overlapping annotations were not allowed and only the longest named entities were annotated.
- conllup_PER_LOC_ORG_TIME
Folder of .conllup files containing annotations of: persons, locations, organizations and time.
Overlapping annotations were not allowed and only the longest named entities were annotated.
- conllup_BIOMEDICAL
Folder of .conllup files containing annotations of: medical devices, chemicals, anatomical parts and disorders.
Overlapping annotations were not allowed and only the longest named entities were annotated.
- text
Folder containing the raw texts.
- splits.tsv
Proposed splits into train, test and valid sets, following a 70-15-15% distribution for each entity class, based on the ann_EVERYTHING_LARGEST_SPAN folder.
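The difference between ann_EVERYTHING (overlaps allowed inside legal references) and the largest-span folders can be illustrated with a small sketch. The greedy filter below is only one possible way to reduce overlapping spans to the longest ones; the source does not state how the largest-span variant was actually produced, and the function name, label names and offsets are hypothetical.

```python
def keep_longest_spans(spans):
    """Keep the longest spans first, dropping any span that overlaps one already kept.

    `spans` is an iterable of (start, end, label) tuples with end exclusive.
    """
    kept = []
    # Longest spans first; ties are broken by the earlier start offset.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # A candidate is compatible if it lies entirely before or after each kept span.
        if all(end <= kept_start or start >= kept_end for kept_start, kept_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical spans: an organization and a time expression nested in a legal reference.
overlapping = [
    (0, 40, "LEGAL"),
    (10, 20, "ORG"),
    (30, 40, "TIME"),
    (50, 55, "PER"),
]

print(keep_longest_spans(overlapping))
```

With these inputs the nested organization and time spans are dropped and only the legal reference and the person span remain, which matches the description of the largest-span folders.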
LICENSING
This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International).
The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/
and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .
CONTACT
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy
Web: http://www.racai.ro
Contact emails: vasile@racai.ro , maria@racai.ro , vergi@racai.ro , elena@racai.ro
Created:
2022-07-27



