
Romanian micro-blogging named entity recognition (MicroBloggingNERo)

Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6905234
Description:
MicroBloggingNERo is a manually annotated corpus for named entity recognition in Romanian micro-blogging texts. It provides gold annotations for organizations, locations, persons, time expressions, legal references, medical devices, chemicals, anatomical parts and disorders.

The text was anonymized by replacing all URLs and user references with placeholder tokens, and person names, specific locations and organizations with new randomized names. Anonymization was performed in the same way regardless of the platform-specific format. Because names were replaced with random ones, any resemblance to real individuals is purely coincidental. No real person is depicted in the included messages.

DATA

The MicroBloggingNERo corpus is available in three formats: text, span-based, and token-based.

Text files are in the folder "text", with the .txt extension, in UTF-8 encoding. Span-based annotations are given in BRAT (https://brat.nlplab.org/) .ann format, in the folders starting with "ann_". Token-based annotations are given in CONLLUP files, following the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html), in the folders starting with "conllup_". Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Automatic processing was performed through the RELATE platform (https://relate.racai.ro).

The archive contains:

- ann_EVERYTHING
    Files in .ann format with annotations of: legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders.
    Overlapping annotations of organizations and time entities inside legal references were allowed.
- ann_EVERYTHING_LARGEST_SPAN
    Files in .ann format with annotations of: legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders.
    Overlapping annotations were not allowed and only the longest named entities were annotated. This primarily affects the legal references class.
- ann_LEGAL_PER_LOC_ORG_TIME
    Files in .ann format with annotations of: legal references, persons, locations, organizations and time.
    There are no overlapping annotations.
- ann_PER_LOC_ORG_TIME
    Files in .ann format with annotations of: persons, locations, organizations and time.
    There are no overlapping annotations.
- ann_BIOMEDICAL
    Files in .ann format with annotations of: medical devices, chemicals, anatomical parts and disorders.
    There are no overlapping annotations.
- conllup_EVERYTHING_LARGEST_SPAN
    Files in .conllup format with annotations of: legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders.
    There are no overlapping annotations.
- conllup_LEGAL_PER_LOC_ORG_TIME
    Files in .conllup format with annotations of: legal references, persons, locations, organizations and time.
    Overlapping annotations were not allowed and only the longest named entities were annotated.
- conllup_PER_LOC_ORG_TIME
    Files in .conllup format with annotations of: persons, locations, organizations and time.
    Overlapping annotations were not allowed and only the longest named entities were annotated.
- conllup_BIOMEDICAL
    Files in .conllup format with annotations of: medical devices, chemicals, anatomical parts and disorders.
    Overlapping annotations were not allowed and only the longest named entities were annotated.
- text
    Folder containing the raw texts.
- splits.tsv
    Proposed splits into train, test and valid, following a 70-15-15% distribution for each entity class, based on the ann_EVERYTHING_LARGEST_SPAN folder.

LICENSING

This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online at https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text at https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .

CONTACT

Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy
Web: http://www.racai.ro
Contact emails: vasile@racai.ro , maria@racai.ro , vergi@racai.ro , elena@racai.ro
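
EXAMPLE (illustrative, not part of the original archive)

The DATA section describes two annotation formats: CoNLL-U Plus files whose "RELATE:NE" column is declared in the "global.columns" header, and BRAT .ann files with text-bound entity lines. The Python sketch below shows one possible way to read both using only the standard library. The function names, the sample sentence, and the label strings "B-PER", "O" and "PER" are assumptions made for illustration; the actual tag inventory and tagging scheme should be checked against the files in the archive.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conllup_ne(
    lines: Iterable[str], ne_column: str = "RELATE:NE"
) -> Iterator[List[Tuple[str, str]]]:
    """Yield one sentence at a time as a list of (FORM, NE value) pairs.

    Column positions are looked up in the "# global.columns" header
    instead of being hard-coded to the 11th column.
    """
    columns = None
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            columns = line.split("=", 1)[1].split()
            continue
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # other comment / metadata lines
            continue
        if columns is None:
            raise ValueError("no '# global.columns' header found")
        fields = line.split("\t")
        sentence.append(
            (fields[columns.index("FORM")], fields[columns.index(ne_column)])
        )
    if sentence:  # file did not end with a blank line
        yield sentence


def read_brat_entities(
    lines: Iterable[str],
) -> List[Tuple[str, str, List[Tuple[int, ...]], str]]:
    """Return (id, label, [(start, end), ...], text) for each BRAT entity line."""
    entities = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("T"):  # keep text-bound annotations only
            continue
        ann_id, label_and_span, text = line.split("\t", 2)
        label, span_part = label_and_span.split(" ", 1)
        # Discontinuous entities list several "start end" pairs separated by ';'
        fragments = [
            tuple(int(x) for x in frag.split()) for frag in span_part.split(";")
        ]
        entities.append((ann_id, label, fragments, text))
    return entities


# Hypothetical sample data, invented for this example (not taken from the corpus).
SAMPLE_CONLLUP = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC RELATE:NE\n"
    "# sent_id = 1\n"
    "1\tAna\tAna\tPROPN\t_\t_\t0\troot\t_\t_\tB-PER\n"
    "2\tdoarme\tdormi\tVERB\t_\t_\t1\t_\t_\t_\tO\n"
    "\n"
)
SAMPLE_ANN = "T1\tPER 0 3\tAna\n"

if __name__ == "__main__":
    for sent in read_conllup_ne(SAMPLE_CONLLUP.splitlines()):
        print(sent)
    print(read_brat_entities(SAMPLE_ANN.splitlines()))
```

To apply this to the archive, the same functions could be fed an open file object, for example a file from conllup_EVERYTHING_LARGEST_SPAN opened with encoding="utf-8".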
Created:
2022-07-27