Biomedical-TeMU/ProfNER_corpus_NER
Source: Hugging Face. Updated 2022-03-10. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Biomedical-TeMU/ProfNER_corpus_NER
Resource description:
---
license: cc-by-4.0
---
## Description
**Gold standard annotations for profession detection in Spanish COVID-19 tweets**
The entire corpus contains 10,000 annotated tweets, split into training, validation, and test sets in a 60/20/20 ratio. The current version contains the training and development (validation) sets of the shared task with Gold Standard annotations. In addition, it contains the unannotated test and background sets.
For the Named Entity Recognition (profession detection) subtask, annotations are distributed in two formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html).
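As a rough illustration of the Brat standoff layout, the sketch below reads the text-bound (`T`) lines of an `.ann` file: an ID, then the type and character offsets, then the covered text, separated by tabs. The `TextBound` class, the `parse_brat_ann` function, and the sample line (including the `PROFESION` label) are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class TextBound:
    ann_id: str   # e.g. "T1"
    label: str    # entity type
    spans: list   # list of (start, end) character offsets
    text: str     # text covered by the annotation


def parse_brat_ann(content: str) -> list:
    """Parse text-bound (T) annotations from the contents of a Brat .ann file."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations, and other annotation kinds
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ';'
        spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        entities.append(TextBound(ann_id, label, spans, text))
    return entities


# Hypothetical sample: label and offsets are made up for illustration.
sample = "T1\tPROFESION 4 11\tmédicos\n#1\tAnnotatorNotes T1\texample note\n"
for entity in parse_brat_ann(sample):
    print(entity.ann_id, entity.label, entity.spans, entity.text)
```

Only text-bound annotations are handled here; any other line types in a file are skipped.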
The TSV format follows the format employed in SMM4H 2019 Task 2:
tweet_id | begin | end | type | extraction
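A minimal sketch of reading such a file follows. It assumes the columns are tab-separated, appear in the order listed above, and that the first row is a header; none of this is confirmed by the card, so check it against the actual files. The function name `read_annotations` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["tweet_id", "begin", "end", "type", "extraction"]


def read_annotations(handle, has_header=True):
    """Group annotation rows by tweet ID.

    Assumes tab-separated columns in the order given by COLUMNS.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # drop the header row
    by_tweet = defaultdict(list)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed rows
        tweet_id, begin, end, label, extraction = row
        by_tweet[tweet_id].append((int(begin), int(end), label, extraction))
    return dict(by_tweet)


# Hypothetical sample rows; the IDs, offsets, and label are made up.
sample = io.StringIO(
    "tweet_id\tbegin\tend\ttype\textraction\n"
    "123\t4\t11\tPROFESION\tmédicos\n"
    "123\t20\t30\tPROFESION\tenfermeras\n"
)
annotations = read_annotations(sample)
print(annotations["123"])
```

For a real file, the handle would come from `open(path, encoding="utf-8", newline="")` instead of `io.StringIO`.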
In addition, we provide a tokenized version of the dataset. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization.
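To show how BIO tags encode entities, here is a small sketch that reads CoNLL-style lines and rebuilds the tagged mentions. It assumes one token per line with the token in the first whitespace-separated column and the tag in the last, and a blank line between sentences; the real column layout produced by brat_to_conll.py may differ. The function names and the sample sentence are hypothetical.

```python
def read_bio(lines):
    """Split CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # assumed: token first, tag last
    if current:
        sentences.append(current)
    return sentences


def bio_to_entities(sentence):
    """Collect (label, text) pairs from one BIO-tagged sentence."""
    entities, tokens, label = [], [], None
    for token, tag in sentence:
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # B- opens a new entity; a stray I- tag is treated the same way
            if tokens:
                entities.append((label, " ".join(tokens)))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)  # continue the open entity
        else:  # "O" closes any open entity
            if tokens:
                entities.append((label, " ".join(tokens)))
            tokens, label = [], None
    if tokens:
        entities.append((label, " ".join(tokens)))
    return entities


# Hypothetical sample; the label name is made up for illustration.
sample = """Los O
médicos B-PROFESION
de I-PROFESION
familia I-PROFESION
trabajan O

Hola O
""".splitlines()

sentences = read_bio(sample)
print(bio_to_entities(sentences[0]))
```

Joining tokens with a space is a simplification: it does not restore the original spacing of the tweet.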
## Files of the Named Entity Recognition subtask
Content:
- One TSV file per corpus split (train and valid).
- brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid).
- BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid).
- train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid).
- train-valid-txt-files-english: folder with training and validation text files machine-translated to English.
- test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab.
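
Because the text folders hold one file per tweet, a split can be loaded into a dictionary with a few lines of Python. The sketch below assumes each file is named after its tweet ID with a `.txt` extension and is UTF-8 encoded, which the card does not state. The `load_tweets` function, the demo directory, and the sample tweet are hypothetical.

```python
import tempfile
from pathlib import Path


def load_tweets(split_dir):
    """Map each tweet ID (file name without extension) to the tweet text."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(split_dir).glob("*.txt"))
    }


# Self-contained demo on a temporary directory with one made-up tweet.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "train"
    demo_dir.mkdir()
    (demo_dir / "123.txt").write_text(
        "Los médicos trabajan sin descanso.", encoding="utf-8"
    )
    tweets = load_tweets(demo_dir)

# Character offsets from an annotation can then be used to slice the text.
print(tweets["123"][4:11])  # prints "médicos"
```

With the real corpus, `split_dir` would point at one of the split sub-directories inside `train-valid-txt-files`.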
Provider:
Biomedical-TeMU
## Summary of the original information
### Dataset overview
#### Dataset name
Gold standard annotations for profession detection in Spanish COVID-19 tweets
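#### Dataset contents
- Total size: 10,000 annotated tweets
- Splits: training, validation, and test sets (60/20/20 ratio)
- Current version: Gold Standard annotations for the training and development sets, plus the unannotated test and background sets (to be released later)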
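#### Data formats
- Annotation formats: two formats are provided, Brat standoff and TSV
  - Brat standoff: see the Brat webpage for details
  - TSV: follows the SMM4H 2019 Task 2 format, `tweet_id | begin | end | type | extraction`
- Tokenized version: BIO format; the files are generated by the `brat_to_conll.py` script, which uses the `es_core_news_sm-2.3.1` spaCy model for tokenization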
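#### File structure
- TSV files: one TSV file per corpus split (train and valid)
- Brat files: one sub-directory per corpus split (train and valid)
- BIO files: one file per corpus split (train and valid)
- Text files: one text file per tweet, in separate train and valid sub-directories
- English text files: machine-translated versions of the training and validation texts
- Test and background text files: the files to predict on; predictions must be uploaded to CodaLab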
#### License
cc-by-4.0



