Biomedical-TeMU/ProfNER_corpus_NER

Name: Biomedical-TeMU/ProfNER_corpus_NER
Creator: Biomedical-TeMU
Published: 2022-03-10 21:50:30
License: cc-by-4.0

Source: Hugging Face. Updated: 2022-03-10. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Biomedical-TeMU/ProfNER_corpus_NER

Resource description:

---
license: cc-by-4.0
---

## Description

**Gold standard annotations for profession detection in Spanish COVID-19 tweets**

The entire corpus contains 10,000 annotated tweets. It has been split into training, validation, and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In addition, it contains the unannotated test, and background sets will be released.

For Named Entity Recognition, profession detection, annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html).

The TSV format follows the format employed in SMM4H 2019 Task 2:

tweet_id | begin | end | type | extraction

In addition, we provide a tokenized version of the dataset. It follows the BIO format (similar to CONLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 Spacy model for tokenization.

## Files of Named Entity Recognition subtask

Content:

- One TSV file per corpus split (train and valid).
- brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid).
- BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid).
- train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid).
- train-valid-txt-files-english: folder with training and validation text files Machine Translated to English.
- test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab.
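
The Brat standoff annotations mentioned above can be read with a few lines of Python. The following is a minimal sketch, assuming text-bound annotation lines of the form `ID<TAB>TYPE START END<TAB>TEXT` as described on the Brat standoff page. The `Entity` type, the `parse_brat_ann` helper, the `PROFESION` label, and the sample line are all illustrative assumptions and are not taken from the corpus.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_ann(content: str) -> List[Entity]:
    """Parse text-bound (T) annotations from the contents of a Brat .ann file."""
    entities = []
    for line in content.splitlines():
        # Only text-bound annotations start with "T"; skip notes, relations, etc.
        if not line.startswith("T"):
            continue
        ann_id, span, text = line.split("\t", 2)
        label, offsets = span.split(" ", 1)
        # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer bounds.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical sample (label, offsets, and text are made up for illustration).
sample = "T1\tPROFESION 4 14\tsanitarios\n#1\tAnnotatorNotes T1\tnote"
for entity in parse_brat_ann(sample):
    print(entity)
```

For a real file, the contents would be read with something like `open(path, encoding="utf-8").read()` before being passed to the helper.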

Provider:

Biomedical-TeMU

Summary of original information

Dataset overview

Dataset name

Gold standard annotations for profession detection in Spanish COVID-19 tweets

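Dataset contents

Total size: 10,000 annotated tweets
Split: training, validation, and test sets (60-20-20)
Current version includes: gold standard annotations for the training and development sets, plus the unannotated test and background sets (to be released later)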

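Data format

Annotation formats: two formats are provided, Brat standoff and TSV
- Brat standoff: see the Brat webpage for details
- TSV: follows the format of SMM4H 2019 Task 2

  tweet_id | begin | end | type | extraction
Tokenized version: in BIO format; the files were generated with the brat_to_conll.py script, which uses the es_core_news_sm-2.3.1 Spacy model for tokenization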
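
The TSV layout above maps naturally onto one record per annotation. The following is a minimal sketch, assuming the files are tab-separated with the five columns in the order shown and that `begin` and `end` are integer offsets. Whether the real files carry a header row is not stated in the source, so the hypothetical `read_ner_tsv` helper simply skips any row whose offsets are not numeric. The sample row and the `PROFESION` label are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, Union

COLUMNS = ["tweet_id", "begin", "end", "type", "extraction"]


def read_ner_tsv(handle) -> Iterator[Dict[str, Union[str, int]]]:
    """Yield one record per annotation row of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed rows
        record = dict(zip(COLUMNS, row))
        if not (record["begin"].isdigit() and record["end"].isdigit()):
            continue  # assumed header row: offsets are not numeric
        record["begin"] = int(record["begin"])
        record["end"] = int(record["end"])
        yield record


# Made-up sample content standing in for a real TSV file.
sample = io.StringIO(
    "tweet_id\tbegin\tend\ttype\textraction\n"
    "1234567890\t4\t14\tPROFESION\tsanitarios\n"
)
records = list(read_ner_tsv(sample))
print(records)
```

A real file would be passed in as `open(path, encoding="utf-8", newline="")`. The tweet ID is kept as a string so that long numeric IDs are never altered.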

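File structure

TSV files: one TSV file per corpus split (train and valid)
Brat files: one sub-directory per corpus split (train and valid)
BIO files: one file per corpus split (train and valid)
Text files: one text file per tweet, in one sub-directory per corpus split (train and valid)
English translation text files: machine-translated versions of the training and validation text files
Test and background text files: the files to make predictions for; predictions are uploaded to CodaLab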
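
The BIO files can be turned back into entity spans with a small reader. The following is a minimal sketch, assuming a CoNLL-like layout in which each non-empty line holds a token in the first whitespace-separated column and its tag in the last one, and blank lines separate tweets. The exact column layout produced by brat_to_conll.py is not given in the source, and the `PROFESION` label and sample text are made up for illustration.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_bio(content: str) -> List[Sentence]:
    """Split BIO-tagged text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in content.splitlines():
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: token is the first column and the tag is the last one.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(sentence: Sentence) -> List[Tuple[str, List[str]]]:
    """Group B-/I- tagged tokens into (label, tokens) entity spans."""
    spans = []
    current = None  # (label, tokens) of the entity being built, if any
    for token, tag in sentence:
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(token)
        else:
            current = None  # "O" tag, or an I- tag with no matching open entity
    return spans


# Made-up sample in the assumed two-column layout.
sample = (
    "Los O\nmédicos B-PROFESION\nde I-PROFESION\nfamilia I-PROFESION\n"
    "trabajan O\n\nHola O\n"
)
sentences = read_bio(sample)
for sentence in sentences:
    print(bio_to_spans(sentence))
```

Taking the tag from the last column keeps the reader usable if the files carry extra columns between token and tag, such as file names or character offsets.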

License

cc-by-4.0