
Biomedical-TeMU/ProfNER_corpus_NER

Source: Hugging Face. Updated: 2022-03-10. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Biomedical-TeMU/ProfNER_corpus_NER
Resource description:
---
license: cc-by-4.0
---

## Description

**Gold standard annotations for profession detection in Spanish COVID-19 tweets**

The entire corpus contains 10,000 annotated tweets. It has been split into training, validation, and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In addition, it contains the unannotated test, and background sets will be released.

For Named Entity Recognition, profession detection, annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html).

The TSV format follows the format employed in SMM4H 2019 Task 2:

tweet_id | begin | end | type | extraction

In addition, we provide a tokenized version of the dataset. It follows the BIO format (similar to CONLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 Spacy model for tokenization.

## Files of Named Entity Recognition subtask.

Content:

- One TSV file per corpus split (train and valid).
- brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid)
- BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid)
- train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid)
- train-valid-txt-files-english: folder with training and validation text files Machine Translated to English.
- test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab.
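
As an illustration of the TSV layout named in the description (tweet_id, begin, end, type, extraction), the following is a minimal reading sketch, not the task organizers' loader. It assumes the files are tab-separated, that an optional header row repeats the column names, and that begin and end are integer character offsets. The `Mention` class, the `read_mentions` helper, and the sample row (including its label) are invented for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    tweet_id: str
    begin: int
    end: int
    type: str
    extraction: str


def read_mentions(handle):
    """Parse rows of tweet_id, begin, end, type, extraction from a TSV stream."""
    mentions = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 5 or row[0] == "tweet_id":
            continue  # skip blank or malformed rows and an optional header row
        tweet_id, begin, end, label, text = row
        mentions.append(Mention(tweet_id, int(begin), int(end), label, text))
    return mentions


# Invented sample row for illustration only (not taken from the corpus).
sample = "tweet_id\tbegin\tend\ttype\textraction\n123\t4\t11\tPROFESION\tmédicos\n"
mentions = read_mentions(io.StringIO(sample))
print(mentions)
```

For a real file, the same helper could be called on a handle opened with `open(path, encoding="utf-8", newline="")`; the path depends on where the split files are stored locally.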
Provided by:
Biomedical-TeMU
Summary of original information

Dataset overview

Dataset name

Gold standard annotations for profession detection in Spanish COVID-19 tweets

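Dataset contents

  • Total size: 10,000 annotated tweets
  • Splits: training, validation, and test sets (60-20-20)
  • The current version contains: gold standard annotations for the training and development sets, plus the unannotated test and background sets (to be released later)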

Data format

  • Annotation formats: two are provided, Brat standoff and TSV

    • Brat standoff: see the Brat webpage for details

    • TSV format: follows the format of SMM4H 2019 Task 2

      tweet_id | begin | end | type | extraction

  • Tokenized version: in BIO format; the files were generated by the brat_to_conll.py script, which uses the es_core_news_sm-2.3.1 Spacy model for tokenization
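
To make the BIO convention concrete, here is a minimal sketch that groups B-/I- tagged tokens back into entity mentions. It is not the bundled brat_to_conll.py script. It assumes one token per line, whitespace-separated columns with the token first and the tag last, and a blank line between tweets; the exact column layout of the released files is not stated here, so check it before relying on this. The `bio_to_entities` helper and the sample lines are invented for the example.

```python
def bio_to_entities(lines):
    """Collect (label, text) pairs from BIO-tagged lines."""
    entities = []
    label, tokens = None, []

    def flush():
        # Close the entity being built, if any.
        nonlocal label, tokens
        if label is not None:
            entities.append((label, " ".join(tokens)))
        label, tokens = None, []

    for line in lines:
        parts = line.split()
        if not parts:  # blank line: boundary between tweets
            flush()
            continue
        token, tag = parts[0], parts[-1]  # assumed: token first, tag last
        if tag.startswith("B-"):
            flush()
            label, tokens = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
        else:
            # "O" tags and I- tags that do not continue an open entity end it.
            flush()
    flush()
    return entities


# Invented sample for illustration only (not taken from the corpus).
sample = """Los O
médicos B-PROFESION
de I-PROFESION
familia I-PROFESION
trabajan O

Soy O
enfermera B-PROFESION
""".splitlines()

entities = bio_to_entities(sample)
print(entities)
```

Joining tokens with a single space is a simplification: it does not restore the original spacing of the tweet, so character offsets should be taken from the Brat or TSV annotations rather than rebuilt from the BIO file.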

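File structure

  • TSV files: one TSV file per corpus split (train and valid)
  • Brat format files: one sub-directory per corpus split (train and valid)
  • BIO format files: one file per corpus split (train and valid)
  • Text files: one text file per tweet, in one sub-directory per corpus split (train and valid)
  • English translation text files: machine-translated versions of the training and validation texts
  • Test and background text files: the files to make predictions for; predictions must be uploaded to CodaLab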
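
Because the annotations and the tweet texts live in separate files, a quick consistency check between them can be useful. The sketch below is one possible way to do it and makes several assumptions not confirmed above: that each tweet's text file is named `<tweet_id>.txt`, that begin and end are end-exclusive character offsets into that file, and that rows are given as plain tuples in TSV column order. The `mismatched_spans` helper and the demo tweet are invented.

```python
import tempfile
from pathlib import Path


def mismatched_spans(rows, txt_dir):
    """Return rows whose extraction differs from text[begin:end] in <tweet_id>.txt."""
    bad = []
    for tweet_id, begin, end, label, extraction in rows:
        # Assumed naming scheme: one UTF-8 text file per tweet, named by its id.
        text = (Path(txt_dir) / f"{tweet_id}.txt").read_text(encoding="utf-8")
        if text[begin:end] != extraction:
            bad.append((tweet_id, begin, end, label, extraction))
    return bad


# Self-contained demo with an invented tweet (not taken from the corpus).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "123.txt").write_text("Los médicos trabajan", encoding="utf-8")
    rows = [
        ("123", 4, 11, "PROFESION", "médicos"),  # offsets match the text
        ("123", 0, 3, "PROFESION", "médicos"),   # offsets point at "Los"
    ]
    bad = mismatched_spans(rows, tmp)

print(bad)
```

If many rows come back as mismatched on real data, the offset convention or file encoding assumed here probably differs from the one used in the corpus.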

License

cc-by-4.0
