
pythainlp/thainer-corpus-v2.2

Hugging Face | Updated 2024-03-08 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/pythainlp/thainer-corpus-v2.2
Resource description:
---
language:
- th
license: cc-by-3.0
task_categories:
- token-classification
dataset_info:
  features:
  - name: words
    sequence: string
  - name: ner
    sequence:
      class_label:
        names:
          '0': B-PERSON
          '1': I-PERSON
          '2': O
          '3': B-ORGANIZATION
          '4': B-LOCATION
          '5': I-ORGANIZATION
          '6': I-LOCATION
          '7': B-DATE
          '8': I-DATE
          '9': B-TIME
          '10': I-TIME
          '11': B-MONEY
          '12': I-MONEY
          '13': B-FACILITY
          '14': I-FACILITY
          '15': B-URL
          '16': I-URL
          '17': B-PERCENT
          '18': I-PERCENT
          '19': B-LEN
          '20': I-LEN
          '21': B-AGO
          '22': I-AGO
          '23': B-LAW
          '24': I-LAW
          '25': B-PHONE
          '26': I-PHONE
          '27': B-EMAIL
          '28': I-EMAIL
          '29': B-ZIP
          '30': B-TEMPERATURE
          '31': I-TEMPERATURE
  splits:
  - name: train
    num_bytes: 3739947
    num_examples: 4379
  - name: validation
    num_bytes: 1215876
    num_examples: 1475
  - name: test
    num_bytes: 1243881
    num_examples: 1472
  download_size: 999069
  dataset_size: 6199704
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Thai NER v2.2

Thai Named Entity Recognition Corpus

**You can download the .conll file for training a named entity model from [https://zenodo.org/records/10795907](https://zenodo.org/records/10795907).**

**Size**

- Train: 3,938 docs
- Validation: 1,313 docs
- Test: 1,313 docs

Some of the data come from crowdsourcing between Dec 2018 and Nov 2019. [https://github.com/wannaphong/thai-ner](https://github.com/wannaphong/thai-ner)

**Domain**

- News (IT, politics, economy, social)
- PR (KKU news)
- General

**Source**

- I use some data from Nutcha’s theses (http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip) and improved it by rechecking and adding more tagging.
- Twitter
- Blognone.com
- IT news
- thaigov.go.th
- kku.ac.th

And more (the lists are lost).

**Tag**

- DATE - date
- TIME - time
- EMAIL - email
- LEN - length
- LOCATION - location
- ORGANIZATION - company / organization
- PERSON - person name
- PHONE - phone number
- TEMPERATURE - temperature
- URL - URL
- ZIP - zip code
- MONEY - amount of money
- LAW - legislation
- PERCENT - percentage

## Cite

> Wannaphong Phatthiyaphaibun. (2024). Thai NER 2.2 (2.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10795907

or BibTeX

```
@dataset{wannaphong_phatthiyaphaibun_2024_10795907,
  author    = {Wannaphong Phatthiyaphaibun},
  title     = {Thai NER 2.2},
  month     = mar,
  year      = 2024,
  publisher = {Zenodo},
  version   = {2.2},
  doi       = {10.5281/zenodo.10795907},
  url       = {https://doi.org/10.5281/zenodo.10795907}
}
```
Provider:
pythainlp
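Summary of original information

Thai Named Entity Recognition Dataset (Thai NER v2.2)

Dataset overview

  • Language: Thai
  • License: CC-BY-3.0
  • Task category: token classification

Dataset information

  • Features:

    • words: sequence of strings
    • ner: sequence of class labels, with the following classes: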
      • B-PERSON, I-PERSON
      • O
      • B-ORGANIZATION, I-ORGANIZATION
      • B-LOCATION, I-LOCATION
      • B-DATE, I-DATE
      • B-TIME, I-TIME
      • B-MONEY, I-MONEY
      • B-FACILITY, I-FACILITY
      • B-URL, I-URL
      • B-PERCENT, I-PERCENT
      • B-LEN, I-LEN
      • B-AGO, I-AGO
      • B-LAW, I-LAW
      • B-PHONE, I-PHONE
      • B-EMAIL, I-EMAIL
      • B-ZIP
      • B-TEMPERATURE, I-TEMPERATURE
  • Data splits:

    • Train: 4,379 examples, 3,739,947 bytes
    • Validation: 1,475 examples, 1,215,876 bytes
    • Test: 1,472 examples, 1,243,881 bytes
  • Dataset size: 6,199,704 bytes

  • Download size: 999,069 bytes
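The integer class indices stored in the `ner` feature can be mapped back to entity spans with a small BIO decoder. The following is a minimal sketch and not part of the dataset's own tooling: `NER_LABELS` copies the index order from the dataset card's metadata, while `decode_spans` and the sample sentence are illustrative assumptions (the sentence is made up, not taken from the corpus). Tokens are joined without spaces on the assumption that `words` holds Thai tokens.

```python
from typing import List, Tuple

# Label names in class-index order, as listed in the dataset card metadata.
NER_LABELS = [
    "B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "B-LOCATION",
    "I-ORGANIZATION", "I-LOCATION", "B-DATE", "I-DATE", "B-TIME", "I-TIME",
    "B-MONEY", "I-MONEY", "B-FACILITY", "I-FACILITY", "B-URL", "I-URL",
    "B-PERCENT", "I-PERCENT", "B-LEN", "I-LEN", "B-AGO", "I-AGO",
    "B-LAW", "I-LAW", "B-PHONE", "I-PHONE", "B-EMAIL", "I-EMAIL",
    "B-ZIP", "B-TEMPERATURE", "I-TEMPERATURE",
]

Span = Tuple[str, int, int, str]  # (entity type, start, end exclusive, text)


def decode_spans(words: List[str], tag_ids: List[int]) -> List[Span]:
    """Group BIO tags into spans (hypothetical helper, not a library API)."""
    spans: List[Span] = []
    start, etype = None, None
    for i, tag_id in enumerate(tag_ids):
        prefix, _, name = NER_LABELS[tag_id].partition("-")
        if prefix == "I" and name == etype:
            continue  # the open entity keeps growing
        if etype is not None:
            # Any other tag closes the open entity.
            spans.append((etype, start, i, "".join(words[start:i])))
            start, etype = None, None
        if prefix == "B":
            start, etype = i, name
        # "O" and stray I- tags are treated as outside any entity.
    if etype is not None:
        spans.append((etype, start, len(words), "".join(words[start:])))
    return spans


# Made-up example sentence: tag ids 0, 1, 2, 4 are B-PERSON, I-PERSON, O, B-LOCATION.
sample_words = ["นาย", "สมชาย", "ไป", "กรุงเทพ"]
sample_tags = [0, 1, 2, 4]
spans = decode_spans(sample_words, sample_tags)
```

An `I-` tag that does not continue an open entity of the same type is dropped in this sketch; a stricter decoder might instead raise an error or start a new span.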

Configuration

  • Default config:
    • Train: data/train-*
    • Validation: data/validation-*
    • Test: data/test-*
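Because the card defines a single default config with three splits, each split can presumably be loaded by name with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is reachable; `load_split`, `REPO_ID`, and `SPLIT_SIZES` are illustrative names, with the sizes copied from the split metadata above.

```python
REPO_ID = "pythainlp/thainer-corpus-v2.2"

# Example counts per split, as stated in the dataset card metadata.
SPLIT_SIZES = {"train": 4379, "validation": 1475, "test": 1472}


def load_split(split: str):
    """Load one split of the corpus (hypothetical wrapper around load_dataset)."""
    if split not in SPLIT_SIZES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(SPLIT_SIZES)}"
        )
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


def check_size(split: str, dataset) -> bool:
    """Return True if the loaded split has the example count the card states."""
    return len(dataset) == SPLIT_SIZES[split]
```

A typical call would be `train = load_split("train")` followed by `check_size("train", train)`; this needs network access on first use, so it is left out of the block itself.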

Dataset sources

  • Domain: news, PR, general
  • Sources:
    • Data from Nutcha's theses
    • Twitter
    • Blognone.com
    • thaigov.go.th
    • kku.ac.th

Tags

  • DATE, TIME, EMAIL, LEN, LOCATION, ORGANIZATION, PERSON, PHONE, TEMPERATURE, URL, ZIP, MONEY, LAW, PERCENT
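The documented tag list can be cross-checked against the class labels in the metadata. The sketch below is only a consistency check under the assumption that both lists are copied correctly from this card; `LABELS` and `DOCUMENTED_TAGS` are illustrative names.

```python
# Class labels from the dataset card metadata, in index order.
LABELS = """
B-PERSON I-PERSON O B-ORGANIZATION B-LOCATION I-ORGANIZATION I-LOCATION
B-DATE I-DATE B-TIME I-TIME B-MONEY I-MONEY B-FACILITY I-FACILITY
B-URL I-URL B-PERCENT I-PERCENT B-LEN I-LEN B-AGO I-AGO B-LAW I-LAW
B-PHONE I-PHONE B-EMAIL I-EMAIL B-ZIP B-TEMPERATURE I-TEMPERATURE
""".split()

# Tags described in the card's tag list.
DOCUMENTED_TAGS = {
    "DATE", "TIME", "EMAIL", "LEN", "LOCATION", "ORGANIZATION", "PERSON",
    "PHONE", "TEMPERATURE", "URL", "ZIP", "MONEY", "LAW", "PERCENT",
}

# Entity types that occur in the label set (everything except "O").
label_types = {label.split("-", 1)[1] for label in LABELS if label != "O"}

# Types present in the labels but absent from the documented tag list.
undocumented = sorted(label_types - DOCUMENTED_TAGS)

# Documented tags with no corresponding label.
missing = sorted(DOCUMENTED_TAGS - label_types)

# Types that have a B- label but no I- label.
b_only = sorted(t for t in label_types if f"I-{t}" not in LABELS)

print(undocumented)  # ['AGO', 'FACILITY']
print(missing)       # []
print(b_only)        # ['ZIP']
```

So the label set carries 16 entity types, two of which (AGO and FACILITY) have no description in the tag list, and ZIP has only a `B-` label.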

Citation

@dataset{wannaphong_phatthiyaphaibun_2024_10795907,
  author    = {Wannaphong Phatthiyaphaibun},
  title     = {Thai NER 2.2},
  month     = mar,
  year      = 2024,
  publisher = {Zenodo},
  version   = {2.2},
  doi       = {10.5281/zenodo.10795907},
  url       = {https://doi.org/10.5281/zenodo.10795907}
}
