pythainlp/thainer-corpus-v2.2
Source: Hugging Face. Updated: 2024-03-08. Indexed: 2024-06-22.
Download link:
https://hf-mirror.com/datasets/pythainlp/thainer-corpus-v2.2
Resource description:
---
language:
- th
license: cc-by-3.0
task_categories:
- token-classification
dataset_info:
  features:
  - name: words
    sequence: string
  - name: ner
    sequence:
      class_label:
        names:
          '0': B-PERSON
          '1': I-PERSON
          '2': O
          '3': B-ORGANIZATION
          '4': B-LOCATION
          '5': I-ORGANIZATION
          '6': I-LOCATION
          '7': B-DATE
          '8': I-DATE
          '9': B-TIME
          '10': I-TIME
          '11': B-MONEY
          '12': I-MONEY
          '13': B-FACILITY
          '14': I-FACILITY
          '15': B-URL
          '16': I-URL
          '17': B-PERCENT
          '18': I-PERCENT
          '19': B-LEN
          '20': I-LEN
          '21': B-AGO
          '22': I-AGO
          '23': B-LAW
          '24': I-LAW
          '25': B-PHONE
          '26': I-PHONE
          '27': B-EMAIL
          '28': I-EMAIL
          '29': B-ZIP
          '30': B-TEMPERATURE
          '31': I-TEMPERATURE
  splits:
  - name: train
    num_bytes: 3739947
    num_examples: 4379
  - name: validation
    num_bytes: 1215876
    num_examples: 1475
  - name: test
    num_bytes: 1243881
    num_examples: 1472
  download_size: 999069
  dataset_size: 6199704
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# Thai NER v2.2
Thai Named Entity Recognition Corpus
**You can download the .conll files for training a named entity recognition model from [https://zenodo.org/records/10795907](https://zenodo.org/records/10795907).**
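For readers who download the .conll files, the following is a minimal sketch of a reader. The file layout is an assumption, not something this card states: one token per line, token and tag separated by a tab, and a blank line between sentences. The function name `read_conll` and the sample lines are made up for illustration; check the real files before relying on this.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs.

    Assumed layout (not confirmed by the dataset card): "token<TAB>tag"
    on each line, with an empty line between sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab only, so a token that is itself a space survives.
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample lines, only to show the call.
sample = [
    "วันที่\tB-DATE\n",
    "1\tI-DATE\n",
    "\n",
    "กรุงเทพ\tB-LOCATION\n",
]
parsed = read_conll(sample)
```

With a real file, the same function can be called as `read_conll(open(path, encoding="utf-8"))`, where `path` points at one of the downloaded files.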
**Size**
- Train: 3,938 docs
- Validation: 1,313 docs
- Test: 1,313 docs
Some of the data comes from crowdsourcing carried out between Dec 2018 and Nov 2019: [https://github.com/wannaphong/thai-ner](https://github.com/wannaphong/thai-ner)
**Domain**
- News (IT, politics, economy, social)
- PR (KKU news)
- General
**Source**
- I use some data from Nutcha's thesis (http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip) and improved it by rechecking the annotations and adding more tags.
- Twitter
- Blognone.com - IT news
- thaigov.go.th
- kku.ac.th
And more (the list of the other sources has been lost).
**Tag**
- DATE - date
- TIME - time
- EMAIL - email
- LEN - length
- LOCATION - location
- ORGANIZATION - company / organization
- PERSON - person name
- PHONE - phone number
- TEMPERATURE - temperature
- URL - URL
- ZIP - zip code
- MONEY - amount of money
- LAW - legislation
- PERCENT - percentage
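The tags above appear in the data with `B-` (begin) and `I-` (inside) prefixes, plus `O` for tokens outside any entity. The sketch below shows one way to group such a tag sequence into entity spans. The function name `bio_to_spans` and the example sentence are made up for illustration, and the handling of a stray `I-` tag (treated as the start of a new span) is a lenient choice made here, not a rule from the corpus.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current_type: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if continues:
            continue
        # The running span (if any) ends here.
        if current_type is not None:
            spans.append((current_type, start, i))
        if prefix in ("B", "I") and entity_type:
            # "B-" opens a span; a stray "I-" is leniently treated the same way.
            current_type, start = entity_type, i
        else:
            current_type = None
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Made-up example sentence, not taken from the corpus.
words = ["นาย", "สมชาย", "ไป", "กรุงเทพ"]
tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
entities = [(t, "".join(words[s:e])) for t, s, e in bio_to_spans(tags)]
```

Tokens are joined without a separator because Thai text is written without spaces between words.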
## Cite
> Wannaphong Phatthiyaphaibun. (2024). Thai NER 2.2 (2.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10795907
Or, as BibTeX:
```
@dataset{wannaphong_phatthiyaphaibun_2024_10795907,
author = {Wannaphong Phatthiyaphaibun},
title = {Thai NER 2.2},
month = mar,
year = 2024,
publisher = {Zenodo},
version = {2.2},
doi = {10.5281/zenodo.10795907},
url = {https://doi.org/10.5281/zenodo.10795907}
}
```
Provider:
pythainlp
Summary of the original information
Thai Named Entity Recognition Dataset (Thai NER v2.2)
Dataset overview
- Language: Thai
- License: CC-BY-3.0
- Task category: token classification
Dataset information
- Features:
  - words: sequence of strings
  - ner: sequence of class labels drawn from:
    - B-PERSON, I-PERSON
    - O
    - B-ORGANIZATION, I-ORGANIZATION
    - B-LOCATION, I-LOCATION
    - B-DATE, I-DATE
    - B-TIME, I-TIME
    - B-MONEY, I-MONEY
    - B-FACILITY, I-FACILITY
    - B-URL, I-URL
    - B-PERCENT, I-PERCENT
    - B-LEN, I-LEN
    - B-AGO, I-AGO
    - B-LAW, I-LAW
    - B-PHONE, I-PHONE
    - B-EMAIL, I-EMAIL
    - B-ZIP
    - B-TEMPERATURE, I-TEMPERATURE
- Splits:
  - Train: 4,379 examples, 3,739,947 bytes
  - Validation: 1,475 examples, 1,215,876 bytes
  - Test: 1,472 examples, 1,243,881 bytes
- Dataset size: 6,199,704 bytes
- Download size: 999,069 bytes
Configuration
- Default configuration:
  - Train: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*
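A minimal sketch of loading the default configuration with the Hugging Face `datasets` library and turning the integer `ner` ids back into tag names. The label list is copied in id order from the `class_label` names in the card metadata; the helper names `ids_to_labels` and `load_thainer` are made up for illustration. Loading needs network access and the third-party `datasets` package, so the import is kept inside the function.

```python
from typing import List

# Tag names in id order, copied from the `class_label` names in the card metadata.
NER_LABELS: List[str] = [
    "B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "B-LOCATION",
    "I-ORGANIZATION", "I-LOCATION", "B-DATE", "I-DATE", "B-TIME", "I-TIME",
    "B-MONEY", "I-MONEY", "B-FACILITY", "I-FACILITY", "B-URL", "I-URL",
    "B-PERCENT", "I-PERCENT", "B-LEN", "I-LEN", "B-AGO", "I-AGO",
    "B-LAW", "I-LAW", "B-PHONE", "I-PHONE", "B-EMAIL", "I-EMAIL",
    "B-ZIP", "B-TEMPERATURE", "I-TEMPERATURE",
]


def ids_to_labels(ids: List[int]) -> List[str]:
    """Map integer class ids from the `ner` feature to their tag names."""
    return [NER_LABELS[i] for i in ids]


def load_thainer():
    """Download the corpus from the Hugging Face Hub (requires network access)."""
    from datasets import load_dataset  # third-party; imported lazily

    return load_dataset("pythainlp/thainer-corpus-v2.2")


# Typical use (commented out because it downloads data):
# ds = load_thainer()
# example = ds["train"][0]
# pairs = list(zip(example["words"], ids_to_labels(example["ner"])))
```

Hard-coding the label list keeps the helper usable offline; when the dataset is loaded, the names stored in its own feature metadata remain the authoritative source.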
Dataset sources
- Domains: news, PR, general
- Sources:
  - Data from Nutcha's thesis
  - Blognone.com
  - thaigov.go.th
  - kku.ac.th
Tags
- DATE, TIME, EMAIL, LEN, LOCATION, ORGANIZATION, PERSON, PHONE, TEMPERATURE, URL, ZIP, MONEY, LAW, PERCENT



