pythainlp/thainer-corpus-v2
Source: Hugging Face. Updated 2024-03-08; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pythainlp/thainer-corpus-v2
Resource description:
---
dataset_info:
  features:
  - name: words
    sequence: string
  - name: ner
    sequence:
      class_label:
        names:
          '0': B-PERSON
          '1': I-PERSON
          '2': O
          '3': B-ORGANIZATION
          '4': B-LOCATION
          '5': I-ORGANIZATION
          '6': I-LOCATION
          '7': B-DATE
          '8': I-DATE
          '9': B-TIME
          '10': I-TIME
          '11': B-MONEY
          '12': I-MONEY
          '13': B-FACILITY
          '14': I-FACILITY
          '15': B-URL
          '16': I-URL
          '17': B-PERCENT
          '18': I-PERCENT
          '19': B-LEN
          '20': I-LEN
          '21': B-AGO
          '22': I-AGO
          '23': B-LAW
          '24': I-LAW
          '25': B-PHONE
          '26': I-PHONE
          '27': B-EMAIL
          '28': I-EMAIL
          '29': B-ZIP
          '30': B-TEMPERATURE
          '31': I-TEMPERATURE
          '32': B-DTAE
          '33': I-DTAE
          '34': B-DATA
          '35': I-DATA
  splits:
  - name: train
    num_bytes: 3736419
    num_examples: 3938
  - name: validation
    num_bytes: 1214580
    num_examples: 1313
  - name: test
    num_bytes: 1242609
    num_examples: 1313
  download_size: 974230
  dataset_size: 6193608
license: cc-by-3.0
task_categories:
- token-classification
language:
- th
---
# Dataset Card for "thainer-corpus-v2"
## News!!!
> Thai NER v2.2 is released! Please use Thai NER 2.2 instead of this corpus.
> Thai NER v2.2: [https://huggingface.co/datasets/pythainlp/thainer-corpus-v2.2](https://huggingface.co/datasets/pythainlp/thainer-corpus-v2.2)
Thai Named Entity Recognition Corpus
Home Page: [https://pythainlp.github.io/Thai-NER/version/2](https://pythainlp.github.io/Thai-NER/version/2)
Training script and split data: [https://zenodo.org/record/7761354](https://zenodo.org/record/7761354)
**You can download the .conll files for training a named-entity model from [https://zenodo.org/record/7761354](https://zenodo.org/record/7761354).**
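For readers who work from the .conll files, the following is a minimal sketch of a reader. The column layout is an assumption, not something this card states: one token per line, the token in the first column, the NER tag in the last column, and a blank line between sentences. The sample text is made up for illustration and is not taken from the corpus.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse token-per-line CoNLL text into sentences.

    Assumed layout (not confirmed by the dataset card): token in the
    first column, NER tag in the last column, columns separated by tabs
    or spaces, and a blank line closing each sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        if not raw.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = raw.split()
        if len(columns) < 2:
            # Skip malformed rows rather than guess a tag.
            continue
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout (not real corpus data).
sample = "สมชาย\tB-PERSON\nไป\tO\n\nเชียงใหม่\tB-LOCATION\n"
parsed = read_conll(sample.splitlines())
print(parsed)
```

To read a downloaded file, pass an open file handle instead, for example `read_conll(open("train.conll", encoding="utf-8"))`, where `train.conll` is a hypothetical file name. Check the actual files first: a row whose token is itself whitespace would be skipped by this sketch.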
**Size**
- Train: 3,938 docs
- Validation: 1,313 docs
- Test: 1,313 docs
Some of the data came from crowdsourcing between Dec 2018 and Nov 2019. [https://github.com/wannaphong/thai-ner](https://github.com/wannaphong/thai-ner)
**Domain**
- News (IT, politics, economy, social)
- PR (KKU news)
- General
**Source**
- I used some data from Nutcha’s thesis (http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip) and improved it by rechecking the annotations and adding more tags.
- Twitter
- Blognone.com - IT news
- thaigov.go.th
- kku.ac.th
And more (the full list of sources has been lost).
**Tag**
- DATE - date (the label set also contains `DTAE` and `DATA`, which appear to be misspellings of `DATE`)
- TIME - time
- EMAIL - email
- LEN - length
- LOCATION - Location
- ORGANIZATION - Company / Organization
- PERSON - Person name
- PHONE - phone number
- TEMPERATURE - temperature
- URL - URL
- ZIP - Zip code
- MONEY - amount of money
- LAW - legislation
- PERCENT - percentage
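The labels in the `ner` feature carry these tags with `B-` (beginning) and `I-` (inside) prefixes, plus `O` for tokens outside any entity. A minimal sketch of grouping such a tag sequence into entity spans follows. The function name `bio_to_spans` and the lenient handling of a stray `I-` tag (treated as the start of a new span) are choices made for this example, not part of the corpus tooling.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group a BIO tag sequence into (type, start, end) spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == label
        if not continues and label is not None:
            # The open span ends just before this token.
            spans.append((label, start, i))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and not continues):
            # A stray I- tag opens a new span (lenient choice).
            start, label = i, kind
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
print(bio_to_spans(tags))
```

Because Thai is written without spaces between words, the surface text of a span can be rebuilt with `"".join(words[start:end])`.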
Download: [HuggingFace Hub](https://huggingface.co/datasets/pythainlp/thainer-corpus-v2)
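A minimal sketch of loading the corpus from the Hub and turning the integer `ner` ids back into label names. It assumes the Hugging Face `datasets` library is installed. `LABELS` is copied from the class list in this card's metadata, and `decode_tags` and `load_corpus` are hypothetical helper names used only for this example.

```python
from typing import List

# Label names in id order, copied from the dataset card metadata
# (including the DTAE and DATA spellings, kept as published).
LABELS: List[str] = [
    "B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "B-LOCATION",
    "I-ORGANIZATION", "I-LOCATION", "B-DATE", "I-DATE", "B-TIME",
    "I-TIME", "B-MONEY", "I-MONEY", "B-FACILITY", "I-FACILITY",
    "B-URL", "I-URL", "B-PERCENT", "I-PERCENT", "B-LEN", "I-LEN",
    "B-AGO", "I-AGO", "B-LAW", "I-LAW", "B-PHONE", "I-PHONE",
    "B-EMAIL", "I-EMAIL", "B-ZIP", "B-TEMPERATURE", "I-TEMPERATURE",
    "B-DTAE", "I-DTAE", "B-DATA", "I-DATA",
]


def decode_tags(ids: List[int]) -> List[str]:
    """Map integer class ids from the `ner` column to label names."""
    return [LABELS[i] for i in ids]


def load_corpus():
    """Download the corpus from the Hub (needs network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset("pythainlp/thainer-corpus-v2")


print(decode_tags([0, 1, 2, 4]))
```

With network access, `corpus = load_corpus()` should give the `train`, `validation`, and `test` splits, and `decode_tags(corpus["train"][0]["ner"])` lines up with `corpus["train"][0]["words"]`. The loaded dataset's own feature metadata also carries the label names, which is safer than a hard-coded list if the card and the data ever diverge.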
## Cite
> Wannaphong Phatthiyaphaibun. (2022). Thai NER 2.0 (2.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7761354
or, in BibTeX:
```
@dataset{wannaphong_phatthiyaphaibun_2022_7761354,
author = {Wannaphong Phatthiyaphaibun},
title = {Thai NER 2.0},
month = sep,
year = 2022,
publisher = {Zenodo},
version = {2.0},
doi = {10.5281/zenodo.7761354},
url = {https://doi.org/10.5281/zenodo.7761354}
}
```
Provider:
pythainlp
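Summary of original information
Dataset overview
Dataset name
- Name: thainer-corpus-v2
Dataset features
- words: sequence of strings
- ner: sequence of named-entity recognition labels, with the following classes:
- B-PERSON, I-PERSON, O, B-ORGANIZATION, B-LOCATION, I-ORGANIZATION, I-LOCATION, B-DATE, I-DATE, B-TIME, I-TIME, B-MONEY, I-MONEY, B-FACILITY, I-FACILITY, B-URL, I-URL, B-PERCENT, I-PERCENT, B-LEN, I-LEN, B-AGO, I-AGO, B-LAW, I-LAW, B-PHONE, I-PHONE, B-EMAIL, I-EMAIL, B-ZIP, B-TEMPERATURE, I-TEMPERATURE, B-DTAE, I-DTAE, B-DATA, I-DATA
Dataset splits
- train: 3938 examples, 3736419 bytes
- validation: 1313 examples, 1214580 bytes
- test: 1313 examples, 1242609 bytes
Dataset size
- Download size: 974230 bytes
- Total dataset size: 6193608 bytes
License
- License: cc-by-3.0
Task category
- Task: token-classification
Language
- Language: th (Thai)
Data sources
- Sources include news, PR, Twitter, Blognone.com, and others; part of the data was collected through crowdsourcing.
Tag description
- The tags cover date, time, email, length, location, organization, person name, phone number, temperature, URL, zip code, money, law, percent, and others.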



