lst-nectec/lst20

Name: lst-nectec/lst20
Creator: lst-nectec
Published: 2024-01-18 11:08:24
License: no description provided

Source: Hugging Face; updated 2024-01-18; indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/lst-nectec/lst20

Resource overview:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- th
license:
- other
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
- part-of-speech
pretty_name: LST20
tags:
- word-segmentation
- clause-segmentation
- sentence-segmentation
dataset_info:
  features:
  - name: id
    dtype: string
  - name: fname
    dtype: string
  - name: tokens
    sequence: string
  - name: pos_tags
    sequence:
      class_label:
        names:
          '0': NN
          '1': VV
          '2': PU
          '3': CC
          '4': PS
          '5': AX
          '6': AV
          '7': FX
          '8': NU
          '9': AJ
          '10': CL
          '11': PR
          '12': NG
          '13': PA
          '14': XX
          '15': IJ
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B_BRN
          '2': B_DES
          '3': B_DTM
          '4': B_LOC
          '5': B_MEA
          '6': B_NUM
          '7': B_ORG
          '8': B_PER
          '9': B_TRM
          '10': B_TTL
          '11': I_BRN
          '12': I_DES
          '13': I_DTM
          '14': I_LOC
          '15': I_MEA
          '16': I_NUM
          '17': I_ORG
          '18': I_PER
          '19': I_TRM
          '20': I_TTL
          '21': E_BRN
          '22': E_DES
          '23': E_DTM
          '24': E_LOC
          '25': E_MEA
          '26': E_NUM
          '27': E_ORG
          '28': E_PER
          '29': E_TRM
          '30': E_TTL
  - name: clause_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B_CLS
          '2': I_CLS
          '3': E_CLS
  config_name: lst20
  splits:
  - name: train
    num_bytes: 107725145
    num_examples: 63310
  - name: validation
    num_bytes: 9646167
    num_examples: 5620
  - name: test
    num_bytes: 8217425
    num_examples: 5250
  download_size: 0
  dataset_size: 125588737
---

# Dataset Card for LST20

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://aiforthai.in.th/
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** [email](thepchai@nectec.or.th)

### Dataset Summary

LST20 Corpus is a dataset for Thai language processing developed by the National Electronics and Computer Technology Center (NECTEC), Thailand. It offers five layers of linguistic annotation: word boundaries, POS tagging, named entities, clause boundaries, and sentence boundaries. It consists of 3,164,002 words, 288,020 named entities, 248,181 clauses, and 74,180 sentences, and it is annotated with 16 distinct POS tags. All 3,745 documents are also annotated with one of 15 news genres. Given its size, this dataset is considered large enough for developing joint neural models for NLP.

Manually download at https://aiforthai.in.th/corpus.php

See `LST20 Annotation Guideline.pdf` and `LST20 Brief Specification.pdf` within the downloaded `AIFORTHAI-LST20Corpus.tar.gz` for more details.

### Supported Tasks and Leaderboards

- POS tagging
- NER tagging
- clause segmentation
- sentence segmentation
- word tokenization

### Languages

Thai

## Dataset Structure

### Data Instances

```
{'clause_tags': [1, 2, 2, 2, 2, 2, 2, 2, 3],
 'fname': 'T11964.txt',
 'id': '0',
 'ner_tags': [8, 0, 0, 0, 0, 0, 0, 0, 25],
 'pos_tags': [0, 0, 0, 1, 0, 8, 8, 8, 0],
 'tokens': ['ธรรมนูญ', 'แชมป์', 'สิงห์คลาสสิก', 'กวาด', 'รางวัล', 'แสน', 'สี่', 'หมื่น', 'บาท']}
{'clause_tags': [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3],
 'fname': 'T11964.txt',
 'id': '1',
 'ner_tags': [8, 18, 28, 0, 0, 0, 0, 6, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 15, 25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 6],
 'pos_tags': [0, 2, 0, 2, 1, 1, 2, 8, 2, 10, 2, 8, 2, 1, 0, 1, 0, 4, 7, 1, 0, 2, 8, 2, 10, 1, 10, 4, 2, 8, 2, 4, 0, 4, 0, 2, 8, 2, 10, 2, 8],
 'tokens': ['ธรรมนูญ', '_', 'ศรีโรจน์', '_', 'เก็บ', 'เพิ่ม', '_', '4', '_', 'อันเดอร์พาร์', '_', '68', '_', 'เข้า', 'ป้าย', 'รับ', 'แชมป์', 'ใน', 'การ', 'เล่น', 'อาชีพ', '_', '19', '_', 'ปี', 'เป็น', 'ครั้ง', 'ที่', '_', '8', '_', 'ใน', 'ชีวิต', 'ด้วย', 'สกอร์', '_', '18', '_', 'อันเดอร์พาร์', '_', '270']}
```

### Data Fields

- `id`: nth sentence in each set, starting at 0
- `fname`: text file from which the sentence comes
- `tokens`: word tokens
- `pos_tags`: POS tags
- `ner_tags`: NER tags
- `clause_tags`: clause tags

### Data Splits

|                       | train     | eval        | test        | all       |
|-----------------------|-----------|-------------|-------------|-----------|
| words                 | 2,714,848 | 240,891     | 207,295     | 3,163,034 |
| named entities        | 246,529   | 23,176      | 18,315      | 288,020   |
| clauses               | 214,645   | 17,486      | 16,050      | 248,181   |
| sentences             | 63,310    | 5,620       | 5,250       | 74,180    |
| distinct words        | 42,091    | (oov) 2,595 | (oov) 2,006 | 46,692    |
| breaking spaces※      | 63,310    | 5,620       | 5,250       | 74,180    |
| non-breaking spaces※※ | 402,380   | 39,920      | 32,204      | 475,504   |

※ Breaking space = space that is used as a sentence boundary marker

※※ Non-breaking space = space that is not used as a sentence boundary marker

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

Respective authors of the news articles

### Annotations

#### Annotation process

The detailed annotation guideline can be found in `LST20 Annotation Guideline.pdf`.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

All texts are from public news. No personal or sensitive information is expected to be included.

## Considerations for Using the Data

### Social Impact of Dataset

- Large-scale Thai NER & POS tagging, clause & sentence segmentation, word tokenization

### Discussion of Biases

- All 3,745 texts are from the news domain:
  - politics: 841
  - crime and accident: 592
  - economics: 512
  - entertainment: 472
  - sports: 402
  - international: 279
  - science, technology and education: 216
  - health: 92
  - general: 75
  - royal: 54
  - disaster: 52
  - development: 45
  - environment: 40
  - culture: 40
  - weather forecast: 33
- Word tokenization is done according to the InterBEST 2009 Guideline.

### Other Known Limitations

- Some NER tags do not correspond with given labels (`B`, `I`, and so on)

## Additional Information

### Dataset Curators

[NECTEC](https://www.nectec.or.th/en/)

### Licensing Information

1. Non-commercial use, research, and open source

   Any non-commercial use of the dataset for research and open-sourced projects is encouraged and free of charge. Please cite our technical report for reference.

   If you want to perpetuate your models trained on our dataset and share them to the research community in Thailand, please send your models, code, and APIs to the AI for Thai Project. Please contact Dr. Thepchai Supnithi via thepchai@nectec.or.th for more information.

   Note that modification and redistribution of the dataset by any means are strictly prohibited unless authorized by the corpus authors.

2. Commercial use

   In any commercial use of the dataset, there are two options.

   - Option 1 (in kind): Contributing a dataset of 50,000 words completely annotated with our annotation scheme within 1 year. Your data will also be shared and recognized as a dataset co-creator in the research community in Thailand.
   - Option 2 (in cash): Purchasing a lifetime license for the entire dataset is required. The purchased rights of use cover only this dataset.

   In both options, please contact Dr. Thepchai Supnithi via thepchai@nectec.or.th for more information.

### Citation Information

```
@article{boonkwan2020annotation,
  title={The Annotation Guideline of LST20 Corpus},
  author={Boonkwan, Prachya and Luantangsrisuk, Vorapon and Phaholphinyo, Sitthaa and Kriengket, Kanyanat and Leenoi, Dhanon and Phrombut, Charun and Boriboon, Monthika and Kosawat, Krit and Supnithi, Thepchai},
  journal={arXiv preprint arXiv:2008.05055},
  year={2020}
}
```

### Contributions

Thanks to [@cstorm125](https://github.com/cstorm125) for adding this dataset.

The LST20 Corpus is a dataset for Thai language processing developed by the National Electronics and Computer Technology Center (NECTEC), Thailand. It offers five layers of linguistic annotation: word boundaries, POS tagging, named entities, clause boundaries, and sentence boundaries. The dataset consists of over 3 million words, 288,000 named entities, 248,000 clauses, and 74,000 sentences, annotated with 16 distinct POS tags. All 3,745 documents are annotated with one of 15 news genres. This dataset is considered large enough for developing joint neural models for NLP.

Provider:

lst-nectec

Summary of original information

Dataset overview

Basic information

Dataset name: LST20
Language: Thai
License: other
Multilinguality: monolingual
Dataset size: 10K<n<100K
Source data: original
Task categories:
- POS tagging
- NER tagging
- clause segmentation
- sentence segmentation
- word tokenization

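Dataset structure

Data fields

id: sentence number within each split, starting at 0
fname: name of the text file the sentence comes from
tokens: word tokens
pos_tags: POS tags
ner_tags: NER tags
clause_tags: clause tags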
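
To illustrate how the integer tag fields map back to label names, the sketch below decodes one record. The label lists are copied from the `class_label` names in the dataset card, and the sample record is the first instance shown there. The `decode` helper is a hypothetical name used for illustration only and is not part of any official LST20 tooling.

```python
# Label lists taken from the dataset card's YAML header (list index = class id).
POS_NAMES = ["NN", "VV", "PU", "CC", "PS", "AX", "AV", "FX",
             "NU", "AJ", "CL", "PR", "NG", "PA", "XX", "IJ"]
NER_TYPES = ["BRN", "DES", "DTM", "LOC", "MEA", "NUM", "ORG", "PER", "TRM", "TTL"]
# Ids 1-10 are B_*, 11-20 are I_*, 21-30 are E_*, in the same type order.
NER_NAMES = ["O"] + [f"{prefix}_{typ}" for prefix in "BIE" for typ in NER_TYPES]
CLAUSE_NAMES = ["O", "B_CLS", "I_CLS", "E_CLS"]

# First data instance from the dataset card.
record = {
    "id": "0",
    "fname": "T11964.txt",
    "tokens": ["ธรรมนูญ", "แชมป์", "สิงห์คลาสสิก", "กวาด", "รางวัล",
               "แสน", "สี่", "หมื่น", "บาท"],
    "pos_tags": [0, 0, 0, 1, 0, 8, 8, 8, 0],
    "ner_tags": [8, 0, 0, 0, 0, 0, 0, 0, 25],
    "clause_tags": [1, 2, 2, 2, 2, 2, 2, 2, 3],
}


def decode(rec):
    """Return one (token, POS, NER, clause) tuple of label names per token."""
    return [
        (tok, POS_NAMES[p], NER_NAMES[n], CLAUSE_NAMES[c])
        for tok, p, n, c in zip(rec["tokens"], rec["pos_tags"],
                                rec["ner_tags"], rec["clause_tags"])
    ]


decoded = decode(record)
for row in decoded:
    print(row)
```

Building `NER_NAMES` from a prefix and type product keeps the 31 labels in the same order as the card; spelling the list out explicitly would work equally well.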

Data splits

Training set:
- Bytes: 107725145
- Examples: 63310
Validation set:
- Bytes: 9646167
- Examples: 5620
Test set:
- Bytes: 8217425
- Examples: 5250

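Dataset creation

Annotation process

The detailed annotation guideline can be found in LST20 Annotation Guideline.pdf.

Considerations for use

Social impact of the dataset

Large-scale Thai NER and POS tagging, clause and sentence segmentation, and word tokenization.

Known limitations

Some NER tags do not correspond with the given labels (B, I, and so on).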
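
One way to surface such mismatches is to scan each sentence for B/I/E sequences that do not form a complete entity. The sketch below assumes, based only on the two sample records in the dataset card and not on the annotation guideline, that a single-token entity is tagged with a lone `B_x` and a longer entity with `B_x`, optional `I_x` tags, then `E_x`. The function name `check_bie` is hypothetical.

```python
NER_TYPES = ["BRN", "DES", "DTM", "LOC", "MEA", "NUM", "ORG", "PER", "TRM", "TTL"]
NER_NAMES = ["O"] + [f"{prefix}_{typ}" for prefix in "BIE" for typ in NER_TYPES]


def check_bie(labels):
    """Split a B/I/E label sequence into entity spans and suspicious positions.

    Assumption (inferred from the sample records, not from the guideline):
    a lone B_x is a single-token entity; longer entities are B_x, I_x..., E_x.
    Returns (spans, suspicious): spans are (start, end, type) with inclusive
    token indices; suspicious is a list of token indices.
    """
    spans, suspicious = [], []
    start = etype = None  # the currently open entity, if any

    def close(end):
        # The open entity was interrupted at index `end` before any E_ tag.
        if end - start == 1:
            spans.append((start, start, etype))   # lone B_x: single token
        else:
            suspicious.extend(range(start, end))  # B_x I_x ... with no E_x

    for i, label in enumerate(labels):
        prefix, _, typ = label.partition("_")
        continues = start is not None and typ == etype
        if prefix in ("I", "E") and continues:
            if prefix == "E":
                spans.append((start, i, etype))
                start = etype = None
            continue
        if start is not None:
            close(i)
            start = etype = None
        if prefix == "B":
            start, etype = i, typ
        elif label != "O":
            suspicious.append(i)  # I_x or E_x with no matching open entity
    if start is not None:
        close(len(labels))
    return spans, suspicious


# NER ids of the first sample record in the dataset card.
first = [NER_NAMES[i] for i in [8, 0, 0, 0, 0, 0, 0, 0, 25]]
spans, suspicious = check_bie(first)
print(spans)       # [(0, 0, 'PER')]
print(suspicious)  # [8]
```

On the first sample record this flags the final token, which carries `E_MEA` without a preceding `B_MEA`. Whether a flagged position is a genuine annotation error is a judgment for the reader; the check only reports departures from the assumed scheme.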

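Additional information

Dataset curators

National Electronics and Computer Technology Center (NECTEC)

Licensing information

Non-commercial use, research, and open-source projects: free of charge.
Commercial use: contact Dr. Thepchai Supnithi (thepchai@nectec.or.th).

Citation information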

@article{boonkwan2020annotation, title={The Annotation Guideline of LST20 Corpus}, author={Boonkwan, Prachya and Luantangsrisuk, Vorapon and Phaholphinyo, Sitthaa and Kriengket, Kanyanat and Leenoi, Dhanon and Phrombut, Charun and Boriboon, Monthika and Kosawat, Krit and Supnithi, Thepchai}, journal={arXiv preprint arXiv:2008.05055}, year={2020} }

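Collected summary

Dataset introduction

Construction

The LST20 dataset was developed by Thailand's National Electronics and Computer Technology Center (NECTEC) for Thai language processing. Experts annotated 3,745 news articles at five layers: word boundaries, POS tags, named entities, clause boundaries, and sentence boundaries. The annotation follows the detailed LST20 annotation guideline, which is intended to keep the annotation consistent.

Features

The distinguishing features of LST20 are its multi-layer annotation and its size. It contains 3,164,002 words, 288,020 named entities, 248,181 clauses, and 74,180 sentences, annotated with 16 distinct POS tags. Every document is also labeled with one of 15 news genres, which makes the corpus usable for a broad range of Thai NLP work.

Usage

LST20 supports several NLP tasks: POS tagging, named entity recognition, clause segmentation, sentence segmentation, and word tokenization. The dataset card is hosted on Hugging Face, and the card directs users to download the corpus manually from https://aiforthai.in.th/corpus.php; data processing and model training should follow the provided annotation guideline. Non-commercial use is free of charge but requires citing the technical report. Commercial use requires contacting NECTEC for a license.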

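Background and challenges

Background

LST20 is a corpus for Thai language processing developed by Thailand's National Electronics and Computer Technology Center (NECTEC). It provides five layers of linguistic annotation: word boundaries, POS tags, named entities, clause boundaries, and sentence boundaries. It contains 3,164,002 words, 288,020 named entities, 248,181 clauses, and 74,180 sentences, annotated with 16 distinct POS tags. All 3,745 documents are also labeled with one of 15 news genres. Given its size, the dataset is considered large enough for developing joint neural models for NLP.

Current challenges

Building LST20 involved several challenges. First, Thai is written without spaces between words, so identifying word boundaries is a major difficulty. Second, named entity recognition (NER) and POS tagging lack clear-cut rules and standards for Thai, which makes annotation complex and error-prone. Third, annotating clause and sentence boundaries requires specialized linguistic knowledge to keep the annotation consistent and accurate. Finally, the size and variety of the corpus call for efficient annotation tools and methods.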
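
To make the word-boundary challenge concrete, the sketch below rebuilds an unsegmented string from a token list and records where each word ends, which is the kind of target a word segmenter would be trained to predict. It assumes that the `_` token in the sample records stands for a space inside the sentence; that reading is inferred from the samples and the non-breaking-space counts in the dataset card, and the helper name `to_text_and_boundaries` is hypothetical.

```python
def to_text_and_boundaries(tokens, space_token="_"):
    """Join tokens into running text and return the end offset of each token.

    Assumption: `space_token` ("_" in the sample records) represents a space
    inside the sentence and is written out as " ".
    """
    pieces, boundaries, offset = [], [], 0
    for tok in tokens:
        piece = " " if tok == space_token else tok
        pieces.append(piece)
        offset += len(piece)       # offsets are in Unicode code points
        boundaries.append(offset)  # position just after this token
    return "".join(pieces), boundaries


# Opening tokens of the second sample record in the dataset card.
sample = ["ธรรมนูญ", "_", "ศรีโรจน์", "_", "เก็บ", "เพิ่ม", "_", "4"]
text, boundaries = to_text_and_boundaries(sample)
print(text)
print(boundaries)
```

Offsets here count Unicode code points, so Thai vowel and tone marks each count as one position; a segmenter that works on grapheme clusters would need a different offset convention.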

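Common scenarios

Typical use cases

LST20 is widely applicable in Thai NLP, particularly for POS tagging, NER tagging, clause segmentation, sentence segmentation, and word tokenization. Its multi-layer annotation and large size make it well suited to developing and evaluating Thai language processing models.

Academic problems addressed

By providing detailed linguistic annotation for Thai, LST20 addresses several research needs in Thai NLP. It supplies training data for POS tagging and named entity recognition, and it supplies the boundary information needed for clause and sentence segmentation. These annotations support research on Thai language processing models.

Derived work

Work building on LST20 includes, among other things, improvements to Thai POS tagging models, refinements to named entity recognition systems, and advances in clause and sentence segmentation. This work contributes to both research on and practical applications of Thai NLP.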

The content above was collected and summarized by 遇见数据集.
