LST20 Corpus
Source: arXiv. Updated: 2020-08-12. Indexed: 2024-06-21.
Download link:
https://aiforthai.in.th
Overview:
The LST20 Corpus is a large-scale Thai corpus created by the National Electronics and Computer Technology Center (NECTEC). It contains 3,745 documents covering domains such as politics and economics. The dataset provides several layers of linguistic annotation, including word segmentation, part-of-speech tagging, and named entity recognition. In total it contains 3,164,864 words and 288,020 named entities, making it suitable for developing natural language processing models. The data follows the CoNLL-2003 format, which simplifies its use in research.
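Since the description says the data follows the CoNLL-2003 format, a reader might load it along the lines of the sketch below. This is only an illustration under stated assumptions: one token per line, whitespace-separated columns, and a blank line between sentences. The column order (token, part-of-speech tag, named-entity tag), the tag names, and the function name `read_conll` are hypothetical and not taken from the corpus documentation, so check them against the actual files.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, ...]  # one row of columns, e.g. (token, POS tag, NE tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group column rows into sentences, splitting on blank lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # CoNLL-2003 document separator; it carries no token data.
            continue
        current.append(tuple(line.split()))
    if current:
        # Keep the last sentence when the input lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Made-up sample rows; the tokens and tags are placeholders, not corpus data.
sample = """\
word1 NN B_PER
word2 VV O

word3 NN B_LOC
"""

sentences = read_conll(sample.splitlines())
token_count = sum(len(sentence) for sentence in sentences)
print(len(sentences), token_count)
```

To read a real file, pass an open file handle instead of `sample.splitlines()`, since the function accepts any iterable of lines.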
Provider:
National Electronics and Computer Technology Center
Created:
2020-08-12