Tibetan News Text Classification and Continuation Dataset (2020–2026)
DataCite Commons | Updated: 2026-04-30 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=4a226478c35e4a40973ca99a2d5b1b15
Resource description:
Tibetan is a low-resource language. Although a large volume of original Tibetan news text is available online, using it directly for natural language processing is hampered by inconsistent formatting, heavy noise, and a lack of task-oriented annotation. High-quality, finely annotated datasets remain scarce, which holds back the progress of Tibetan information processing from general pre-training toward specialized applications. To address this, this work constructs and publishes a high-quality, task-oriented text dataset of Tibetan news. The data were collected from 2020 to 2026 and cover news released by Tibetan media platforms in the Xizang Autonomous Region of China and its surrounding areas. The dataset contains 10483 news texts, each stored as a UTF-8 encoded txt file and manually labeled by multiple native Tibetan speakers into one of seven categories: law, science and health, eco-tourism, current politics, world outlook, Xizang culture, and new agriculture and animal husbandry.
Preprocessing consists of sentence segmentation at Tibetan double shad marks or paragraph breaks, removal of short fragments, encoding unification, and deduplication, which together remove noise from the raw data. The recommended split shuffles all samples with a fixed random seed (42) and divides them at load time into 90% training and 10% validation.
To verify data quality, two kinds of experiments were conducted: text classification and text generation. In the classification task, baseline models such as Naive Bayes, logistic regression, random forest, SVM, and TextCNN were evaluated; the best result on the validation set was an accuracy of 81.68% with a macro-averaged F1 score of 0.7968. In the generation task, the perplexity of a two-layer syllable-level LSTM language model decreased from 750.66 to 32.82.
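The preprocessing and split described above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline. Several details are assumptions: the sentence-ending double mark is taken to be the nyis shad (U+0F0E, or two consecutive U+0F0D characters), "short" is taken to mean fewer than `min_syllables` tsheg-separated syllables, deduplication is exact-match, and the function names are hypothetical.

```python
import random
import re

# Assumption: a sentence ends at a double shad, encoded either as U+0F0E
# or as two consecutive U+0F0D marks. The dataset's exact rule may differ.
DOUBLE_SHAD = re.compile("\u0f0e|\u0f0d\\s*\u0f0d")
TSHEG = "\u0f0b"  # Tibetan syllable delimiter


def split_sentences(text, min_syllables=3):
    """Split on line breaks and double shad marks; drop short fragments.

    min_syllables is a hypothetical threshold, not a value from the source.
    """
    sentences = []
    for paragraph in text.splitlines():
        for piece in DOUBLE_SHAD.split(paragraph):
            piece = piece.strip()
            # Count syllables as non-empty chunks between tsheg marks.
            n_syllables = len([s for s in piece.split(TSHEG) if s.strip()])
            if n_syllables >= min_syllables:
                sentences.append(piece)
    return sentences


def deduplicate(items):
    """Remove exact duplicates, keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


def train_val_split(samples, seed=42, train_ratio=0.9):
    """Shuffle with a fixed seed, then cut into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)  # floor rounding is an assumption
    return shuffled[:cut], shuffled[cut:]
```

With 10483 samples and floor rounding, this sketch yields 9434 training and 1049 validation samples. The exact membership of each part depends on the shuffling implementation, so a split produced with a different library's random number generator will generally differ even with seed 42.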
Unlike raw news data that has merely been collected, this dataset has undergone systematic cleaning, annotation, and quality verification, and it shows clear category boundaries and learnable language patterns. It can be used for downstream tasks such as Tibetan text classification, text continuation, and language model fine-tuning, providing a standardized, reusable data foundation for natural language processing research on low-resource languages.
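For readers who want to compare their own models against the reported validation figures, the two metrics named in the description can be computed as in the sketch below. It uses the standard definitions (macro F1 as the unweighted mean of per-class F1, perplexity as the exponential of the mean negative log-likelihood per token). The description does not state which implementation the authors used, so treat this as an assumption; the function names are illustrative.

```python
import math


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token.

    For a syllable-level model, each entry is the log-probability the model
    assigned to one syllable of the evaluation text.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

As a sanity check on the scale of the numbers, a model that spreads probability uniformly over V candidate syllables has perplexity V, so lower values mean the model concentrates probability on the syllables that actually occur.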
Provider:
Science Data Bank
Created:
2026-04-28



