Kyrgyz Multilabel Topic Classification Dataset
Source: arXiv | Updated: 2023-08-30 | Indexed: 2024-06-21
Download link:
https://github.com/alexeyev/kyrgyz-multi-label-topic-classification
Description:
The Kyrgyz Multilabel Topic Classification Dataset was created by a research team at the Steklov Institute of Mathematics at Saint Petersburg, Russian Academy of Sciences, for multilabel topic classification in the Kyrgyz language. It comprises 1,500 annotated news articles from the 24.kg news website, collected between May 2017 and October 2022. Construction involved article collection, translation, and cluster-based annotation using the SentenceBERT model. The dataset targets natural language processing (NLP) for Kyrgyz, addressing the challenge of topic classification in a low-resource language and providing foundational data for further Kyrgyz NLP research.
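To illustrate what "multilabel" means for a dataset of this kind, here is a minimal sketch of how per-article topic sets could be turned into 0/1 target vectors for a classifier. The topic names, the example label sets, and the helper functions `build_label_index` and `to_multi_hot` are hypothetical and are not taken from the dataset or its repository.

```python
from typing import Dict, List, Sequence


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct topic to a column index, in sorted order."""
    topics = sorted({topic for labels in label_sets for topic in labels})
    return {topic: i for i, topic in enumerate(topics)}


def to_multi_hot(labels: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Encode one article's topics as a 0/1 vector over all known topics."""
    row = [0] * len(index)
    for topic in labels:
        row[index[topic]] = 1
    return row


# Hypothetical label sets, for illustration only (not real dataset labels).
article_topics = [
    ["politics", "economy"],  # one article may carry several topics at once
    ["sports"],
    ["economy"],
]

label_index = build_label_index(article_topics)
targets = [to_multi_hot(labels, label_index) for labels in article_topics]
```

Each row of `targets` has one column per topic, so an article with two topics simply has two entries set to 1. This is the usual target layout for multilabel classifiers, as opposed to a single class id per article.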
Provider:
Steklov Institute of Mathematics at Saint Petersburg, Russia
Created:
2023-08-30



