Electrotubbie/classification_Turkic_languages
Source: Hugging Face. Updated 2024-01-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Electrotubbie/classification_Turkic_languages
Resource description:
---
task_categories:
- text-classification
language:
- ba
- kk
- ky
size_categories:
- 100K<n<1M
---
## Description
A dataset with texts and the categories to which these texts belong.
## Usage
This dataset can be used to evaluate how accurately language models classify texts by category.
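A minimal sketch of such an evaluation, assuming each row is a mapping with the `processed_text` and `category` fields described in this card. The `accuracy` helper, the `classify` callable, and the sample rows (including their category names) are hypothetical placeholders, not part of the dataset. Real rows could be obtained with the Hugging Face `datasets` library, for example `load_dataset("Electrotubbie/classification_Turkic_languages")`; the available split names are not stated in this card.

```python
from typing import Callable, Iterable, Mapping


def accuracy(rows: Iterable[Mapping[str, str]],
             classify: Callable[[str], str]) -> float:
    """Share of rows whose predicted category equals the `category` field."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to evaluate")
    correct = sum(classify(r["processed_text"]) == r["category"] for r in rows)
    return correct / len(rows)


# Made-up rows for illustration only; the category names are placeholders.
sample_rows = [
    {"processed_text": "text about football", "category": "sport"},
    {"processed_text": "text about elections", "category": "politics"},
    {"processed_text": "text about a concert", "category": "culture"},
    {"processed_text": "text about hockey", "category": "sport"},
]


def always_sport(text: str) -> str:
    """Trivial baseline standing in for a real model."""
    return "sport"


baseline_accuracy = accuracy(sample_rows, always_sport)
```

Any model can be plugged in by wrapping it in a function that maps a text to a category label.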
## Dataset structure
- **lang**: the language to which the text source belongs;
- **title**: the title of the text;
- **original_text**: original text taken from a web page;
- **processed_text**: processed text using preprocessing functions;
- **category**: the category to which the text belongs;
- **processed**: flag indicating that one or more sentences have been deleted from the text;
- **url**: link to the source;
- **date**: date of publication of the text.
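The fields above can be used to slice the data, for example by language or by whether preprocessing removed any sentences. The sketch below is illustrative only: it assumes that `lang` holds codes such as `ba`, `kk`, and `ky` (as in the card metadata) and that `processed` is boolean-like. Neither is confirmed here, and the helper names and sample rows are made up.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping

Row = Mapping[str, Any]


def count_by_lang(rows: Iterable[Row]) -> Counter:
    """Number of rows per value of the `lang` field."""
    return Counter(r["lang"] for r in rows)


def untouched_rows(rows: Iterable[Row], lang: str) -> List[Row]:
    """Rows in one language whose text lost no sentences in preprocessing."""
    return [r for r in rows if r["lang"] == lang and not r["processed"]]


# Made-up rows; only the fields used by the helpers are filled in.
demo_rows = [
    {"lang": "kk", "title": "t1", "processed": False},
    {"lang": "kk", "title": "t2", "processed": True},
    {"lang": "ba", "title": "t3", "processed": False},
]

lang_counts = count_by_lang(demo_rows)
clean_kk = untouched_rows(demo_rows, "kk")
```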
## The creation process
This dataset was obtained by parsing news resources of countries and regions of native speakers of Turkic languages, such as Bashkir, Kazakh and Kyrgyz.
During parsing, each article was assumed a priori to be written in the language of the region that the news was about.
After parsing, the text of the articles was run through the preprocessing functions described on [GitHub](https://github.com/Electrotubbie/turk_langs_analyse).
The scheme of text preprocessing and validation is as follows:
- cleaning the text from unnecessary constructions using regular expressions;
- splitting text into sentences using the sentenize function of the razdel module;
- predicting the language of each sentence using the lid.176.bin model through the fasttext module;
- deleting sentences written in non-Turkic languages;
- combining valid sentences into text and getting the processed_text column.
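The steps above can be sketched roughly as follows. This is not the author's implementation (that lives in the linked repository). The cleaning regexes, the `TURKIC_CODES` set, and all function names are assumptions made for illustration. The sentence splitter and the language predictor are passed in as callables; `razdel_split` and `make_fasttext_predictor` show how `razdel.sentenize` and a fastText model loaded from `lid.176.bin` could fill those roles, and they import their libraries only when called.

```python
import re
from typing import Callable, Collection, List, Tuple

# Assumed, illustrative set of Turkic language codes; the real list may differ.
TURKIC_CODES = frozenset(
    {"az", "ba", "cv", "kk", "ky", "tk", "tr", "tt", "ug", "uz"}
)


def clean_text(text: str) -> str:
    """Hypothetical cleaning step: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def razdel_split(text: str) -> List[str]:
    """Split text into sentences with razdel.sentenize."""
    from razdel import sentenize  # third-party, imported lazily
    return [s.text for s in sentenize(text)]


def make_fasttext_predictor(model_path: str = "lid.176.bin") -> Callable[[str], str]:
    """Return a function mapping a sentence to a language code."""
    import fasttext  # third-party, imported lazily
    model = fasttext.load_model(model_path)

    def predict(sentence: str) -> str:
        # fastText rejects newlines and returns labels like "__label__kk".
        labels, _ = model.predict(sentence.replace("\n", " "), k=1)
        return labels[0].replace("__label__", "")

    return predict


def preprocess(text: str,
               split: Callable[[str], List[str]],
               predict_lang: Callable[[str], str],
               allowed: Collection[str] = TURKIC_CODES) -> Tuple[str, bool]:
    """Return (processed_text, processed), where the flag is True
    if at least one sentence was dropped."""
    sentences = [s for s in split(clean_text(text)) if s]
    kept = [s for s in sentences if predict_lang(s) in allowed]
    return " ".join(kept), len(kept) < len(sentences)
```

With both libraries installed and the model file downloaded, a call could look like `preprocess(raw_text, razdel_split, make_fasttext_predictor())`. Passing the two steps as arguments keeps the filtering logic testable without the model file.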
Summary: a dataset of texts labeled with the categories they belong to, intended primarily for checking text classification by language models. It covers three languages (Bashkir, Kazakh, and Kyrgyz) and falls in the 100,000 to 1,000,000 size category. Each record includes the language, title, original text, processed text, category, processing flag, source link, and publication date. The texts were parsed from news resources of the relevant regions and then preprocessed to remove sentences written in non-Turkic languages.
Provider:
Electrotubbie