tatiana-merz/cyrillic_turkic_langs

Name: tatiana-merz/cyrillic_turkic_langs
Creator: tatiana-merz
Published: 2023-03-15 19:41:05
License: no description available

Source: Hugging Face. Updated 2023-03-15. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/tatiana-merz/cyrillic_turkic_langs

Resource overview:

---
license: cc
task_categories:
- text-classification
language:
- ba
- cv
- sah
- tt
- ky
- kk
- tyv
- krc
- ru
tags:
- wiki
size_categories:
- 10K<n<100K
---

# Cyrillic dataset of 8 Turkic languages spoken in Russia and former USSR

## Dataset Description

The dataset is part of the [Leipzig Corpora (Wiki) Collection](https://corpora.uni-leipzig.de/). Russian has been included in the dataset for the text-classification comparison.

**Paper:** Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

### Dataset Summary

### Supported Tasks and Leaderboards

### Languages

- ba - Bashkir
- cv - Chuvash
- sah - Sakha
- tt - Tatar
- ky - Kyrgyz
- kk - Kazakh
- tyv - Tuvinian
- krc - Karachay-Balkar
- ru - Russian

### Data Splits

- train: Dataset({ features: ['text', 'label'], num_rows: 72000 })
- test: Dataset({ features: ['text', 'label'], num_rows: 9000 })
- validation: Dataset({ features: ['text', 'label'], num_rows: 9000 })

## Dataset Creation

[Link to the notebook](https://github.com/tatiana-merz/YakuToolkit/blob/main/CyrillicTurkicCorpus.ipynb)

### Curation Rationale

[More Information Needed]

### Source Data

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
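The card lists the splits and features but gives no loading example. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the dataset is reachable under the repository ID shown above. The helper name `load_cyrillic_turkic` and the constants are hypothetical, introduced here only for illustration; the card does not say how the `label` column is encoded, so the sketch checks column names only.

```python
# Repository ID taken from the listing; everything else here is illustrative.
DATASET_ID = "tatiana-merz/cyrillic_turkic_langs"

# Split names and feature names as stated in the card's "Data Splits" section.
SPLITS = ("train", "test", "validation")
EXPECTED_FEATURES = ["text", "label"]


def load_cyrillic_turkic():
    """Download the dataset and check that each split has the stated columns."""
    # Imported lazily so the constants above can be used without the library.
    from datasets import load_dataset

    ds = load_dataset(DATASET_ID)  # a DatasetDict keyed by split name
    for split in SPLITS:
        columns = ds[split].column_names
        if columns != EXPECTED_FEATURES:
            raise ValueError(f"{split}: unexpected columns {columns}")
    return ds
```

Calling `ds = load_cyrillic_turkic()` and then inspecting `ds["train"][0]` should return a single record as a dictionary with `text` and `label` keys, provided the hosted data still matches the card.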

Provider:

tatiana-merz

Summary of original information

Cyrillic dataset of 8 Turkic languages spoken in Russia and former USSR

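Dataset description

The dataset is part of the Leipzig Corpora (Wiki) Collection. Russian is included for the text-classification comparison.

Dataset summary

Supported tasks and leaderboards

Languages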

ba - Bashkir
cv - Chuvash
sah - Sakha
tt - Tatar
ky - Kyrgyz
kk - Kazakh
tyv - Tuvinian
krc - Karachay-Balkar
ru - Russian

Data splits

Training set: 72,000 records, with features text and label.
Test set: 9,000 records, with features text and label.
Validation set: 9,000 records, with features text and label.
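The split sizes above imply an 80/10/10 partition of 90,000 records, which is consistent with the `10K<n<100K` size category in the card. The arithmetic can be sketched as follows; the names `SPLIT_ROWS` and `split_fractions` are illustrative and do not come from the dataset or its notebook.

```python
# Row counts as stated in the data splits section.
SPLIT_ROWS = {"train": 72000, "test": 9000, "validation": 9000}


def split_fractions(rows):
    """Return each split's share of the total number of rows."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


total_rows = sum(SPLIT_ROWS.values())    # 90000
fractions = split_fractions(SPLIT_ROWS)  # train 0.8, test 0.1, validation 0.1
```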
