tatiana-merz/cyrillic_turkic_langs
Hugging Face · updated 2023-03-15 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/tatiana-merz/cyrillic_turkic_langs
Resource description:
---
license: cc
task_categories:
- text-classification
language:
- ba
- cv
- sah
- tt
- ky
- kk
- tyv
- krc
- ru
tags:
- wiki
size_categories:
- 10K<n<100K
---
# Cyrillic dataset of 8 Turkic languages spoken in Russia and the former USSR
## Dataset Description
The dataset is part of the [Leipzig Corpora (Wiki) Collection](https://corpora.uni-leipzig.de/).
Russian has been included in the dataset as a comparison class for text classification.
**Paper:**
Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.
### Dataset Summary
### Supported Tasks and Leaderboards
### Languages
- ba - Bashkir
- cv - Chuvash
- sah - Sakha
- tt - Tatar
- ky - Kyrgyz
- kk - Kazakh
- tyv - Tuvinian
- krc - Karachay-Balkar
- ru - Russian
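
The list above can be kept as a small lookup table in code. This is only a minimal sketch: the card does not say how the `label` column encodes languages (codes, names, or integer ids), so the names `LANGUAGES`, `TURKIC_CODES`, and `language_name` are hypothetical helpers, not part of the dataset.

```python
# Language codes and names exactly as listed in this dataset card.
# Assumption: the card does not document the label encoding, so this
# mapping is illustrative and may need adapting to the actual labels.
LANGUAGES = {
    "ba": "Bashkir",
    "cv": "Chuvash",
    "sah": "Sakha",
    "tt": "Tatar",
    "ky": "Kyrgyz",
    "kk": "Kazakh",
    "tyv": "Tuvinian",
    "krc": "Karachay-Balkar",
    "ru": "Russian",
}

# Russian is the non-Turkic comparison class; the other eight are Turkic.
TURKIC_CODES = [code for code in LANGUAGES if code != "ru"]


def language_name(code: str) -> str:
    """Return the language name for a code; raises KeyError if unknown."""
    return LANGUAGES[code]
```

For example, `language_name("sah")` returns `"Sakha"`.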
### Data Splits
- train: 72,000 rows, features `text` and `label`
- test: 9,000 rows, features `text` and `label`
- validation: 9,000 rows, features `text` and `label`
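
The split sizes above add up to 90,000 rows, which works out to an 80/10/10 split. The sketch below checks that arithmetic and, optionally, compares the stated sizes against a downloaded copy. It assumes the Hugging Face `datasets` library is installed and that the repository id matches the download link on this page; `SPLIT_SIZES`, `split_fractions`, and `load_and_check` are hypothetical names introduced for illustration.

```python
# Split sizes as stated in this dataset card.
SPLIT_SIZES = {"train": 72000, "test": 9000, "validation": 9000}


def split_fractions(sizes):
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_and_check(repo_id="tatiana-merz/cyrillic_turkic_langs"):
    """Download the dataset and report whether each split has the stated size.

    Requires network access and the `datasets` package, so the import is
    kept inside the function and nothing is downloaded at import time.
    """
    from datasets import load_dataset

    dataset = load_dataset(repo_id)
    return {
        name: dataset[name].num_rows == count
        for name, count in SPLIT_SIZES.items()
    }


fractions = split_fractions(SPLIT_SIZES)
```

Calling `load_and_check()` is left to the reader, since it depends on the dataset still being hosted under that id.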
## Dataset Creation
[Link to the notebook](https://github.com/tatiana-merz/YakuToolkit/blob/main/CyrillicTurkicCorpus.ipynb)
### Curation Rationale
[More Information Needed]
### Source Data
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
Provider:
tatiana-merz