
tatiana-merz/cyrillic_turkic_langs

Source: Hugging Face · Updated: 2023-03-15 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/tatiana-merz/cyrillic_turkic_langs
Resource description:
---
license: cc
task_categories:
- text-classification
language:
- ba
- cv
- sah
- tt
- ky
- kk
- tyv
- krc
- ru
tags:
- wiki
size_categories:
- 10K<n<100K
---

# Cyrillic dataset of 8 Turkic languages spoken in Russia and former USSR

## Dataset Description

The dataset is a part of the [Leipzig Corpora (Wiki) Collection](https://corpora.uni-leipzig.de/).

Russian has been included in the dataset for the text-classification comparison.

**Paper:** Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

### Dataset Summary

### Supported Tasks and Leaderboards

### Languages

- ba - Bashkir
- cv - Chuvash
- sah - Sakha
- tt - Tatar
- ky - Kyrgyz
- kk - Kazakh
- tyv - Tuvinian
- krc - Karachay-Balkar
- ru - Russian

### Data Splits

- train: Dataset({ features: ['text', 'label'], num_rows: 72000 })
- test: Dataset({ features: ['text', 'label'], num_rows: 9000 })
- validation: Dataset({ features: ['text', 'label'], num_rows: 9000 })

## Dataset Creation

[Link to the notebook](https://github.com/tatiana-merz/YakuToolkit/blob/main/CyrillicTurkicCorpus.ipynb)

### Curation Rationale

[More Information Needed]

### Source Data

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
tatiana-merz
Summary of original information

Cyrillic dataset of 8 Turkic languages spoken in Russia and former USSR

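Dataset Description

This dataset is part of the [Leipzig Corpora (Wiki) Collection]. Russian is included for the text-classification comparison.

Dataset Summary

Supported Tasks and Leaderboards

Languages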

  • ba - Bashkir
  • cv - Chuvash
  • sah - Sakha
  • tt - Tatar
  • ky - Kyrgyz
  • kk - Kazakh
  • tyv - Tuvinian
  • krc - Karachay-Balkar
  • ru - Russian

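Data Splits

  • Training set: 72,000 records with the features text and label.
  • Test set: 9,000 records with the features text and label.
  • Validation set: 9,000 records with the features text and label.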
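
The split sizes above work out to an 80/10/10 division of 90,000 rows. The sketch below checks that arithmetic and shows one way the dataset might be loaded. It assumes the Hugging Face `datasets` library is installed and that the repository `tatiana-merz/cyrillic_turkic_langs` is still hosted. The helper names `split_fractions` and `load_splits` are hypothetical and are not part of the dataset card.

```python
# Split sizes as stated in the dataset card.
SPLIT_SIZES = {"train": 72000, "test": 9000, "validation": 9000}

# Language codes as listed in the card: 8 Turkic languages plus Russian.
LANGUAGES = {
    "ba": "Bashkir",
    "cv": "Chuvash",
    "sah": "Sakha",
    "tt": "Tatar",
    "ky": "Kyrgyz",
    "kk": "Kazakh",
    "tyv": "Tuvinian",
    "krc": "Karachay-Balkar",
    "ru": "Russian",
}


def split_fractions(sizes):
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_splits(repo_id="tatiana-merz/cyrillic_turkic_langs"):
    """Download the dataset from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed. It is imported lazily
    so the rest of this sketch runs without it.
    """
    from datasets import load_dataset

    return load_dataset(repo_id)


if __name__ == "__main__":
    for name, share in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {SPLIT_SIZES[name]} rows ({share:.0%})")
```

Calling `load_splits()` should return an object with `train`, `test`, and `validation` splits if the repository matches the card. The encoding of the `label` column is not documented in the card, so inspect it before mapping labels to the language codes above.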