
yasalma/tt-books-cyrillic

Hugging Face · Updated 2024-12-07 · Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/yasalma/tt-books-cyrillic
Resource description:
---
license: mit
language:
- tt
tags:
- tt
- tatar
- books
- monocorpus
pretty_name: Collection of books in Tatar language in Cyrillic script
---

# Tatar Books Collection (Cyrillic) 📚

This dataset, hosted by [Yasalma](https://huggingface.co/neurotatarlar), is a curated collection of 497 Tatar books in Parquet format. The texts are in Cyrillic script, which makes the dataset suitable for linguistic research, language modeling, and other NLP applications in the Tatar language.

## Dataset Details

- **Language**: Tatar (Cyrillic script)
- **Format**: Two Parquet files
  - Original text
  - Markdown-formatted text
- **Columns**:
  - train-00000-of-00001.parquet:
    - `file_name`: The original name of each book’s file
    - `text`: The full content of each book in raw text
  - lib-books.parquet:
    - `text`: The full content of each book in Markdown format
- **Important Note**: The books in the two files do not overlap; they are entirely distinct collections.
- **Total Number of Books**: 497
- **Total Size**: 180 MB
- **License**: MIT

### Structure

The dataset is organized as follows:

- **train-00000-of-00001.parquet**: Each row represents an individual Tatar book, with columns for the book’s filename (`file_name`) and its content in raw text (`text`).
- **lib-books.parquet**: Each row represents an individual Tatar book, with a single column (`text`) holding its content in Markdown format. All links to images have been removed from the Markdown text to ensure compatibility and simplify processing.

## Potential Use Cases

- **Language Modeling**: Train language models specifically for Tatar in Cyrillic script.
- **Markdown Processing**: Use Markdown-formatted text for specific NLP applications, such as HTML rendering or structured content analysis.
- **Machine Translation**: Use the dataset for translation tasks.
- **Linguistic Research**: Study linguistic structures, grammar, and vocabulary in Tatar.

## Usage

To load the dataset using Hugging Face’s `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("neurotatarlar/tt-books-cyrillic")
```

## Contributions and Acknowledgements

This dataset is maintained by the Yasalma team. Contributions, feedback, and suggestions are welcome to improve and expand the dataset.
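
Because the two Parquet files described under Dataset Details have different columns (`file_name` and `text` in one, only `text` in the other), it can help to bring their rows into one schema before further processing. The following is a minimal sketch under stated assumptions, not part of the dataset's own tooling: the helper `unify_rows`, the `source` field, and the stand-in rows are hypothetical and only illustrate the idea.

```python
from typing import Iterable, Iterator, Mapping, Optional, TypedDict


class BookRecord(TypedDict):
    source: str                 # hypothetical field: which Parquet file the row came from
    file_name: Optional[str]    # None for rows from lib-books.parquet, which has no such column
    text: str


def unify_rows(rows: Iterable[Mapping[str, str]], source: str) -> Iterator[BookRecord]:
    """Yield rows in a common schema, filling a missing `file_name` with None."""
    for row in rows:
        yield BookRecord(source=source, file_name=row.get("file_name"), text=row["text"])


# Stand-in rows that mimic the documented columns; these are not real books.
raw_rows = [{"file_name": "book_a.txt", "text": "Беренче китап."}]
markdown_rows = [{"text": "# Икенче китап\n\nТекст."}]

corpus = [
    *unify_rows(raw_rows, "train-00000-of-00001.parquet"),
    *unify_rows(markdown_rows, "lib-books.parquet"),
]

for record in corpus:
    print(record["source"], record["file_name"], len(record["text"]))
```

With real data, the stand-in lists could be replaced by any iterable of row dictionaries, for example the rows obtained after loading each Parquet file separately. Keeping the `source` field makes it possible to treat raw and Markdown-formatted books differently later on.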
Provider:
yasalma