tahrirchi/uz-books-v2

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/tahrirchi/uz-books-v2
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: lat
    num_bytes: 12629576314
    num_examples: 38339
  - name: cyr
    num_bytes: 21037367671
    num_examples: 38339
  download_size: 16039208335
  dataset_size: 33666943985
configs:
- config_name: default
  data_files:
  - split: lat
    path: data/lat-*
  - split: cyr
    path: data/cyr-*
license: mit
task_categories:
- text-generation
- fill-mask
language:
- uz
tags:
- uz-books
- uz
- books
pretty_name: UzBookv2
size_categories:
- 10K<n<100K
---

# Dataset Card for UzBooks V2

## Dataset Summary

UzBooks V2 is an improved version of the [UzBooks](https://huggingface.co/datasets/tahrirchi/uz-books) book corpus for the Uzbek language. It contains nearly **40,000 books** in two splits:

| Split | Description | Examples |
|-------|-------------|----------|
| **lat** | Fully Latin-transliterated version | 38,339 |
| **cyr** | Fully Cyrillic-transliterated version | 38,339 |

### What's New in V2?

- **OCR Engine Upgrade**: Switched from **Tesseract** to **Google Cloud Vision OCR**
- **Cleaner Text**: Google OCR produces far fewer recognition errors, especially for mixed-script content
- **Same Structure & Size**: Maintains compatibility with v1, with the same splits and the same number of examples

## Usage

```python
from datasets import load_dataset

uz_books2 = load_dataset("tahrirchi/uz-books-v2")

# Access the Latin version
print(uz_books2["lat"][0]["text"])

# Access the Cyrillic version
print(uz_books2["cyr"][0]["text"])
```

## Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | `string` | Full text content of the book |

## Dataset Creation

Books were collected from various public sources and processed using **Google Cloud Vision OCR**, which delivers substantially better accuracy than Tesseract for Uzbek text, particularly when Latin and Cyrillic scripts coexist in the same book. The `lat` and `cyr` splits were then generated using curated transliteration scripts.
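Since each split is described as fully transliterated into one script, a quick sanity check is to measure the share of Latin and Cyrillic letters in a sample of text. The sketch below is illustrative only and is not part of the dataset's own tooling: `script_fractions` and `preview_split` are hypothetical helper names, and the 10,000-character sample size is an arbitrary assumption. It uses streaming mode so that the roughly 16 GB download is not needed just to inspect a few books.

```python
import unicodedata


def script_fractions(text: str) -> dict:
    """Return the share of Latin, Cyrillic, and other letters in `text`."""
    counts = {"latin": 0, "cyrillic": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["latin"] += 1
        elif name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        else:
            # e.g. modifier letters such as U+02BB used in Uzbek Latin text
            counts["other"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def preview_split(split: str, n_books: int = 3) -> None:
    """Stream a few books from one split and print their script mix."""
    # Imported lazily: requires the `datasets` package and network access.
    from datasets import load_dataset

    stream = load_dataset("tahrirchi/uz-books-v2", split=split, streaming=True)
    for i, record in enumerate(stream.take(n_books)):
        # Only the first 10,000 characters are sampled (arbitrary choice).
        print(split, i, script_fractions(record["text"][:10_000]))
```

Calling `preview_split("lat")` or `preview_split("cyr")` prints one line per sampled book; a split that is cleanly transliterated would be expected to show a high fraction for its own script.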
## Citation

```bibtex
@online{Mamasaidov2024UzBooksV2,
    author = {Mukhammadsaid Mamasaidov and Abror Shopulatov},
    title = {UzBooks V2 dataset},
    year = {2026},
    url = {https://huggingface.co/datasets/tahrirchi/uz-books-v2}
}
```

## Contacts

We believe that this work will enable and inspire all enthusiasts around the world to open the hidden beauty of low-resource languages, in particular Uzbek.

For questions or issues:

- m.mamasaidov@tahrirchi.uz
- a.shopolatov@tahrirchi.uz
Provider:
tahrirchi