tahrirchi/uz-books-v2

Name: tahrirchi/uz-books-v2
Creator: tahrirchi
Published: 2026-04-09 14:33:23
License: MIT

Source: Hugging Face · Updated: 2026-04-09 · Indexed: 2026-05-10

Download link:

https://hf-mirror.com/datasets/tahrirchi/uz-books-v2

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: lat
    num_bytes: 12629576314
    num_examples: 38339
  - name: cyr
    num_bytes: 21037367671
    num_examples: 38339
  download_size: 16039208335
  dataset_size: 33666943985
configs:
- config_name: default
  data_files:
  - split: lat
    path: data/lat-*
  - split: cyr
    path: data/cyr-*
license: mit
task_categories:
- text-generation
- fill-mask
language:
- uz
tags:
- uz-books
- uz
- books
pretty_name: UzBookv2
size_categories:
- 10K<n<100K
---

# Dataset Card for UzBooks V2

## Dataset Summary

UzBooks V2 is an improved version of the [UzBooks](https://huggingface.co/datasets/tahrirchi/uz-books) book corpus for the Uzbek language. It contains nearly **40,000 books** in two splits:

| Split | Description | Examples |
|-------|-------------|----------|
| **lat** | Fully Latin-transliterated version | 38,339 |
| **cyr** | Fully Cyrillic-transliterated version | 38,339 |

### What's New in V2?

- **OCR Engine Upgrade**: Switched from **Tesseract** to **Google Cloud Vision OCR**
- **Cleaner Text**: Google OCR produces far fewer recognition errors, especially for mixed-script content
- **Same Structure & Size**: Maintains compatibility with v1, with the same splits and the same number of examples

## Usage

```python
from datasets import load_dataset

uz_books2 = load_dataset("tahrirchi/uz-books-v2")

# Access Latin version
print(uz_books2["lat"][0]["text"])

# Access Cyrillic version
print(uz_books2["cyr"][0]["text"])
```

## Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | `string` | Full text content of the book |

## Dataset Creation

Books were collected from various public sources and processed using **Google Cloud Vision OCR**, which delivers substantially better accuracy than Tesseract for Uzbek text, particularly where Latin and Cyrillic scripts coexist. The `lat` and `cyr` splits were then generated using curated transliteration scripts.
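The card metadata lists a download size of about 16 GB, so it may be convenient to stream a split rather than download everything before reading the first book. The following is a minimal sketch, not part of the original card: `preview_books` and `stream_split` are hypothetical helper names, and the sketch assumes the `datasets` library's `streaming=True` option works for this repository as it does for other Hub datasets.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def preview_books(
    rows: Iterable[Dict[str, str]], n_books: int = 3, n_chars: int = 200
) -> Iterator[str]:
    """Yield the first `n_chars` characters of the first `n_books` rows."""
    for row in islice(rows, n_books):
        # Each row is expected to carry the single `text` field described above.
        yield row["text"][:n_chars]


def stream_split(split: str = "lat"):
    """Open one split lazily instead of downloading all files first."""
    # Imported here so preview_books can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("tahrirchi/uz-books-v2", split=split, streaming=True)


# Example (requires network access):
# for snippet in preview_books(stream_split("cyr")):
#     print(snippet)
```

Because `preview_books` accepts any iterable of row dictionaries, the same helper also works on a fully downloaded split.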
## Citation

```bibtex
@online{Mamasaidov2024UzBooksV2,
    author = {Mukhammadsaid Mamasaidov and Abror Shopulatov},
    title = {UzBooks V2 dataset},
    year = {2026},
    url = {https://huggingface.co/datasets/tahrirchi/uz-books-v2}
}
```

## Contacts

We believe that this work will enable and inspire all enthusiasts around the world to open the hidden beauty of low-resource languages, in particular Uzbek.

For questions or issues:

- m.mamasaidov@tahrirchi.uz
- a.shopolatov@tahrirchi.uz

Provider:

tahrirchi
