tahrirchi/uz-books-v2
Source: Hugging Face. Updated 2026-04-09; indexed 2026-05-10.
Download link: https://hf-mirror.com/datasets/tahrirchi/uz-books-v2
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: lat
    num_bytes: 12629576314
    num_examples: 38339
  - name: cyr
    num_bytes: 21037367671
    num_examples: 38339
  download_size: 16039208335
  dataset_size: 33666943985
configs:
- config_name: default
  data_files:
  - split: lat
    path: data/lat-*
  - split: cyr
    path: data/cyr-*
license: mit
task_categories:
- text-generation
- fill-mask
language:
- uz
tags:
- uz-books
- uz
- books
pretty_name: UzBookv2
size_categories:
- 10K<n<100K
---
# Dataset Card for UzBooks V2
## Dataset Summary
UzBooks V2 is an improved version of the [UzBooks](https://huggingface.co/datasets/tahrirchi/uz-books) book corpus for the Uzbek language. It contains nearly **40,000 books** (38,339 per split) in two splits:
| Split | Description | Examples |
|-------|-------------|----------|
| **lat** | Fully Latin-transliterated version | 38,339 |
| **cyr** | Fully Cyrillic-transliterated version | 38,339 |
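
The split sizes declared in the card metadata can be sanity-checked with a little arithmetic. The sketch below copies the figures from the metadata block; the sample word `kitob` is chosen only for illustration. A plausible reason the `cyr` split is larger is that Cyrillic letters occupy two bytes each in UTF-8 while basic Latin letters occupy one. This is an assumption about the cause, not a statement from the authors.

```python
# Figures copied from the card's metadata block.
SPLITS = {
    "lat": {"num_bytes": 12_629_576_314, "num_examples": 38_339},
    "cyr": {"num_bytes": 21_037_367_671, "num_examples": 38_339},
}
DATASET_SIZE = 33_666_943_985

# The two splits should add up to the declared dataset size.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
assert total_bytes == DATASET_SIZE

# Average size of one book in each split.
for name, s in SPLITS.items():
    avg_kb = s["num_bytes"] / s["num_examples"] / 1_000
    print(f"{name}: about {avg_kb:.0f} kB per book")

# How much larger the Cyrillic split is than the Latin one.
ratio = SPLITS["cyr"]["num_bytes"] / SPLITS["lat"]["num_bytes"]
print(f"cyr/lat size ratio: {ratio:.2f}")

# UTF-8 illustration with a sample word (chosen for this example only).
latin_len = len("kitob".encode("utf-8"))     # one byte per basic Latin letter
cyrillic_len = len("китоб".encode("utf-8"))  # two bytes per Cyrillic letter
print(latin_len, cyrillic_len)
```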
### What's New in V2?
- **OCR Engine Upgrade**: Switched from **Tesseract** to **Google Cloud Vision OCR**
- **Cleaner Text**: Google Cloud Vision OCR produces far fewer recognition errors than Tesseract, especially for mixed-script content
- **Same Structure & Size**: Maintains compatibility with v1: same splits, same number of examples
## Usage
```python
from datasets import load_dataset
uz_books2 = load_dataset("tahrirchi/uz-books-v2")
# Access Latin version
print(uz_books2["lat"][0]["text"])
# Access Cyrillic version
print(uz_books2["cyr"][0]["text"])
```
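
Because the full download is about 16 GB, it can be convenient to peek at a few books without fetching everything. The following is a minimal sketch that assumes the `streaming=True` option of `load_dataset`; the helper names `stream_split` and `preview` are hypothetical and not part of the dataset or the library.

```python
from itertools import islice


def stream_split(split_name):
    """Return a lazily streamed split ("lat" or "cyr") instead of downloading it."""
    # Imported lazily so `preview` below can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("tahrirchi/uz-books-v2", split=split_name, streaming=True)


def preview(rows, n=3, max_chars=200):
    """Return the first `max_chars` characters of the first `n` books."""
    return [row["text"][:max_chars] for row in islice(rows, n)]


# Example (needs network access):
#     for snippet in preview(stream_split("lat")):
#         print(snippet)
```

`preview` accepts any iterable of records with a `text` field, so it works equally on a streamed split and on a regular one.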
## Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `text` | `string` | Full text content of the book |
## Dataset Creation
Books were collected from various public sources and processed using **Google Cloud Vision OCR**, which delivers substantially better accuracy than Tesseract for Uzbek text, particularly where Latin and Cyrillic scripts coexist. The `lat` and `cyr` splits were then generated using curated transliteration scripts.
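
To give a rough idea of what character-level transliteration involves, here is a toy sketch. It is not the authors' script: the mapping table `CYR_TO_LAT` and the function `to_latin` are hypothetical, the table is partial, and context-dependent rules (for example, word-initial `е`) are deliberately ignored.

```python
# Hypothetical, partial Cyrillic-to-Latin table for Uzbek (NOT the authors' script).
# "\u02bb" is the modifier letter turned comma used in the Latin letters oʻ and gʻ.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ч": "ch", "ш": "sh", "э": "e", "ю": "yu", "я": "ya",
    "ў": "o\u02bb", "қ": "q", "ғ": "g\u02bb", "ҳ": "h",
}


def to_latin(text: str) -> str:
    """Map each known Cyrillic letter to Latin; leave everything else unchanged."""
    out = []
    for ch in text:
        latin = CYR_TO_LAT.get(ch.lower())
        if latin is None:
            out.append(ch)  # digits, punctuation, Latin text, unmapped letters
        elif ch.isupper():
            out.append(latin.capitalize())  # e.g. "Ч" -> "Ch"
        else:
            out.append(latin)
    return "".join(out)


print(to_latin("китоб"))
print(to_latin("Ўзбек"))
```

A production transliterator would need positional rules, exception lists, and handling of mixed-script words, which is presumably what "curated" refers to above.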
## Citation
```bibtex
@online{Mamasaidov2024UzBooksV2,
author = {Mukhammadsaid Mamasaidov and Abror Shopulatov},
title = {UzBooks V2 dataset},
year = {2026},
url = {https://huggingface.co/datasets/tahrirchi/uz-books-v2}
}
```
## Contacts
We believe this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, Uzbek in particular.
For questions or issues:
- m.mamasaidov@tahrirchi.uz
- a.shopolatov@tahrirchi.uz
Provider: tahrirchi



