aslon1213/uzbek-language-corpus
Source: Hugging Face. Updated 2025-12-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/aslon1213/uzbek-language-corpus
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: chars
    dtype: int64
  - name: words
    dtype: int64
  splits:
  - name: train
    num_bytes: 273029022
    num_examples: 1212450
  download_size: 156336102
  dataset_size: 273029022
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Uzbek Language Corpus
This dataset is a processed and filtered version of the xkas2001/uzbek-language-dataset corpus, containing 1,212,450 Uzbek text samples.
## Dataset Description
This corpus contains Uzbek text data that has been cleaned and processed for natural language processing tasks. The dataset includes text samples with their corresponding character and word counts.
## Dataset Statistics
- **Total samples**: 1,212,450
- **Features**:
  - `text`: The Uzbek text content
  - `chars`: Number of characters in the text
  - `words`: Number of words in the text
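The precomputed `chars` and `words` columns make it cheap to select samples by length without re-tokenizing the text. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed. The helper names `load_corpus` and `select_by_length` and the threshold values are illustrative and not part of the dataset.

```python
def load_corpus(streaming=True):
    """Load the train split (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the helper below works offline
    return load_dataset(
        "aslon1213/uzbek-language-corpus", split="train", streaming=streaming
    )


def select_by_length(records, min_words=5, max_chars=500):
    """Yield records whose precomputed counts fall inside the given bounds.

    `records` can be any iterable of dicts with `text`, `chars`, and `words`
    keys, such as the dataset returned by `load_corpus()`.
    The default thresholds are arbitrary examples.
    """
    for rec in records:
        if rec["words"] >= min_words and rec["chars"] <= max_chars:
            yield rec
```

For example, `next(select_by_length(load_corpus()))` would stream records until it finds the first one within the bounds, without downloading the full split.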
## Data Processing
The dataset has been processed with the following steps:
1. Text extraction and cleaning
2. Character and word count calculation
3. Pattern filtering: entries matching the repetitive "{word} va {word}" pattern were removed ("va" is Uzbek for "and")
4. Quality filtering to ensure meaningful text content
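The steps above can be sketched roughly as follows. This is not the author's actual pipeline, because the card does not give the exact cleaning rules, counting method, or quality criteria. The sketch assumes whitespace normalization for cleaning, `len()` and whitespace splitting for the counts, a whole-entry regex for the "{word} va {word}" pattern, and a hypothetical minimum word count as the quality filter.

```python
import re

# Assumption: an entry matches "{word} va {word}" when the whole text is
# exactly two whitespace-free tokens joined by "va".
VA_PAIR = re.compile(r"\S+\s+va\s+\S+")


def add_counts(text):
    """Build a record with the card's three fields (counting rules assumed)."""
    cleaned = " ".join(text.split())  # trim and collapse runs of whitespace
    return {"text": cleaned, "chars": len(cleaned), "words": len(cleaned.split())}


def keep(record, min_words=3):
    """Hypothetical filter: drop 'X va Y' entries and very short texts."""
    if VA_PAIR.fullmatch(record["text"]):
        return False
    return record["words"] >= min_words


def process(raw_texts, min_words=3):
    """Apply counting, then pattern and quality filtering, to raw strings."""
    records = (add_counts(t) for t in raw_texts)
    return [r for r in records if keep(r, min_words)]
```

Calling `process(["olma va anor", "  Bugun havo   juda yaxshi. ", "Salom"])` keeps only the second entry: the first is removed by the pattern filter and the third by the assumed length threshold.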
## Source Dataset
This dataset is derived from the original Uzbek language dataset available at:
[hf.co/xkas2001/uzbek-language-dataset](https://huggingface.co/datasets/xkas2001/uzbek-language-dataset)
Provider:
aslon1213



