podarok/kobza-cleaned-ua
Hugging Face. Updated 2025-12-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/podarok/kobza-cleaned-ua
Resource description:
---
license: cc-by-4.0
language:
- uk
size_categories:
- 1M<n<10M
task_categories:
- text-generation
tags:
- ukrainian
- web-crawl
- cleaned
- language-modeling
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: length
    dtype: int64
  splits:
  - name: train
    num_bytes: 536853186
    num_examples: 1812460
  download_size: 248201329
  dataset_size: 536853186
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# kobza-cleaned-ua
Cleaned Ukrainian-language subset of the [Goader/kobza](https://huggingface.co/datasets/Goader/kobza) dataset, with Russian content filtered out.
## Dataset Details
### Dataset Description
This dataset is a cleaned and filtered version of the Goader/kobza corpus. Russian-language content has been removed to produce a Ukrainian-only dataset suitable for training language models.
The original kobza dataset contains ~60B tokens across 97 million documents. After removing Russian content detected through character patterns, word lists, and grammatical structures, the cleaning retains ~59B tokens (98.5% retention). The `train` split published here contains 1,812,460 documents (536,853,186 bytes).
- **Curated by:** podarok
- **Language(s):** Ukrainian (uk)
- **License:** CC-BY-4.0
### Dataset Sources
- **Repository:** https://github.com/podarok/ua_ai_v2
- **Parent Dataset:** [Goader/kobza](https://huggingface.co/datasets/Goader/kobza)
## Uses
### Direct Use
This dataset is intended for:
- Pre-training Ukrainian language models
- Fine-tuning multilingual models on Ukrainian text
- Ukrainian NLP research and development
- Training text generation models
### Out-of-Scope Use
This dataset should not be used for:
- Applications requiring guaranteed absence of any Russian text (some edge cases may remain)
- Applications requiring quality scoring (no quality scores provided)
- Real-time applications (streaming recommended for large-scale use)
## Dataset Structure
The dataset contains 1,812,460 documents with the following schema:
```python
{
    "text": str,    # Document text
    "source": str,  # One of: hplt-2.0, fineweb-2, cultura-x, ubertext2.0, ukrainian-news
    "length": int,  # Character count
}
```
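A minimal sketch of checking a record against the schema above. The names `KNOWN_SOURCES` and `validate_record` are hypothetical and not part of the dataset tooling. The check that `length` equals `len(text)` assumes "character count" means Python code points, which the card does not confirm.

```python
# Source labels listed in the schema above.
KNOWN_SOURCES = {"hplt-2.0", "fineweb-2", "cultura-x", "ubertext2.0", "ukrainian-news"}


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []
    text = record.get("text")
    source = record.get("source")
    length = record.get("length")

    if not isinstance(text, str):
        problems.append("text is not a string")
    if source not in KNOWN_SOURCES:
        problems.append(f"unknown source: {source!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(length, int) or isinstance(length, bool):
        problems.append("length is not an integer")
    elif isinstance(text, str) and length != len(text):
        # Assumption: `length` counts code points, i.e. len(text).
        problems.append("length does not match len(text)")
    return problems
```

Running this over a small sample first is a cheap way to learn whether the `length` assumption holds before relying on that column.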
### Source Distribution
| Source | Documents | Size (MB) | Russian Removed |
|--------|-----------|-----------|-----------------|
| hplt-2.0 | 641,685 | 160.6 | 1.77% |
| fineweb-2 | 539,769 | 154.0 | 1.77% |
| cultura-x | 383,117 | 119.6 | 2.23% |
| ubertext2.0 | 192,010 | 21.6 | 0.20% |
| ukrainian-news | 55,879 | 14.8 | 0.88% |
| **Total** | **1,812,460** | **470.6** | **~1.5%** |
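The document counts in the table above could be recomputed from the records themselves. The sketch below is one way to do that; `source_distribution` is a hypothetical helper, and it sums the `length` column (characters), which is not the same unit as the table's size in MB.

```python
from collections import Counter
from typing import Iterable, Mapping


def source_distribution(records: Iterable[Mapping]) -> dict:
    """Count documents and total characters per `source` value."""
    doc_counts = Counter()
    char_counts = Counter()
    for rec in records:
        doc_counts[rec["source"]] += 1
        char_counts[rec["source"]] += rec["length"]

    total_docs = sum(doc_counts.values())
    # Sources are returned from most to least frequent.
    return {
        src: {
            "documents": n,
            "share": n / total_docs,
            "characters": char_counts[src],
        }
        for src, n in doc_counts.most_common()
    }
```

Because the function accepts any iterable of records, it should also work on a dataset loaded with `streaming=True`, at the cost of one full pass over the data.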
## Dataset Creation
### Curation Rationale
The original Goader/kobza dataset, while being the largest Ukrainian corpus, contains some Russian language content due to:
1. Mixed-language websites
2. Code-switching in web content
3. Multilingual web crawls
This cleaned version was created to provide a higher-quality, Ukrainian-only corpus for language model training.
### Source Data
#### Data Collection and Processing
The source data comes from the [Goader/kobza](https://huggingface.co/datasets/Goader/kobza) dataset, which aggregates text from:
- **hplt-2.0**: High-quality web crawl data
- **fineweb-2**: Curated web content
- **cultura-x**: Multilingual web corpus built from cleaned mC4 and OSCAR data
- **ubertext2.0**: Ukrainian language corpus
- **ukrainian-news**: News articles
**Filtering Process:**
1. **Russian-only characters**: Detection of Cyrillic letters that Russian uses and the Ukrainian alphabet lacks (ы, э, ё, Э, Ы, Ё)
2. **Russian word patterns**: Removal of common Russian-only words (и, что, да, можно, etc.)
3. **Mixed-language detection**: Filtering of sentences with Russian grammatical patterns
4. **Line-by-line filtering**: Each line is evaluated independently
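The character, word-list, and line-by-line steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual filter: the function names are hypothetical, the word list contains only the four examples named above, a single match is assumed to be enough to drop a line, and the grammatical-pattern step is not implemented.

```python
import re

# Letters used in Russian and absent from the Ukrainian alphabet (from the list above).
RUSSIAN_ONLY_CHARS = set("ыэёЫЭЁ")

# Only the four example words given above; the real list is presumably longer.
RUSSIAN_WORDS = {"и", "что", "да", "можно"}

# Runs of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def looks_russian(line: str) -> bool:
    """Assumption: one Russian-only letter or listed word flags the whole line."""
    if any(ch in RUSSIAN_ONLY_CHARS for ch in line):
        return True
    words = (w.lower() for w in WORD_RE.findall(line))
    return any(w in RUSSIAN_WORDS for w in words)


def clean_document(text: str) -> str:
    """Evaluate each line independently and keep only the lines not flagged."""
    kept = [line for line in text.splitlines() if not looks_russian(line)]
    return "\n".join(kept)
```

A single-match rule is deliberately strict and will also drop Ukrainian lines that quote a Russian word; a real filter might require several matches per line.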
#### Who are the source data producers?
The source data was originally curated by [Goader](https://huggingface.co/Goader) from various web sources. The cleaning and filtering were performed by the ua_ai_v2 project team.
### Annotations
This dataset does not contain annotations beyond the source metadata.
#### Personal and Sensitive Information
The dataset inherits any personal or sensitive information present in the original Goader/kobza dataset. Users should refer to the [original dataset's documentation](https://huggingface.co/datasets/Goader/kobza) for details on privacy considerations.
## Bias, Risks, and Limitations
- **Incomplete filtering**: Some Russian text may remain due to edge cases or similar words between languages
- **No quality scores**: Unlike some corpora, this dataset does not include document-level quality scores
- **Source bias**: Inherits any biases present in the original web crawl sources
- **Temporal bias**: Reflects web content from the time period of the original crawl
- **Domain distribution**: Web content is heavily represented compared to other text types
### Recommendations
Users should:
- Validate the dataset's suitability for their specific use case
- Consider combining with other Ukrainian corpora for better domain coverage
- Apply additional quality filtering if needed for production use
- Be aware that some Russian content may remain in edge cases
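
As one possible form of the additional quality filtering recommended above, the sketch below builds a predicate that rejects very short or very long documents and documents whose letters are mostly non-Cyrillic. `make_quality_filter` and all threshold values are hypothetical and should be tuned per use case.

```python
def make_quality_filter(
    min_chars: int = 200,
    max_chars: int = 100_000,
    min_cyrillic_ratio: float = 0.5,
):
    """Build a predicate suitable for `dataset.filter`. All thresholds are illustrative."""

    def keep(example: dict) -> bool:
        text = example["text"]
        if not (min_chars <= len(text) <= max_chars):
            return False
        letters = [ch for ch in text if ch.isalpha()]
        if not letters:
            return False
        # Basic Cyrillic block U+0400..U+04FF; covers the Ukrainian alphabet.
        cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
        return cyrillic / len(letters) >= min_cyrillic_ratio

    return keep
```

The returned function takes one record and returns a boolean, so it can be passed directly as `dataset.filter(make_quality_filter())`.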
## Usage
### Basic Loading
```python
from datasets import load_dataset
dataset = load_dataset("podarok/kobza-cleaned-ua", split="train")
print(f"Total documents: {len(dataset):,}")
```
### Streaming Mode (Recommended for Large-Scale Training)
```python
from datasets import load_dataset

# Stream records lazily instead of downloading the whole split first.
dataset = load_dataset("podarok/kobza-cleaned-ua", split="train", streaming=True)

for doc in dataset:
    print(doc["text"])
```
### Filter by Source
```python
# Get only news content
news_dataset = dataset.filter(lambda x: x["source"] == "ukrainian-news")

# Get multiple sources
web_dataset = dataset.filter(
    lambda x: x["source"] in ["hplt-2.0", "fineweb-2"]
)
```
## Citation
### Original Dataset
```bibtex
@misc{kobza2024,
title={Kobza: Ukrainian Language Corpus},
author={Goader},
year={2024},
url={https://huggingface.co/datasets/Goader/kobza}
}
```
### This Dataset
If you use this cleaned version, please cite the original kobza dataset and reference this filtered version.
## Dataset Card Authors
- podarok (cleaning and curation)
## Dataset Card Contact
For questions or issues, please open an issue at https://github.com/podarok/ua_ai_v2
Provider:
podarok



