
podarok/kobza-cleaned-ua

Hugging Face. Updated 2025-12-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/podarok/kobza-cleaned-ua
Resource description:
---
license: cc-by-4.0
language:
- uk
size_categories:
- 1M<n<10M
task_categories:
- text-generation
tags:
- ukrainian
- web-crawl
- cleaned
- language-modeling
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: length
    dtype: int64
  splits:
  - name: train
    num_bytes: 536853186
    num_examples: 1812460
  download_size: 248201329
  dataset_size: 536853186
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# kobza-cleaned-ua

Cleaned Ukrainian-language subset of the [Goader/kobza](https://huggingface.co/datasets/Goader/kobza) dataset with Russian content filtered out.

## Dataset Details

### Dataset Description

This dataset is a cleaned and filtered version of the Goader/kobza corpus. Russian-language content has been removed to produce a Ukrainian-only dataset suitable for training language models.

The original kobza dataset contains ~60B tokens across 97 million documents. This cleaned version maintains ~59B tokens (98.5% retention) after removing Russian content detected through character patterns, word lists, and grammatical structures.

- **Curated by:** podarok
- **Language(s):** Ukrainian (uk)
- **License:** CC-BY-4.0

### Dataset Sources

- **Repository:** https://github.com/podarok/ua_ai_v2
- **Parent Dataset:** [Goader/kobza](https://huggingface.co/datasets/Goader/kobza)

## Uses

### Direct Use

This dataset is intended for:

- Pre-training Ukrainian language models
- Fine-tuning multilingual models on Ukrainian text
- Ukrainian NLP research and development
- Training text generation models

### Out-of-Scope Use

This dataset should not be used for:

- Applications requiring guaranteed absence of any Russian text (some edge cases may remain)
- Applications requiring quality scoring (no quality scores provided)
- Real-time applications (streaming recommended for large-scale use)

## Dataset Structure

The dataset contains 1,812,460 documents with the following schema:

```python
{
    "text": str,    # Document text
    "source": str,  # One of: hplt-2.0, fineweb-2, cultura-x, ubertext2.0, ukrainian-news
    "length": int   # Character count
}
```

### Source Distribution

| Source | Documents | Size (MB) | Russian Removed |
|--------|-----------|-----------|-----------------|
| hplt-2.0 | 641,685 | 160.6 | 1.77% |
| fineweb-2 | 539,769 | 154.0 | 1.77% |
| cultura-x | 383,117 | 119.6 | 2.23% |
| ubertext2.0 | 192,010 | 21.6 | 0.20% |
| ukrainian-news | 55,879 | 14.8 | 0.88% |
| **Total** | **1,812,460** | **470.6** | **~1.5%** |

## Dataset Creation

### Curation Rationale

The original Goader/kobza dataset, while being the largest Ukrainian corpus, contains some Russian language content due to:

1. Mixed-language websites
2. Code-switching in web content
3. Multilingual web crawls

This cleaned version was created to provide a higher-quality, Ukrainian-only corpus for language model training.

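The per-source counts in the Source Distribution table can be recomputed from the `source` field of each record. The following is a minimal sketch, not part of the original pipeline: `tally_sources` and the sample records are hypothetical, and it assumes that "character count" for `length` means Python's `len(text)`, which the card does not state explicitly.

```python
from collections import Counter
from typing import Iterable, Mapping

# Source labels listed in the schema comment of this card.
KNOWN_SOURCES = {"hplt-2.0", "fineweb-2", "cultura-x", "ubertext2.0", "ukrainian-news"}


def tally_sources(docs: Iterable[Mapping]) -> Counter:
    """Count documents per `source`, checking each record against the card's schema."""
    counts: Counter = Counter()
    for doc in docs:
        if doc["source"] not in KNOWN_SOURCES:
            raise ValueError(f"unexpected source: {doc['source']!r}")
        # Assumption: `length` equals len(text) measured in Unicode code points.
        if doc["length"] != len(doc["text"]):
            raise ValueError("length does not match len(text)")
        counts[doc["source"]] += 1
    return counts


# Hypothetical records for illustration only; not taken from the dataset.
sample = [
    {"text": "Добрий день", "source": "ukrainian-news", "length": 11},
    {"text": "Київ", "source": "fineweb-2", "length": 4},
    {"text": "Слава", "source": "fineweb-2", "length": 5},
]
counts = tally_sources(sample)
```

Because `tally_sources` accepts any iterable of records, it should also work on a dataset opened in streaming mode, at the cost of one full pass over the data.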
### Source Data

#### Data Collection and Processing

The source data comes from the [Goader/kobza](https://huggingface.co/datasets/Goader/kobza) dataset, which aggregates text from:

- **hplt-2.0**: High-quality web crawl data
- **fineweb-2**: Curated web content
- **cultura-x**: Cultural and literary texts
- **ubertext2.0**: Ukrainian language corpus
- **ukrainian-news**: News articles

**Filtering Process:**

1. **Russian-only characters**: Detection of Cyrillic characters exclusive to Russian (ы, э, ё, Э, Ы, Ё)
2. **Russian word patterns**: Removal of common Russian-only words (и, что, да, можно, etc.)
3. **Mixed-language detection**: Filtering sentences with Russian grammatical patterns
4. **Line-by-line filtering**: Each line evaluated independently

#### Who are the source data producers?

The source data was originally curated by [Goader](https://huggingface.co/Goader) from various web sources. The cleaning and filtering were performed by the ua_ai_v2 project team.

### Annotations

This dataset does not contain annotations beyond the source metadata.

#### Personal and Sensitive Information

The dataset inherits any personal or sensitive information present in the original Goader/kobza dataset. Users should refer to the [original dataset's documentation](https://huggingface.co/datasets/Goader/kobza) for details on privacy considerations.

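To make the filtering process listed under Data Collection and Processing more concrete, here is a rough sketch of steps 1, 2, and 4 (character check, word check, line-by-line evaluation). It is an illustration under stated assumptions, not the ua_ai_v2 implementation: the function names are hypothetical, the word list contains only the four examples named in this card, and step 3 (grammatical patterns) is omitted because the card does not describe it.

```python
import re

# Letters the card lists as exclusive to Russian (both cases).
RUSSIAN_ONLY_CHARS = set("ыэёЫЭЁ")
# Only the four example words named in the card; the real list is presumably longer.
RUSSIAN_ONLY_WORDS = {"и", "что", "да", "можно"}

# In Python 3, \w matches Unicode letters, including Cyrillic, for str patterns.
WORD_RE = re.compile(r"\w+")


def looks_russian(line: str) -> bool:
    """Return True if a line contains a Russian-only letter or a listed Russian word."""
    if any(ch in RUSSIAN_ONLY_CHARS for ch in line):
        return True
    words = {w.lower() for w in WORD_RE.findall(line)}
    return bool(words & RUSSIAN_ONLY_WORDS)


def filter_lines(text: str) -> str:
    """Evaluate each line independently and keep only lines not flagged as Russian."""
    kept = [line for line in text.splitlines() if not looks_russian(line)]
    return "\n".join(kept)


# Hypothetical input for illustration only.
sample = "Це українське речення.\nЭто русское предложение.\nЯ и ты."
cleaned = filter_lines(sample)
print(cleaned)  # prints only the first (Ukrainian) line
```

A heuristic of this kind can misfire in both directions, which is consistent with the card's warning that some Russian text may remain. For example, a Ukrainian line that quotes a Russian name containing "ы" would be dropped, while a Russian line that avoids the listed letters and words would pass.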
## Bias, Risks, and Limitations

- **Incomplete filtering**: Some Russian text may remain due to edge cases or similar words between languages
- **No quality scores**: Unlike some corpora, this dataset does not include document-level quality scores
- **Source bias**: Inherits any biases present in the original web crawl sources
- **Temporal bias**: Reflects web content from the time period of the original crawl
- **Domain distribution**: Web content is heavily represented compared to other text types

### Recommendations

Users should:

- Validate the dataset's suitability for their specific use case
- Consider combining with other Ukrainian corpora for better domain coverage
- Apply additional quality filtering if needed for production use
- Be aware that some Russian content may remain in edge cases

## Usage

### Basic Loading

```python
from datasets import load_dataset

dataset = load_dataset("podarok/kobza-cleaned-ua", split="train")
print(f"Total documents: {len(dataset):,}")
```

### Streaming Mode (Recommended for Large-Scale Training)

```python
from datasets import load_dataset

dataset = load_dataset("podarok/kobza-cleaned-ua", split="train", streaming=True)
for doc in dataset:
    print(doc["text"])
```

### Filter by Source

```python
# Get only news content
news_dataset = dataset.filter(lambda x: x["source"] == "ukrainian-news")

# Get multiple sources
web_dataset = dataset.filter(
    lambda x: x["source"] in ["hplt-2.0", "fineweb-2"]
)
```

## Citation

### Original Dataset

```bibtex
@misc{kobza2024,
  title={Kobza: Ukrainian Language Corpus},
  author={Goader},
  year={2024},
  url={https://huggingface.co/datasets/Goader/kobza}
}
```

### This Dataset

If you use this cleaned version, please cite both the original kobza dataset and reference this filtered version.

## Dataset Card Authors

- podarok (cleaning and curation)

## Dataset Card Contact

For questions or issues, please open an issue at https://github.com/podarok/ua_ai_v2
Provider:
podarok