Zarinaaa/commoncrawl_dataset

Name: Zarinaaa/commoncrawl_dataset
Creator: Zarinaaa
Published: 2026-02-14 11:12:14
License: No description available

Hugging Face · Updated 2026-02-14 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Zarinaaa/commoncrawl_dataset

Resource description:

---
license: cc0-1.0
tags:
- kyrgyz
- common-crawl
- web-scraping
- text-corpus
- low-resource-languages
- nlp
- turkic-languages
language:
- ky
task_categories:
- text-generation
- fill-mask
- text-classification
size_categories:
- 100M<n<1B
pretty_name: Kyrgyz CommonCrawl Text Corpus
---

# Kyrgyz CommonCrawl Dataset

A **271 MB** text corpus of Kyrgyz language data extracted from [CommonCrawl](https://commoncrawl.org/), one of the largest openly available Kyrgyz text collections for NLP research.

---

## Dataset Description

This dataset contains Kyrgyz-language web text scraped from CommonCrawl archives, filtered by the Kyrgyz language tag (`ky`). The data covers a wide range of domains including news, blogs, government sites, educational content, and general web pages.

**Why this matters:** Kyrgyz is a low-resource Turkic language spoken by ~7 million people. High-quality text corpora are essential for training language models, yet very few large-scale Kyrgyz datasets exist publicly.

---

## Dataset Summary

| Property | Value |
|----------|-------|
| **Total size** | 271 MB |
| **Language** | Kyrgyz (ky) |
| **Format** | CSV |
| **Source** | CommonCrawl (filtered by `ky` language tag) |
| **Files** | 31 CSV files |
| **License** | CC0 (public domain) |

---

## File Structure

| File | Size | Description |
|------|------|-------------|
| `data_MN.csv` | 29.6 MB | Large text segment |
| `data.csv` | 9.98 MB | General web text |
| `data_bilesinbi.csv` | 7.11 MB | Domain-specific data |
| `Merged file2.csv` | 1.66 MB | Merged text segments |
| `data_f.csv` | 703 kB | Filtered subset |
| `8april_final.csv` | 634 kB | Cleaned snapshot |
| `data.numbers` | 581 kB | Statistics/metadata |
| `bia.csv` | 37.6 kB | Small subset |
| `data_ecoproduct.csv` | 19.9 kB | Eco/product domain |
| ... | ... | Additional CSV files |

---

## Use Cases

- **Language model pretraining**: Training or fine-tuning LLMs for Kyrgyz (e.g., GPT, BERT, LLaMA)
- **Text classification**: Building Kyrgyz text classifiers
- **Machine translation**: Source data for Kyrgyz ↔ other language pairs
- **Linguistic research**: Studying modern Kyrgyz web language usage
- **Punctuation / grammar models**: Training data for text normalization tools
- **NER & information extraction**: Building Kyrgyz entity recognizers

---

## Data Collection

The data was collected by:

1. Querying CommonCrawl archives for pages tagged with the Kyrgyz language identifier (`ky`)
2. Extracting text content from the matched web pages
3. Cleaning and organizing into CSV format
4. Deduplication and quality filtering

---

## Preprocessing Recommendations

Before using this dataset, consider:

- **Deduplication**: Web-crawled data often contains duplicate paragraphs across pages
- **Language verification**: Some pages may contain mixed-language content (Kyrgyz + Russian is common)
- **Quality filtering**: Remove boilerplate (navigation menus, footers, cookie notices)
- **Encoding normalization**: Ensure consistent Cyrillic encoding (UTF-8)

---

## Limitations

- **Web-crawled data** may contain noise, boilerplate HTML artifacts, and mixed-language content
- **No manual curation**: quality varies across files
- **Potential duplicates** across different CSV files
- **Bias toward web-present content**: overrepresentation of news and government text, underrepresentation of informal speech

---

## Related Resources

- 🤗 [Kyrgyz Punctuation Model](https://huggingface.co/Zarinaaa/punctuator_model): Trained using data from this corpus
- 🤗 [Kyrgyz Morphological Analysis](https://huggingface.co/Zarinaaa/morphological_analysis): BERT-based morphological tagger

---

## Citation

```bibtex
@dataset{uvalieva2024kyrgyz_commoncrawl,
  author = {Uvalieva, Zarina},
  title  = {Kyrgyz CommonCrawl Text Corpus},
  year   = {2024},
  url    = {https://huggingface.co/datasets/Zarinaaa/commoncrawl_dataset}
}
```

---

## Author

**Zarina Uvalieva**, ML Engineer specializing in NLP for low-resource languages.

- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
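
---

## Example: Applying the Preprocessing Recommendations

The preprocessing recommendations above (deduplication, language verification, encoding normalization) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's pipeline. The dataset card does not document the CSV column layout, so the column name `text` is a hypothetical placeholder, as are the helper names `normalize`, `looks_kyrgyz`, `iter_texts`, and `clean_corpus`. The language check is a crude heuristic that only looks for the three letters Kyrgyz adds to the Russian alphabet (ң, ө, ү); it helps separate Kyrgyz from Russian but is not a real language identifier.

```python
import csv
import hashlib
import unicodedata
from pathlib import Path

# Letters in the Kyrgyz Cyrillic alphabet that Russian does not use.
KYRGYZ_ONLY = set("ңөүҢӨҮ")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse whitespace runs."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def looks_kyrgyz(text: str, min_ratio: float = 0.01) -> bool:
    """Heuristic only: require a small share of Kyrgyz-specific letters.

    The threshold is an arbitrary assumption and should be tuned on real data.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    hits = sum(ch in KYRGYZ_ONLY for ch in letters)
    return hits / len(letters) >= min_ratio


def iter_texts(data_dir: Path, column: str = "text"):
    """Yield raw strings from every CSV file in data_dir.

    `column` is hypothetical; inspect the real file headers first.
    """
    for path in sorted(data_dir.glob("*.csv")):
        with path.open(encoding="utf-8", newline="") as handle:
            for row in csv.DictReader(handle):
                value = row.get(column)
                if value:
                    yield value


def clean_corpus(texts):
    """Normalize each text, drop non-Kyrgyz-looking ones, remove exact duplicates."""
    seen = set()
    for raw in texts:
        text = normalize(raw)
        if not text or not looks_kyrgyz(text):
            continue
        # Store a digest rather than the full string to limit memory use.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

With the files downloaded to a local directory, `clean_corpus(iter_texts(Path("commoncrawl_dataset")))` would stream the filtered texts; the directory name here is also an assumption. Exact-match hashing only catches identical strings after normalization, so near-duplicate paragraphs and boilerplate such as navigation menus would still need a separate filtering step.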

Provided by:

Zarinaaa
