
Zarinaaa/commoncrawl_dataset

Source: Hugging Face. Updated 2026-02-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Zarinaaa/commoncrawl_dataset
Resource description:
---
license: cc0-1.0
tags:
- kyrgyz
- common-crawl
- web-scraping
- text-corpus
- low-resource-languages
- nlp
- turkic-languages
language:
- ky
task_categories:
- text-generation
- fill-mask
- text-classification
size_categories:
- 100M<n<1B
pretty_name: Kyrgyz CommonCrawl Text Corpus
---

# Kyrgyz CommonCrawl Dataset

A **271 MB** text corpus of Kyrgyz language data extracted from [CommonCrawl](https://commoncrawl.org/), one of the largest openly available Kyrgyz text collections for NLP research.

---

## Dataset Description

This dataset contains Kyrgyz-language web text scraped from CommonCrawl archives, filtered by the Kyrgyz language tag (`ky`). The data covers a wide range of domains including news, blogs, government sites, educational content, and general web pages.

**Why this matters:** Kyrgyz is a low-resource Turkic language spoken by ~7 million people. High-quality text corpora are essential for training language models, yet very few large-scale Kyrgyz datasets exist publicly.

---

## Dataset Summary

| Property | Value |
|----------|-------|
| **Total size** | 271 MB |
| **Language** | Kyrgyz (ky) |
| **Format** | CSV |
| **Source** | CommonCrawl (filtered by `ky` language tag) |
| **Files** | 31 CSV files |
| **License** | CC0 (public domain) |

---

## File Structure

| File | Size | Description |
|------|------|-------------|
| `data_MN.csv` | 29.6 MB | Large text segment |
| `data.csv` | 9.98 MB | General web text |
| `data_bilesinbi.csv` | 7.11 MB | Domain-specific data |
| `Merged file2.csv` | 1.66 MB | Merged text segments |
| `data_f.csv` | 703 kB | Filtered subset |
| `8april_final.csv` | 634 kB | Cleaned snapshot |
| `data.numbers` | 581 kB | Statistics/metadata |
| `bia.csv` | 37.6 kB | Small subset |
| `data_ecoproduct.csv` | 19.9 kB | Eco/product domain |
| ... | ... | Additional CSV files |

---

## Use Cases

- **Language model pretraining**: Training or fine-tuning LLMs for Kyrgyz (e.g., GPT, BERT, LLaMA)
- **Text classification**: Building Kyrgyz text classifiers
- **Machine translation**: Source data for Kyrgyz ↔ other language pairs
- **Linguistic research**: Studying modern Kyrgyz web language usage
- **Punctuation / grammar models**: Training data for text normalization tools
- **NER & information extraction**: Building Kyrgyz entity recognizers

---

## Data Collection

The data was collected by:

1. Querying CommonCrawl archives for pages tagged with the Kyrgyz language identifier (`ky`)
2. Extracting text content from the matched web pages
3. Cleaning and organizing into CSV format
4. Deduplication and quality filtering

---

## Preprocessing Recommendations

Before using this dataset, consider:

- **Deduplication**: Web-crawled data often contains duplicate paragraphs across pages
- **Language verification**: Some pages may contain mixed-language content (Kyrgyz + Russian is common)
- **Quality filtering**: Remove boilerplate (navigation menus, footers, cookie notices)
- **Encoding normalization**: Ensure consistent Cyrillic encoding (UTF-8)

---

## Limitations

- **Web-crawled data** may contain noise, boilerplate HTML artifacts, and mixed-language content
- **No manual curation**: quality varies across files
- **Potential duplicates** across different CSV files
- **Bias toward web-present content**: overrepresentation of news and government text, underrepresentation of informal speech

---

## Related Resources

- 🤗 [Kyrgyz Punctuation Model](https://huggingface.co/Zarinaaa/punctuator_model): Trained using data from this corpus
- 🤗 [Kyrgyz Morphological Analysis](https://huggingface.co/Zarinaaa/morphological_analysis): BERT-based morphological tagger

---

## Citation

```bibtex
@dataset{uvalieva2024kyrgyz_commoncrawl,
  author = {Uvalieva, Zarina},
  title = {Kyrgyz CommonCrawl Text Corpus},
  year = {2024},
  url = {https://huggingface.co/datasets/Zarinaaa/commoncrawl_dataset}
}
```

---

## Author

**Zarina Uvalieva**, ML Engineer specializing in NLP for low-resource languages.

- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
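
---

## Example: Preprocessing Sketch

The preprocessing recommendations above (encoding normalization, deduplication, language verification) can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline. The function names and the `min_ratio` threshold are hypothetical, and because the card does not document the CSV column layout, the functions operate on plain strings rather than on a specific column. The language check is a crude heuristic that relies only on the fact that the Kyrgyz Cyrillic alphabet has three letters (ө, ү, ң) that Russian lacks; a trained language identifier would be more reliable.

```python
import hashlib
import unicodedata

# Letters present in the Kyrgyz Cyrillic alphabet but absent from Russian.
KYRGYZ_ONLY_LETTERS = set("өүңӨҮҢ")


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def looks_kyrgyz(text: str, min_ratio: float = 0.01) -> bool:
    """Heuristic check: share of Kyrgyz-specific letters among Cyrillic letters.

    min_ratio is an arbitrary illustrative threshold, not a tuned value.
    """
    cyrillic = [ch for ch in text if "\u0400" <= ch <= "\u04ff"]
    if not cyrillic:
        return False
    specific = sum(ch in KYRGYZ_ONLY_LETTERS for ch in cyrillic)
    return specific / len(cyrillic) >= min_ratio


def clean_paragraphs(paragraphs):
    """Normalize, drop exact duplicates, and keep paragraphs that look Kyrgyz."""
    seen = set()
    kept = []
    for raw in paragraphs:
        text = normalize_text(raw)
        if not text:
            continue  # skip empty or whitespace-only paragraphs
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if looks_kyrgyz(text):
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "Кыргыз тили  түрк тилдерине кирет.",
        "Кыргыз тили түрк тилдерине кирет.",
        "Это русский текст.",
    ]
    for paragraph in clean_paragraphs(sample):
        print(paragraph)
```

Hashing the normalized text only catches exact duplicates; near-duplicate paragraphs and boilerplate such as menus or footers would need additional filtering.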
Provider:
Zarinaaa