kurumikz/Zerde-QA-Wiki-20K

Name: kurumikz/Zerde-QA-Wiki-20K
Creator: kurumikz
Published: 2026-04-10 12:06:04
License: ODC-BY (as stated in the dataset card)

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/kurumikz/Zerde-QA-Wiki-20K

Description:

---
language:
- kk
- kaz
license: odc-by
pretty_name: "Zerde-QA-Wiki-20K — Kazakh Wikipedia Question Answering Dataset"
task_categories:
- question-answering
- text-generation
task_ids:
- open-domain-qa
- extractive-qa
- closed-book-qa
tags:
- kazakh
- kk
- kaz
- wikipedia
- qa
- instruction-tuning
- reading-comprehension
- nlp
- central-asia
- llm-training
- zerde
- zerde-series
- open-domain
- analytical-qa
- encyclopedic
annotations_creators:
- machine-generated
language_creators:
- found
multilinguality:
- monolingual
source_datasets:
- kurumikz/Cleaned-Kazakh-Wikipedia
size_categories:
- 10K<n<100K
format:
- jsonl
modality:
- text
library:
- datasets
- pandas
- polars
- mlcroissant
---

<div align="center">

# Z E R D E — QA — Wiki — 20K

**Knowledge-Grounded Kazakh Question Answering Dataset**

Part of the **[Zerde Series](https://huggingface.co/collections/kurumikz/zerde)** · Built by **[@kurumikz](https://huggingface.co/kurumikz)**

`kk / kaz` · `20,002 pairs` · `23 categories` · `6 question types` · `ODC-BY`

---

*Every question. Every answer. Grounded in real Kazakh encyclopedic text.*

</div>

---

## Overview

**Zerde-QA-Wiki-20K** is an encyclopedic, knowledge-grounded question answering dataset written entirely in the **Kazakh language** (ISO 639-1: `kk`, ISO 639-2: `kaz`).

Unlike purely synthetic QA datasets, every question-answer pair originates from a real article taken from **[kurumikz/Cleaned-Kazakh-Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia)**, a curated corpus of 228,810 Kazakh-language encyclopedia articles. No translations, no cross-lingual data, no fabricated content: all text is natively Kazakh, written in Cyrillic script.

The dataset spans **23 thematic categories**, from Kazakh history, linguistics, and literature to physics, medicine, and AI, reflecting the full breadth of knowledge preserved in the Kazakh Wikipedia.
Each record contains a well-formed analytical question and a substantive answer grounded exclusively in the source article.

**Zerde-QA-Wiki-20K is the second release in the Zerde dataset family.** The first was [Zerde-QA-50K](https://huggingface.co/datasets/kurumikz/Zerde-QA-50K), a large-scale synthetic QA dataset focused on technical and STEM domains. Where Zerde-QA-50K offers depth in structured technical knowledge, Zerde-QA-Wiki-20K extends coverage into humanities, history, natural sciences, and social disciplines through real encyclopedic sources.

---

## 📋 Dataset at a Glance

| Property | Value |
|:---|:---|
| **Language** | Kazakh (`kk` / `kaz`) |
| **Script** | Cyrillic (99.4%) · Latin (0.6%) |
| **Total QA pairs** | 20,002 |
| **Total words** | 2,694,726 (~2.69M) |
| **Total characters** | 23,065,485 (~23.1M) |
| **Kazakh characters** | 19,693,469 (~19.7M) |
| **Sentences** | 133,231 (~20 words/sentence) |
| **Avg. words per record** | ~135 |
| **File size** | 43.8 MB (JSONL) |
| **Primary source** | [kurumikz/Cleaned-Kazakh-Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia) |
| **Thematic categories** | 23 |
| **Question types** | 6 analytical types |
| **Avg. question length** | 18.5 words |
| **Question length range** | 8 – 77 words |
| **Avg. answer length** | 112.3 words |
| **Answer length range** | 50 – 224 words |
| **License** | ODC-BY |
| **Series** | Zerde |

---

## 📊 Script Distribution

| Script | Characters | Share |
|:---|---:|---:|
| Cyrillic (Kazakh) | 19,600,000+ | 99.4% |
| Latin | 111,710 | 0.6% |
| CJK (Chinese / Japanese) | 72 | < 0.01% |
| Arabic | 67 | < 0.01% |
| Korean (Hangul) | 19 | < 0.01% |
| Hebrew | 19 | < 0.01% |
| Hiragana | 7 | < 0.01% |

The negligible presence of non-Cyrillic characters reflects proper nouns, scientific terminology, and quoted foreign terms within Kazakh encyclopedic articles. The dataset is effectively monolingual Kazakh Cyrillic.
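The card does not say how the script shares were measured. The following is a minimal sketch of one way to estimate them, assuming letters are classified by the first word of their Unicode character name; the helper names `script_of` and `script_shares` are illustrative and not part of the dataset tooling, and the result may differ from the table if the original count also included digits or punctuation.

```python
import unicodedata
from collections import Counter
from typing import Dict, Iterable, Optional


def script_of(ch: str) -> Optional[str]:
    """Return a coarse script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation, and whitespace are not counted
    name = unicodedata.name(ch, "")
    # The first word of a Unicode character name is usually its script,
    # e.g. "CYRILLIC SMALL LETTER A" or "LATIN CAPITAL LETTER Z".
    return name.split(" ", 1)[0] if name else "UNKNOWN"


def script_shares(texts: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of letters per script, most frequent first."""
    counts: Counter = Counter()
    for text in texts:
        for ch in text:
            label = script_of(ch)
            if label is not None:
                counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.most_common()}
```

To apply it to the dataset, pass the `question` and `answer` columns of the loaded split, for example `script_shares(ds["train"]["question"] + ds["train"]["answer"])`.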
---

## 🗂 Categories by Record Count

Categories are listed in roughly descending order of record count, with the six categories capped at 1,500 records first. The dominance of history, earth sciences, and geopolitics reflects the actual distribution of articles in the Kazakh Wikipedia: categories that are richest in encyclopedic source material naturally yield more records.

| # | Category | Records | Avg. Q (words) | Avg. A (words) | Max A | Min A |
|:--|:---|---:|---:|---:|---:|---:|
| 1 | Қазақстан тарихы және Түркітану | 1,500 | 22.0 | 121.4 | 192 | 65 |
| 2 | Дүниежүзілік тарих және Өркениеттер | 1,500 | 20.4 | 118.5 | 204 | 57 |
| 3 | Геосаясат және Халықаралық қатынастар | 1,500 | 17.9 | 115.6 | 219 | 60 |
| 4 | Әдебиет және Мәдениеттану | 1,500 | 19.6 | 114.3 | 224 | 54 |
| 5 | Классикалық Физика және Механика | 1,500 | 15.5 | 109.1 | 180 | 59 |
| 6 | Биология және Өмір ғылымдары | 1,500 | 18.9 | 108.8 | 184 | 55 |
| 7 | Жер туралы ғылымдар және Климатология | 2,459 | 21.4 | 116.1 | 204 | 50 |
| 8 | Экономика және Қаржы | 1,374 | 17.3 | 111.2 | 181 | 65 |
| 9 | Медицина және Денсаулық сақтау | 1,376 | 17.3 | 107.7 | 178 | 61 |
| 10 | Лингвистика және Тіл білімі | 1,139 | 17.9 | 107.9 | 180 | 61 |
| 11 | Химия және Материалтану | 907 | 16.6 | 105.3 | 166 | 61 |
| 12 | Философия және Этика | 827 | 16.3 | 109.9 | 219 | 66 |
| 13 | АТ және Жүйелік архитектура | 760 | 16.7 | 107.6 | 185 | 63 |
| 14 | Экология және Қоршаған орта | 735 | 18.0 | 112.4 | 172 | 71 |
| 15 | Теориялық және Қолданбалы Математика | 376 | 17.5 | 105.5 | 162 | 59 |
| 16 | Психология және Психиатрия | 321 | 15.5 | 105.8 | 164 | 63 |
| 17 | Бағдарламалау тілдері және Инженерия | 187 | 16.0 | 106.2 | 170 | 76 |
| 18 | Кванттық Физика және Жоғары энергиялар | 186 | 16.4 | 108.6 | 150 | 62 |
| 19 | Орта Азия және Еуразия тарихы | 160 | 20.6 | 116.9 | 193 | 75 |
| 20 | Киберқауіпсіздік және Криптография | 99 | 16.3 | 108.8 | 149 | 55 |
| 21 | Статистика және Ықтималдықтар теориясы | 36 | 16.1 | 112.4 | 158 | 90 |
| 22 | Деректер базасы және Үлкен деректер | 25 | 16.4 | 106.8 | 148 | 85 |
| 23 | Жасанды Интеллект, ML және Дата-сайнс | 22 | 15.9 | 102.3 | 118 | 84 |

> **Note:** Categories #1–6 were capped at 1,500 records by design (balance gate). "Жер туралы ғылымдар" exceeded the cap because it was identified as a separate sub-category during generation and the combined count was reconciled post-collection.

---

## 🔬 Question Types

Every question belongs to one of six analytical types, designed to elicit structured, reasoned answers rather than simple factual lookups.

| Type | Description |
|:---|:---|
| **causal** | Analyze cause-and-effect relationships described in the article |
| **comparative** | Compare two aspects, phenomena, periods, or systems from the text |
| **evaluative** | Assess the significance, importance, or impact of a described phenomenon |
| **applied** | Explain practical real-world consequences, uses, or applications |
| **analytical** | Break down and explain an underlying mechanism, structure, or process |
| **definitional** | Define a concept or term strictly based on information present in the text |

---

## 📦 Data Format

Each record is stored as a JSON object on a single line in `.jsonl` format.

```json
{
  "id": 1,
  "category": "Қазақстан тарихы және Түркітану",
  "question_type": "causal",
  "question": "Қазақ хандығының құрылуына қандай саяси және әлеуметтік жағдайлар себеп болды?",
  "answer": "Қазақ хандығы XV ғасырдың ортасында Керей мен Жәнібек сұлтандардың Өзбек хандығынан бөлінуі нәтижесінде пайда болды. Бұл бөлінудің негізгі себептері саяси бытыраңқылық, жайылым үшін бақталастық және жаңа этностық топтың — қазақтардың — жеке мемлекеттілікке деген ұмтылысы болды..."
}
```

### Field Reference

| Field | Type | Constraints | Description |
|:---|:---|:---|:---|
| `id` | `int` | Sequential, unique | Unique record identifier |
| `category` | `string` | One of 23 values | Thematic category inferred from source article |
| `question_type` | `string` | One of 6 values | Analytical question type |
| `question` | `string` | ≥ 12 words | Question in Kazakh Cyrillic |
| `answer` | `string` | ≥ 300 characters | Answer in Kazakh, grounded in source article only |

---

## ⚙️ How the Dataset Was Built

Zerde-QA-Wiki-20K was constructed through a **two-stage automated pipeline** operating on the [Cleaned Kazakh Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia), a corpus of 228,810 cleaned, namespace-0 Kazakh Wikipedia articles.

**Stage 1: Question Generation**

Each Wikipedia article is read and analyzed in isolation. Based on the article's content, a thematic category is assigned and an analytical question is formulated, one that requires understanding the material rather than simply extracting a surface fact. The question type is selected to ensure variety across the dataset.

**Stage 2: Category Validation and Answer Generation**

Before any answer is written, the proposed category assignment undergoes a dedicated validation step. If the article content does not genuinely match the assigned category, the entire record is discarded. Only records that pass this check proceed to answer generation. The answer is then written strictly from information present within the source article; no external knowledge is introduced.
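The control flow of the two stages can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: the three callables `propose`, `category_matches`, and `write_answer` are hypothetical stand-ins for the generation and validation steps, which the card does not specify. Only the ordering of the steps and the length thresholds come from the description and the Field Reference table.

```python
from typing import Callable, Optional

QUESTION_TYPES = {
    "causal", "comparative", "evaluative",
    "applied", "analytical", "definitional",
}
MIN_QUESTION_WORDS = 12  # constraint from the Field Reference table
MIN_ANSWER_CHARS = 300   # constraint from the Field Reference table


def build_record(
    article: str,
    record_id: int,
    propose: Callable[[str], dict],                # hypothetical stage-1 generator
    category_matches: Callable[[str, str], bool],  # hypothetical category validator
    write_answer: Callable[[str, str], str],       # hypothetical answer writer
) -> Optional[dict]:
    """Return a finished record, or None if any check rejects it."""
    # Stage 1: assign a category and formulate a question from the article alone.
    # The draft is expected to hold: category, question_type, question.
    draft = propose(article)
    if draft["question_type"] not in QUESTION_TYPES:
        return None
    if len(draft["question"].split()) < MIN_QUESTION_WORDS:
        return None

    # Stage 2a: validate the category before any answer is generated.
    if not category_matches(article, draft["category"]):
        return None  # the whole record is discarded

    # Stage 2b: write the answer from the article text only.
    answer = write_answer(article, draft["question"])
    if len(answer) < MIN_ANSWER_CHARS:
        return None

    return {"id": record_id, **draft, "answer": answer}
```

Placing the category check before answer generation means a rejected record never incurs the cost of the answer step, which matches the ordering described above.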
**Quality gates applied to every record:**

| Gate | Rule |
|:---|:---|
| Question length | ≥ 12 words |
| Answer length | ≥ 300 characters |
| Deduplication | Questions are compared by prefix; near-duplicates are discarded |
| Category balance | Maximum 1,500 records per category |
| Category validation | Dedicated validation step before answer generation |

The result: every answer in the dataset is traceable to a real encyclopedic source, and the distribution across topics is deliberately balanced.

---

## 🚀 Quick Start

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and show the first record.
ds = load_dataset("kurumikz/Zerde-QA-Wiki-20K")
print(ds["train"][0])
```

**Filter by category:**

```python
# Convert the train split to a pandas DataFrame and select one category.
df = ds["train"].to_pandas()
history = df[df["category"] == "Қазақстан тарихы және Түркітану"]
print(f"{len(history)} records | avg answer: {history['answer'].str.split().str.len().mean():.1f} words")
```

**Filter by question type:**

```python
# Reuses `df` from the previous snippet.
causal = df[df["question_type"] == "causal"]
print(causal.sample(5)["question"].values)
```

**Combine with Zerde-QA-50K:**

```python
from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("kurumikz/Zerde-QA-Wiki-20K", split="train")
synth = load_dataset("kurumikz/Zerde-QA-50K", split="train")

# concatenate_datasets requires both datasets to have the same features.
combined = concatenate_datasets([wiki, synth])
print(f"Combined Zerde corpus: {len(combined):,} QA pairs")
```

---

## 🆚 Zerde-QA-Wiki-20K vs Zerde-QA-50K

| | [Zerde-QA-50K](https://huggingface.co/datasets/kurumikz/Zerde-QA-50K) | **Zerde-QA-Wiki-20K** |
|:---|:---|:---|
| **Source type** | Synthetic: topic-prompted generation | Real: Kazakh Wikipedia articles |
| **Domain focus** | STEM, IT, AI, Mathematics | All knowledge domains (23 categories) |
| **Text grounding** | Thematic prompts | Real encyclopedic articles |
| **Answer style** | Technical, expository | Analytical, encyclopedic |
| **Validation stages** | Single-stage | Two-stage (category + answer) |
| **Total pairs** | 51,422 | 20,002 |
| **Avg. answer length** | — | 112.3 words |
| **Question depth** | Domain-specific | 6 structured analytical types |
| **License** | ODC-BY | ODC-BY |

Both datasets share the same language, script, and JSONL structure, so they can be merged directly for broader coverage.

---

## 🗺 The Zerde Series

**Zerde (Зерде)** means *intelligence*, *memory*, *keen mind* in Kazakh.

The Zerde series is an ongoing effort to build open, high-quality, natively Kazakh datasets for LLM training, fine-tuning, and NLP research. Each release targets a different source type or domain, building toward a comprehensive open benchmark for the Kazakh language.

| Dataset | Status | Size | Description |
|:---|:---:|---:|:---|
| [**Zerde-QA-50K**](https://huggingface.co/datasets/kurumikz/Zerde-QA-50K) | ✅ Released | 51,422 pairs | Synthetic QA across STEM and IT domains |
| **Zerde-QA-Wiki-20K** | ✅ Released | 20,002 pairs | Encyclopedic QA from Kazakh Wikipedia |
| Zerde-QA-News | 🔜 Planned | — | QA pairs from Kazakh news corpora |
| Zerde-Instruct | 🔜 Planned | — | General-purpose instruction-following |
| Zerde-QA-Law | 🔜 Planned | — | Kazakhstani legal and regulatory texts |

---

## 📚 Primary Source

This dataset was built exclusively from:

> **[kurumikz/Cleaned-Kazakh-Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia)**
> 228,810 articles · Kazakh Wikipedia (kkwiki) · Namespace 0 (main articles only) · Cyrillic script · JSONL

No external knowledge sources, web scraping, or translation pipelines were used. All content is natively Kazakh and originates directly from the above corpus.

---

## 📜 License

Released under the **Open Data Commons Attribution License (ODC-BY)**. You are free to use, share, and adapt this dataset for any purpose, including commercial use, as long as appropriate credit is given.
**Citation:**

```bibtex
@dataset{zerde_qa_wiki_20k_2025,
  author    = {kurumikz},
  title     = {Zerde-QA-Wiki-20K: Knowledge-Grounded Kazakh Question Answering Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  series    = {Zerde},
  url       = {https://huggingface.co/datasets/kurumikz/Zerde-QA-Wiki-20K}
}
```

---

---

# Z E R D E — QA — Wiki — 20K · Summary (translated from the Kazakh description)

**Knowledge-grounded Kazakh question answering dataset**

Part of the **[Zerde Series](https://huggingface.co/collections/kurumikz/zerde)** · **[@kurumikz](https://huggingface.co/kurumikz)**

---

## General Description

**Zerde-QA-Wiki-20K** is a question answering dataset written entirely in the **Kazakh language** and grounded in an encyclopedic source.

Every record in the dataset relies on a real article taken from **[kurumikz/Cleaned-Kazakh-Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia)**, a corpus of 228,810 articles. No translations, no synthetic text: all content is written natively in Kazakh, in Cyrillic script.

The dataset covers **23 thematic categories**.

This is the second release of the **Zerde series**. The first release, **[Zerde-QA-50K](https://huggingface.co/datasets/kurumikz/Zerde-QA-50K)**, is a synthetically generated technical dataset. Zerde-QA-Wiki-20K complements it in the humanities, natural sciences, and social sciences through real encyclopedic texts.

---

## Key Figures

| Property | Value |
|:---|:---|
| **Language** | Kazakh (`kk` / `kaz`) |
| **Script** | Cyrillic (99.4%) · Latin (0.6%) |
| **Total pairs** | 20,002 |
| **Total words** | 2,694,726 (~2.69M) |
| **Total characters** | 23,065,485 (~23.1M) |
| **Kazakh characters** | 19,693,469 (~19.7M) |
| **File size** | 43.8 MB (JSONL) |
| **Source data** | [kurumikz/Cleaned-Kazakh-Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia) |
| **Categories** | 23 |
| **Question types** | 6 |
| **Avg. question length** | 18.5 words (8 – 77) |
| **Avg. answer length** | 112.3 words (50 – 224) |
| **License** | ODC-BY |

---

## 📊 Categories by Record Count

| # | Category | Records | Avg. Q (words) | Avg. A (words) | Max A | Min A |
|:--|:---|---:|---:|---:|---:|---:|
| 1 | Қазақстан тарихы және Түркітану | 1,500 | 22.0 | 121.4 | 192 | 65 |
| 2 | Дүниежүзілік тарих және Өркениеттер | 1,500 | 20.4 | 118.5 | 204 | 57 |
| 3 | Геосаясат және Халықаралық қатынастар | 1,500 | 17.9 | 115.6 | 219 | 60 |
| 4 | Әдебиет және Мәдениеттану | 1,500 | 19.6 | 114.3 | 224 | 54 |
| 5 | Классикалық Физика және Механика | 1,500 | 15.5 | 109.1 | 180 | 59 |
| 6 | Биология және Өмір ғылымдары | 1,500 | 18.9 | 108.8 | 184 | 55 |
| 7 | Жер туралы ғылымдар және Климатология | 2,459 | 21.4 | 116.1 | 204 | 50 |
| 8 | Экономика және Қаржы | 1,374 | 17.3 | 111.2 | 181 | 65 |
| 9 | Медицина және Денсаулық сақтау | 1,376 | 17.3 | 107.7 | 178 | 61 |
| 10 | Лингвистика және Тіл білімі | 1,139 | 17.9 | 107.9 | 180 | 61 |
| 11 | Химия және Материалтану | 907 | 16.6 | 105.3 | 166 | 61 |
| 12 | Философия және Этика | 827 | 16.3 | 109.9 | 219 | 66 |
| 13 | АТ және Жүйелік архитектура | 760 | 16.7 | 107.6 | 185 | 63 |
| 14 | Экология және Қоршаған орта | 735 | 18.0 | 112.4 | 172 | 71 |
| 15 | Теориялық және Қолданбалы Математика | 376 | 17.5 | 105.5 | 162 | 59 |
| 16 | Психология және Психиатрия | 321 | 15.5 | 105.8 | 164 | 63 |
| 17 | Бағдарламалау тілдері және Инженерия | 187 | 16.0 | 106.2 | 170 | 76 |
| 18 | Кванттық Физика және Жоғары энергиялар | 186 | 16.4 | 108.6 | 150 | 62 |
| 19 | Орта Азия және Еуразия тарихы | 160 | 20.6 | 116.9 | 193 | 75 |
| 20 | Киберқауіпсіздік және Криптография | 99 | 16.3 | 108.8 | 149 | 55 |
| 21 | Статистика және Ықтималдықтар теориясы | 36 | 16.1 | 112.4 | 158 | 90 |
| 22 | Деректер базасы және Үлкен деректер | 25 | 16.4 | 106.8 | 148 | 85 |
| 23 | Жасанды Интеллект, ML және Дата-сайнс | 22 | 15.9 | 102.3 | 118 | 84 |

---

## Data Format

```json
{
  "id": 1,
  "category": "Қазақстан тарихы және Түркітану",
  "question_type": "causal",
  "question": "Қазақ хандығының құрылуына қандай саяси және әлеуметтік жағдайлар себеп болды?",
  "answer": "Қазақ хандығы XV ғасырдың ортасында Керей мен Жәнібек сұлтандардың Өзбек хандығынан бөлінуі нәтижесінде пайда болды..."
}
```

| Field | Type | Constraints | Description |
|:---|:---|:---|:---|
| `id` | `int` | Unique | Sequential number |
| `category` | `string` | One of 23 values | Category determined from the article content |
| `question_type` | `string` | One of 6 values | Analytical question type |
| `question` | `string` | ≥ 12 words | Question in Kazakh |
| `answer` | `string` | ≥ 300 characters | Answer drawn only from the article text |

---

## Question Types

| Type | Description |
|:---|:---|
| **causal** | Analyze cause-and-effect relationships |
| **comparative** | Compare two aspects, phenomena, or periods |
| **evaluative** | Assess significance or impact |
| **applied** | Explain practical consequences or applications |
| **analytical** | Break down and explain a mechanism or process |
| **definitional** | Define a concept based on the text alone |

---

## How the Dataset Was Built

The dataset was built from the **[kurumikz/Cleaned-Kazakh-Wikipedia](https://huggingface.co/datasets/kurumikz/Cleaned-Kazakh-Wikipedia)** corpus through a **two-stage automated pipeline**.

**Stage 1: Question formulation.** Each article is read and analyzed individually. A thematic category is determined from the article content, and an analytical question is then posed based on the material.

**Stage 2: Validation and answer writing.** Before the answer is written, the category assignment undergoes a separate check. If the article content does not match the category, the record is rejected entirely. An answer is written only for records that pass the check, based solely on information in the article text.
**Quality filters:**

| Filter | Rule |
|:---|:---|
| Question length | ≥ 12 words |
| Answer length | ≥ 300 characters |
| Deduplication | Similar questions are removed |
| Category balance | Maximum 1,500 records per category |
| Category validation | Separate check before the answer is written |

---

## Quick Start

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and show the first record.
ds = load_dataset("kurumikz/Zerde-QA-Wiki-20K")
print(ds["train"][0])
```

---

## The Zerde Series

**Zerde (Зерде)** means *intelligence*, *memory*, *keen mind* in Kazakh.

| Dataset | Status | Size | Description |
|:---|:---:|---:|:---|
| [**Zerde-QA-50K**](https://huggingface.co/datasets/kurumikz/Zerde-QA-50K) | ✅ Released | 51,422 pairs | Synthetic QA, STEM and IT |
| **Zerde-QA-Wiki-20K** | ✅ Released | 20,002 pairs | Encyclopedic QA, Kazakh Wikipedia |
| Zerde-QA-News | 🔜 Planned | — | Based on Kazakh news |
| Zerde-Instruct | 🔜 Planned | — | General-purpose instruction-following |
| Zerde-QA-Law | 🔜 Planned | — | Based on Kazakhstani legislation |

---

## License

The dataset is distributed under the **Open Data Commons Attribution License (ODC-BY)**.

---

*Made by [@kurumikz](https://huggingface.co/kurumikz)*
