
kurumikz/Zerde-QA-50K

Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/kurumikz/Zerde-QA-50K
Resource description:
---
license: odc-by
language:
- kk
task_categories:
- question-answering
- text-generation
tags:
- kazakh
- synthetic
- instruction-tuning
- qa
- nlp
- central-asia
- llm-training
pretty_name: Zerde-QA-50K
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.jsonl"
---

# 🇰🇿 Zerde-QA-50K

> **A large-scale synthetic Kazakh question-answer dataset for instruction tuning and NLP research.**
> Created and maintained by kurumikz. Free to use with attribution.

---

## 📌 Overview

**Zerde-QA-50K** is a synthetically generated open-domain QA dataset written entirely in the **Kazakh language** (`kk`), consisting of **51,422 high-quality question-answer pairs** spanning 20+ academic and professional domains.

Each record follows a clean `{question, answer}` structure with no auxiliary fields. Questions are predominantly analytical, comparative, and explanatory, going well beyond simple factual lookups. This makes the dataset especially suited for training instruction-following models that need to reason and explain, not just retrieve.

Kazakh is spoken by over 17 million people but remains significantly underrepresented in NLP benchmarks and training corpora. Zerde-QA-50K aims to close that gap by providing a large, topically diverse, and structurally consistent resource for the Kazakh NLP community.

> **Attribution required.** If you use this dataset in your work, you must credit the author as specified in the [License](#-license) section below.

---

## 📊 Dataset Statistics

| Metric | Value |
|--------|-------|
| Total records | **51,422** |
| File format | JSONL |
| Total characters | ~58.97 million |
| Cyrillic share | 84.9% |
| Kazakh-specific chars (ә, ғ, қ, ң, ө, ұ, ү, і, һ) | ~8.7M (14.75%) |
| Avg. question length | 207 chars / ~21 words |
| Avg. answer length | 940 chars / ~102 words |
| Answers in 500–2,000 char range | 97.5% |
| Duplicate records | 0 |
| Records with null/empty fields | 0 |

The high proportion of Kazakh-specific characters indicates genuine native-script coverage rather than transliterated or mixed-language content.

### Field Length Stats (in characters)

| Field      | Avg | Median | Min | Max    |
|------------|-----|--------|-----|--------|
| `question` | 207 | ~190   | 20  | 800+   |
| `answer`   | 940 | ~900   | 80  | 3,000+ |

---

## 🗂️ Format

The dataset is stored as **JSONL** (JSON Lines), with one JSON object per line. Every record has exactly two fields:

```json
{
  "question": "Терең нейрондық желілердің классикалық машиналық оқыту әдістерінен негізгі айырмашылығы неде?",
  "answer": "Терең нейрондық желілер иерархиялық ерекшеліктерді автоматты түрде үйренеді — төменгі деңгейдегі қарапайым белгілерден бастап жоғары деңгейдегі күрделі абстракцияларға дейін. Классикалық әдістерде бұл ерекшеліктерді қолмен анықтау қажет болды..."
}
```

| Field      | Type   | Description                      |
|------------|--------|----------------------------------|
| `question` | string | A question in Kazakh (Cyrillic)  |
| `answer`   | string | A detailed explanation in Kazakh |

No null values. No empty strings. No duplicate records. Encoding: **UTF-8**. No BOM.
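
The two-field schema described above can be checked line by line before training. The following is a minimal sketch, not part of the original pipeline: the function name `validate_record` and the sample line are hypothetical and only illustrate the rules stated in the Format section (exactly two string fields, neither empty).

```python
import json

REQUIRED_FIELDS = {"question", "answer"}


def validate_record(line: str) -> dict:
    """Parse one JSONL line and enforce the two-field schema."""
    record = json.loads(line)
    if not isinstance(record, dict) or set(record) != REQUIRED_FIELDS:
        raise ValueError("expected an object with exactly 'question' and 'answer'")
    for field in REQUIRED_FIELDS:
        value = record[field]
        # Reject nulls, non-strings, and whitespace-only strings.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field!r} must be a non-empty string")
    return record


# Hypothetical sample line, not taken from the dataset.
sample = '{"question": "Бұл не?", "answer": "Бұл мысал."}'
record = validate_record(sample)
print(len(record["question"]), len(record["answer"]))
```

Raising on the first violation keeps the sketch short; a batch check would more likely collect the offending line numbers and report them together.
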

---

## 📚 Domains

The dataset spans **20+ broad subject areas**, each with multiple subtopics:

### 💻 Technology & Engineering

- **Information Technology & System Architecture**: OS theory, distributed systems, cloud, IoT, quantum computing, embedded systems
- **Programming Languages & Software Engineering**: paradigms, algorithms, compilers, design patterns, testing, DevOps, CI/CD
- **Databases & Big Data**: SQL/NoSQL, ACID, indexing, data warehouses, stream processing, vector databases
- **Cybersecurity & Cryptography**: CIA triad, network security, penetration testing, zero trust, post-quantum crypto

### 🤖 AI & Data Science

- **Artificial Intelligence, ML & Data Science**: supervised/unsupervised/RL, deep learning, NLP, computer vision, MLOps, transformers, federated learning, AGI

### 🔬 Natural Sciences

- **Mathematics & Formal Logic**: number theory, topology, combinatorics, graph theory, mathematical proofs
- **Physics & Engineering**: thermodynamics, electromagnetism, quantum mechanics, aerodynamics, materials science
- **Chemistry & Biology**: organic chemistry, biochemistry, genetics, ecology, cell biology

### 🏥 Medicine & Health

- **Medicine & Public Health**: anatomy, pharmacology, epidemiology, neuroscience, medical ethics

### 📐 Social Sciences & Humanities

- **Economics & Finance**: macroeconomics, behavioral economics, monetary policy, financial markets
- **Law & Political Science**: constitutional law, international law, political systems, governance
- **History & Archaeology**: world history, civilizations, historiography, archaeological methods
- **Philosophy & Ethics**: epistemology, metaphysics, ethics, philosophy of science
- **Psychology & Cognitive Science**: developmental psychology, cognitive biases, neuroscience, social psychology
- **Sociology & Anthropology**: social structures, cultural anthropology, demography
- **Linguistics & Literature**: language families, morphology, literary analysis, semiotics

### 🎓 Applied & Interdisciplinary

- **Education & Pedagogy**: learning theories, curriculum design, assessment
- **Geography & Environmental Science**: physical geography, climate science, environmental policy
- **International Relations & Regional Studies**: geopolitics, diplomacy, international organizations
- **Culture, Art & Religion**: cultural studies, art history, comparative religion

> **Note on topic labels:** Records do not carry per-item topic tags. The domain list above reflects the intended generation scope; the actual distribution across records may vary.

---

## ⚙️ Generation Process

Pairs were generated synthetically using a large language model prompted entirely in Kazakh. The pipeline was built around five core principles:

**Broad topical coverage.** Prompts were built from a curated pool of 400+ domain-specific subtopics organized into 20+ thematic categories, so that no single subject dominates the corpus.

**Diversity enforcement.** Each generation batch was seeded with 2–3 randomly sampled subtopics, and a sliding window of recently generated questions was fed back to the model to discourage repetition across batches.

**Strict deduplication.** All records were deduplicated at the question level using prefix matching on the first 60 characters (case-normalized) before being written to disk.

**Structural validation.** Only records containing both a non-empty `question` and a non-empty `answer` were accepted. Malformed JSON and incomplete pairs were discarded automatically.

**Parallel generation with fault tolerance.** A multi-worker concurrent pipeline maximized throughput while respecting rate limits, with exponential back-off on failures and automatic retries.
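
The deduplication step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's actual code: the names `dedup_key` and `deduplicate` are hypothetical, and "case-normalized" is interpreted here as `str.casefold()` applied after trimming whitespace and before taking the 60-character prefix.

```python
PREFIX_LEN = 60  # prefix length stated in the description above


def dedup_key(question: str) -> str:
    """Build the comparison key: a case-normalized prefix of the question."""
    # Assumption: trim surrounding whitespace, casefold, then cut the prefix.
    return question.strip().casefold()[:PREFIX_LEN]


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record for each question prefix, preserving order."""
    seen: set[str] = set()
    kept = []
    for record in records:
        key = dedup_key(record["question"])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Hypothetical records for illustration only.
rows = [
    {"question": "Машиналық оқыту дегеніміз не?", "answer": "..."},
    {"question": "МАШИНАЛЫҚ ОҚЫТУ ДЕГЕНІМІЗ НЕ?", "answer": "..."},
    {"question": "Деректер қоры дегеніміз не?", "answer": "..."},
]
print(len(deduplicate(rows)))
```

One consequence of prefix matching is worth keeping in mind: two distinct questions that share their first 60 characters are treated as duplicates, so the second one is dropped even if it diverges later.
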

---

## 🔍 Data Quality

| Check | Result |
|-------|--------|
| JSON parse errors | ✅ 0 |
| Missing `question` or `answer` | ✅ 0 |
| Null / None values | ✅ 0 |
| Exact duplicate records | ✅ 0 |
| Duplicate questions | ✅ 0 |

**Character script breakdown** (~58.97M total chars):

| Script | Share |
|--------|-------|
| Cyrillic | 84.9% |
| Kazakh-specific letters | 14.75% |
| Latin (technical terms) | ~0.1% |
| Digits & punctuation | remainder |

### Known Minor Issues

| Issue | Count | Notes |
|-------|-------|-------|
| Malformed JSON line | 1 | Two objects on one line near row 1424; a strict per-line `json.loads` call rejects this line, so use a tolerant parser or skip it |
| Questions without trailing `?` | 501 (~1%) | Many are imperative prompts ("Түсіндіріңіз…"), which are stylistically valid in Kazakh |
| Double spaces in answers | 2,519 | Cosmetic only; one normalization pass cleans these |
| Answers echoing the question | 3 | Copy-paste generation artifacts |
| Answers with Latin-script terms | 57 | Expected for technical vocabulary (API, ACID, OWASP, etc.) |
| Questions over 400 chars | 316 | Scenario-based or compound prompts, not errors |

---

## 💡 Intended Use

This dataset is designed for:

- **Instruction fine-tuning** of causal language models (GPT-style) in Kazakh
- **Low-resource NLP research**: few-shot and zero-shot learning for underrepresented languages
- **Retrieval-Augmented Generation (RAG)**, as a broad Kazakh-language knowledge base
- **Evaluation** of Kazakh-language comprehension and generation capabilities
- **Pretraining data augmentation** for multilingual models

### Out-of-Scope Uses

- Direct deployment as a factual knowledge base without human verification. The dataset is synthetically generated and may contain inaccuracies.
- Use as a ground-truth benchmark for exact factual recall.
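
Two of the entries in the Known Minor Issues table above, the line holding two JSON objects and the double spaces in answers, can be handled at load time. The sketch below is one possible approach and not the author's tooling: `iter_objects` and `collapse_spaces` are hypothetical helper names, and the sample line only imitates the reported "double object" case.

```python
import json
import re

_decoder = json.JSONDecoder()


def iter_objects(line: str):
    """Yield every JSON value on a line, even when several are concatenated."""
    text = line.strip()
    pos = 0
    while pos < len(text):
        obj, pos = _decoder.raw_decode(text, pos)
        yield obj
        # raw_decode does not skip leading whitespace, so advance past it.
        while pos < len(text) and text[pos].isspace():
            pos += 1


def collapse_spaces(text: str) -> str:
    """Replace runs of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


# Hypothetical line imitating the "double object" case.
line = '{"question": "A?", "answer": "x  y"}{"question": "B?", "answer": "z"}'
for obj in iter_objects(line):
    obj["answer"] = collapse_spaces(obj["answer"])
    print(obj)
```

Decoding with `raw_decode` recovers both records from the affected line instead of discarding it, which matters little for one line but keeps the record count reproducible.
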

### Recommended Training Format

For instruction fine-tuning, wrap records like this:

```
### Сұрақ:
{question}

### Жауап:
{answer}
```

Or in chat/instruct format:

```
<|user|>
{question}
<|assistant|>
{answer}
```

---

## ⚙️ Loading the Dataset

### With 🤗 Hugging Face `datasets`

```python
from datasets import load_dataset

ds = load_dataset("kurumikz/Zerde-QA-50K")
print(ds["train"][0])
# {'question': '...', 'answer': '...'}
```

### Manually with Python

```python
import json

records = []
skipped = 0
# Adjust the file name to match the downloaded *.jsonl file.
with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # e.g. the known line holding two objects

print(f"Loaded {len(records)} records, skipped {skipped} malformed lines")
print(records[0])
```

### With pandas

```python
import pandas as pd

df = pd.read_json("data.jsonl", lines=True)
print(df.shape)                           # (51422, 2)
print(df["answer"].str.len().describe())  # length stats
```

---

## 🗃️ Related Datasets

| Dataset | Records | Scope | Link |
|---------|---------|-------|------|
| **Zerde-QA-50K** *(this)* | 51,422 | 20+ broad domains: IT, AI, science, humanities & more | n/a |
| **Question-Answering_Kazakh** | 1,424 | Kazakhstan-focused: history, geography, culture, language | [→ View on HF](https://huggingface.co/datasets/kurumikz/Question-Answering_Kazakh) |

**Question-Answering_Kazakh** is a compact, Kazakhstan-centric subset released separately. It is suited to domain-specific fine-tuning on Kazakh national knowledge.

---

## 📜 License

This dataset is released under the **[Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/)**.

You are free to share, use, and adapt this dataset for any purpose, including commercial use, **provided you give appropriate credit to the author.**

**Suggested attribution:**

> *"This work uses the Zerde-QA-50K dataset, created by **kurumikz**, available at huggingface.co/datasets/kurumikz/Zerde-QA-50K. Licensed under ODC-By 1.0."*

---

## 🙏 Citation

If you use Zerde-QA-50K in academic work, please cite it as:

```bibtex
@dataset{kurumikz2025zerdeqa50k,
  author    = {kurumikz},
  title     = {Zerde-QA-50K: A Large-Scale Synthetic Kazakh Question-Answer Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/kurumikz/Zerde-QA-50K},
  license   = {ODC-By 1.0}
}
```

---

## ✉️ Contact

For questions, feedback, or collaboration inquiries:

**kurumikz**: kurumikaz@gmail.com

---

*Made with ❤️ for the Kazakh NLP community.*
Provider:
kurumikz