kurumikz/Zerde-QA-50K
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/kurumikz/Zerde-QA-50K
Resource description:
---
license: odc-by
language:
- kk
task_categories:
- question-answering
- text-generation
tags:
- kazakh
- synthetic
- instruction-tuning
- qa
- nlp
- central-asia
- llm-training
pretty_name: Zerde-QA-50K
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.jsonl"
---
# 🇰🇿 Zerde-QA-50K
> **A large-scale synthetic Kazakh question-answer dataset for instruction tuning and NLP research.**
> Created and maintained by kurumikz. Free to use with attribution.
---
## 📌 Overview
**Zerde-QA-50K** is a synthetically generated open-domain QA dataset written entirely in the **Kazakh language** (`kk`), consisting of **51,422 high-quality question-answer pairs** spanning 20+ academic and professional domains.
Each record follows a clean `{question, answer}` structure with no auxiliary fields. Questions are predominantly analytical, comparative, and explanatory in nature — going well beyond simple factual lookups. This makes the dataset especially suited for training instruction-following models that need to reason and explain, not just retrieve.
Kazakh is spoken by over 17 million people but remains significantly underrepresented in NLP benchmarks and training corpora. Zerde-QA-50K aims to close that gap by providing a large, topically diverse, and structurally consistent resource for the Kazakh NLP community.
> **Attribution required.** If you use this dataset in your work, you must credit the author as specified in the [License](#-license) section below.
---
## 📊 Dataset Statistics
| Metric | Value |
|--------|-------|
| Total records | **51,422** |
| File format | JSONL |
| Total characters | ~58.97 million |
| Cyrillic share | 84.9% |
| Kazakh-specific chars (ә, ғ, қ, ң, ө, ұ, ү, і, һ) | ~8.7M (14.75%) |
| Avg. question length | 207 chars / ~21 words |
| Avg. answer length | 940 chars / ~102 words |
| Answers in 500–2,000 char range | 97.5% |
| Duplicate records | 0 |
| Records with null/empty fields | 0 |
The high proportion of Kazakh-specific characters indicates genuine native-script text rather than transliterated or mixed-language content.
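
The script shares in the table can be re-derived with a few lines of Python. The sketch below is one possible way to count them, not necessarily how the figures above were produced: it assumes "Cyrillic" means the basic Unicode Cyrillic block (U+0400 to U+04FF) and that shares are taken over every character of the strings passed in. The function name `script_shares` is hypothetical.

```python
# Kazakh-specific Cyrillic letters, as listed in the statistics table.
KAZAKH_SPECIFIC = set("әғқңөұүіһ")


def script_shares(texts):
    """Return (cyrillic_share, kazakh_specific_share) over all characters."""
    total = cyrillic = kazakh = 0
    for text in texts:
        for ch in text:
            total += 1
            # Assumption: "Cyrillic" is the basic Unicode Cyrillic block.
            if "\u0400" <= ch <= "\u04ff":
                cyrillic += 1
            # Lowercasing lets the same set match capital letters too.
            if ch.lower() in KAZAKH_SPECIFIC:
                kazakh += 1
    if total == 0:
        return 0.0, 0.0
    return cyrillic / total, kazakh / total


cyr, kaz = script_shares(["Қазақ тілі"])
print(f"Cyrillic: {cyr:.1%}, Kazakh-specific: {kaz:.1%}")
```

To apply it to the dataset, pass the concatenated `question` and `answer` strings of the loaded records.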
### Field Length Stats (in characters)
| Field | Avg | Median | Min | Max |
|------------|-----|--------|-----|--------|
| `question` | 207 | ~190 | 20 | 800+ |
| `answer` | 940 | ~900 | 80 | 3,000+ |
---
## 🗂️ Format
The dataset is stored as **JSONL** (JSON Lines) — one JSON object per line.
Every record has exactly two fields:
```json
{
  "question": "Терең нейрондық желілердің классикалық машиналық оқыту әдістерінен негізгі айырмашылығы неде?",
  "answer": "Терең нейрондық желілер иерархиялық ерекшеліктерді автоматты түрде үйренеді — төменгі деңгейдегі қарапайым белгілерден бастап жоғары деңгейдегі күрделі абстракцияларға дейін. Классикалық әдістерде бұл ерекшеліктерді қолмен анықтау қажет болды..."
}
```
| Field | Type | Description |
|------------|--------|----------------------------------|
| `question` | string | A question in Kazakh (Cyrillic) |
| `answer` | string | A detailed explanation in Kazakh |
No null values. No empty strings. No duplicate records.
Encoding: **UTF-8**. No BOM.
---
## 📚 Domains
The dataset spans **20+ broad subject areas**, each with multiple subtopics:
### 💻 Technology & Engineering
- **Information Technology & System Architecture** — OS theory, distributed systems, cloud, IoT, quantum computing, embedded systems
- **Programming Languages & Software Engineering** — paradigms, algorithms, compilers, design patterns, testing, DevOps, CI/CD
- **Databases & Big Data** — SQL/NoSQL, ACID, indexing, data warehouses, stream processing, vector databases
- **Cybersecurity & Cryptography** — CIA triad, network security, penetration testing, zero trust, post-quantum crypto
### 🤖 AI & Data Science
- **Artificial Intelligence, ML & Data Science** — supervised/unsupervised/RL, deep learning, NLP, computer vision, MLOps, transformers, federated learning, AGI
### 🔬 Natural Sciences
- **Mathematics & Formal Logic** — number theory, topology, combinatorics, graph theory, mathematical proofs
- **Physics & Engineering** — thermodynamics, electromagnetism, quantum mechanics, aerodynamics, materials science
- **Chemistry & Biology** — organic chemistry, biochemistry, genetics, ecology, cell biology
### 🏥 Medicine & Health
- **Medicine & Public Health** — anatomy, pharmacology, epidemiology, neuroscience, medical ethics
### 📐 Social Sciences & Humanities
- **Economics & Finance** — macroeconomics, behavioral economics, monetary policy, financial markets
- **Law & Political Science** — constitutional law, international law, political systems, governance
- **History & Archaeology** — world history, civilizations, historiography, archaeological methods
- **Philosophy & Ethics** — epistemology, metaphysics, ethics, philosophy of science
- **Psychology & Cognitive Science** — developmental psychology, cognitive biases, neuroscience, social psychology
- **Sociology & Anthropology** — social structures, cultural anthropology, demography
- **Linguistics & Literature** — language families, morphology, literary analysis, semiotics
### 🎓 Applied & Interdisciplinary
- **Education & Pedagogy** — learning theories, curriculum design, assessment
- **Geography & Environmental Science** — physical geography, climate science, environmental policy
- **International Relations & Regional Studies** — geopolitics, diplomacy, international organizations
- **Culture, Art & Religion** — cultural studies, art history, comparative religion
> **Note on topic labels:** Records do not carry per-item topic tags. The domain list above reflects the intended generation scope; actual distribution per sample may vary.
---
## ⚙️ Generation Process
Pairs were generated synthetically using a large language model prompted entirely in Kazakh. The pipeline was built around five core principles:
**Broad topical coverage** — prompts were built from a curated pool of 400+ domain-specific subtopics organized into 20+ thematic categories, ensuring no single subject dominates the corpus.
**Diversity enforcement** — each generation batch was seeded with 2–3 randomly sampled subtopics, and a sliding window of recently generated questions was fed back to the model to actively discourage repetition across batches.
**Strict deduplication** — all records were deduplicated at the question level using prefix matching on the first 60 characters (case-normalized) before being written to disk.
**Structural validation** — only records containing both a non-empty `question` and `answer` were accepted. Malformed JSON and incomplete pairs were discarded automatically.
**Parallel generation with fault tolerance** — a multi-worker concurrent pipeline maximized throughput while respecting rate limits, with exponential back-off on failures and automatic retries.
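
The validation and deduplication steps above can be sketched as follows. This is a minimal illustration, not the author's pipeline code: the function names are hypothetical, and "case-normalized" is approximated here with `str.casefold()` after stripping whitespace. Records that fail validation are dropped, and the first record seen for each 60-character question prefix is kept.

```python
def is_valid_record(record):
    """Accept only dicts with non-empty string `question` and `answer`."""
    if not isinstance(record, dict):
        return False
    for key in ("question", "answer"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def dedup_by_question_prefix(records, prefix_len=60):
    """Keep the first valid record for each normalized question prefix."""
    seen = set()
    kept = []
    for record in records:
        if not is_valid_record(record):
            continue
        # Assumption: case normalization is approximated with casefold().
        key = record["question"].strip().casefold()[:prefix_len]
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```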
---
## 🔍 Data Quality
| Check | Result |
|-------|--------|
| JSON parse errors | ✅ 0 (apart from the one malformed line listed under Known Minor Issues) |
| Missing `question` or `answer` | ✅ 0 |
| Null / None values | ✅ 0 |
| Exact duplicate records | ✅ 0 |
| Duplicate questions | ✅ 0 |
**Character script breakdown** (~58.97M total chars):
| Script | Share |
|--------|-------|
| Cyrillic | 84.9% |
| Kazakh-specific letters (included in the Cyrillic share) | 14.75% |
| Latin (technical terms) | ~0.1% |
| Digits & punctuation | remainder |
### Known Minor Issues
| Issue | Count | Notes |
|-------|-------|-------|
| Malformed JSON line | 1 | Two JSON objects on one line near row 1424; strict per-line parsing with `json.loads` raises on it, so parse that line object by object or skip it |
| Questions without trailing `?` | 501 (~1%) | Many are imperative prompts ("Түсіндіріңіз…") — stylistically valid in Kazakh |
| Double spaces in answers | 2,519 | Cosmetic only; one normalization pass cleans these |
| Answers echoing the question | 3 | Copy-paste generation artifacts |
| Answers with Latin-script terms | 57 | Expected for technical vocabulary (API, ACID, OWASP, etc.) |
| Questions over 400 chars | 316 | Scenario-based or compound prompts, not errors |
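
The cosmetic issues in the table can be handled in one pass over the loaded records. The sketch below is a heuristic illustration with hypothetical helper names: it collapses runs of spaces in answers, flags answers that begin by repeating the question verbatim (one plausible reading of "echoing"), and flags questions without a trailing `?` so they can be reviewed rather than dropped.

```python
import re

MULTI_SPACE = re.compile(r" {2,}")


def clean_record(record):
    """Return a copy with runs of spaces in the answer collapsed to one."""
    cleaned = dict(record)
    cleaned["answer"] = MULTI_SPACE.sub(" ", record["answer"])
    return cleaned


def echoes_question(record):
    """Heuristic: the answer starts by repeating the question verbatim."""
    question = record["question"].strip().casefold()
    answer = record["answer"].strip().casefold()
    return bool(question) and answer.startswith(question)


def lacks_question_mark(record):
    """True when the question does not end with '?' (often an imperative)."""
    return not record["question"].rstrip().endswith("?")
```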
---
## 💡 Intended Use
This dataset is designed for:
- **Instruction fine-tuning** of causal language models (GPT-style) in Kazakh
- **Low-resource NLP research** — few-shot and zero-shot learning for underrepresented languages
- **Retrieval-Augmented Generation (RAG)** — as a broad Kazakh-language knowledge base
- **Evaluation** of Kazakh-language comprehension and generation capabilities
- **Pretraining data augmentation** for multilingual models
### Out-of-Scope Uses
- Direct deployment as a factual knowledge base without human verification — the dataset is synthetically generated and may contain inaccuracies.
- Use as a ground-truth benchmark for exact factual recall.
### Recommended Training Format
For instruction fine-tuning, wrap records like this:
```
### Сұрақ:
{question}
### Жауап:
{answer}
```
Or in chat/instruct format:
```
<|user|>
{question}
<|assistant|>
{answer}
```
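
Both templates can be applied with plain string formatting. The sketch below assumes records are dicts with `question` and `answer` keys; the constant and function names are hypothetical. The `<|user|>` and `<|assistant|>` markers are copied from the template above and should be replaced with whatever chat markers the target model's tokenizer actually defines. With 🤗 `datasets`, something like `ds.map(lambda r: {"text": format_record(r)})` would add the formatted text as a new column.

```python
INSTRUCTION_TEMPLATE = "### Сұрақ:\n{question}\n### Жауап:\n{answer}"
CHAT_TEMPLATE = "<|user|>\n{question}\n<|assistant|>\n{answer}"


def format_record(record, template=INSTRUCTION_TEMPLATE):
    """Render one {question, answer} record into a training string."""
    return template.format(
        question=record["question"],
        answer=record["answer"],
    )


example = {"question": "Сұрақ мәтіні?", "answer": "Жауап мәтіні."}
print(format_record(example))
print(format_record(example, CHAT_TEMPLATE))
```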
---
## ⚙️ Loading the Dataset
### With 🤗 Hugging Face `datasets`
```python
from datasets import load_dataset
ds = load_dataset("kurumikz/Zerde-QA-50K")
print(ds["train"][0])
# {'question': '...', 'answer': '...'}
```
### Manually with Python
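```python
import json

decoder = json.JSONDecoder()
records = []
with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        rest = line.strip()
        # raw_decode parses one object at a time, so a line holding two JSON
        # objects (see Known Minor Issues) does not raise as json.loads would.
        while rest:
            obj, end = decoder.raw_decode(rest)
            records.append(obj)
            rest = rest[end:].lstrip()

print(f"Loaded {len(records)} records")
print(records[0])
```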
### With pandas
```python
import pandas as pd
df = pd.read_json("data.jsonl", lines=True)
print(df.shape) # (51422, 2)
print(df["answer"].str.len().describe()) # length stats
```
---
## 🗃️ Related Datasets
| Dataset | Records | Scope | Link |
|---------|---------|-------|------|
| **Zerde-QA-50K** *(this)* | 51,422 | 20+ broad domains: IT, AI, science, humanities & more | — |
| **Question-Answering_Kazakh** | 1,424 | Kazakhstan-focused: history, geography, culture, language | [→ View on HF](https://huggingface.co/datasets/kurumikz/Question-Answering_Kazakh) |
**Question-Answering_Kazakh** is a compact, Kazakhstan-centric dataset released separately. It is suited to domain-specific fine-tuning on Kazakh national knowledge.
---
## 📜 License
This dataset is released under the **[Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/)**.
You are free to share, use, and adapt this dataset for any purpose — including commercial — **provided you give appropriate credit to the author.**
**Suggested attribution:**
> *"This work uses the Zerde-QA-50K dataset, created by **kurumikz**, available at huggingface.co/datasets/kurumikz/Zerde-QA-50K. Licensed under ODC-By 1.0."*
---
## 🙏 Citation
If you use Zerde-QA-50K in academic work, please cite it as:
```bibtex
@dataset{kurumikz2025zerdeqa50k,
author = {kurumikz},
title = {Zerde-QA-50K: A Large-Scale Synthetic Kazakh Question-Answer Dataset},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/kurumikz/Zerde-QA-50K},
license = {ODC-By 1.0}
}
```
---
## ✉️ Contact
For questions, feedback, or collaboration inquiries:
**kurumikz** — kurumikaz@gmail.com
---
*Made with ❤️ for the Kazakh NLP community.*
Provider: kurumikz



