Sandroeth/math-reasoning-50k-id
Source: Hugging Face | Updated: 2026-04-04 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Sandroeth/math-reasoning-50k-id
Resource description:
---
license: mit
language:
- id
tags:
- math
- reasoning
- chain-of-thought
- indonesian
- arithmetic
- education
pretty_name: Math Reasoning 50K Indonesian
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- text-generation
---
# Math Reasoning 50K - Indonesian (math-reasoning-50k-id)
A synthetic Indonesian-language math reasoning dataset with **50,000 samples**
designed to train language models on arithmetic problem solving with
explicit chain-of-thought reasoning.
---
## Dataset Summary
This dataset contains math problems written in natural **Bahasa Indonesia**,
covering basic to advanced arithmetic operations. Each sample includes
an internal reasoning trace (`reason`) that models how a student would
think through the problem step-by-step before producing the final answer.
The dataset follows a **3-field structure**:
| Field | Role |
|----------|----------------------------------------------------------------------|
| `input` | Math problem in Bahasa Indonesia (natural language or story problem) |
| `reason` | Internal chain-of-thought reasoning (not shown to end user) |
| `output` | Final numeric answer |
---
## Dataset Statistics
| Level | Count | Description |
|-----------|--------|----------------------------------------------------------|
| Easy | ~10,000 | Basic operations, numbers 1-100 |
| Medium | ~12,500 | Mid-range numbers, mixed ops, percentages, squares |
| Hard | ~12,500 | Large numbers, square roots, multi-step, complex mixed |
| Very-hard | ~15,000 | Sin/cos/tan, log, exponents, derivatives, integrals, etc.|
**Total: 50,000 samples**
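The approximate counts in the table follow from the 20% / 25% / 25% / 30% split listed under Dataset Generation. A minimal sketch of that arithmetic, assuming the shares apply exactly to the 50,000 total (the names `TOTAL` and `SHARES` are illustrative):

```python
TOTAL = 50_000

# Percentage share per level, as listed under "Dataset Generation".
SHARES = {"Easy": 20, "Medium": 25, "Hard": 25, "Very-hard": 30}

# Integer arithmetic avoids floating-point rounding in the counts.
counts = {level: TOTAL * pct // 100 for level, pct in SHARES.items()}

print(counts)
# {'Easy': 10000, 'Medium': 12500, 'Hard': 12500, 'Very-hard': 15000}
print(sum(counts.values()) == TOTAL)  # True
```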
---
## Fields
```json
{
"id": 1,
"level": "mudah",
"input": "Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi?",
"reason": "Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19.",
"output": "19"
}
```
- **id** `int` - Unique sample index (1-50000)
- **level** `str` - Difficulty level: `mudah` / `sedang` / `susah` / `Sangat-Susah`
- **input** `str` - Problem statement in Bahasa Indonesia
- **reason** `str` - Step-by-step internal reasoning (chain-of-thought)
- **output** `str` - Correct numeric answer
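
The field list above can be turned into a quick sanity check before training. The helper below is a hypothetical sketch, not part of the dataset tooling; it assumes the four `level` values are spelled exactly as documented above.

```python
from typing import List

# Expected Python type per field, taken from the field list above.
EXPECTED_TYPES = {"id": int, "level": str, "input": str, "reason": str, "output": str}
# Level labels as documented; the exact spelling is assumed to match the data.
LEVELS = {"mudah", "sedang", "susah", "Sangat-Susah"}

def validate_record(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    level = record.get("level")
    if isinstance(level, str) and level not in LEVELS:
        problems.append(f"unknown level: {level}")
    return problems

sample = {
    "id": 1,
    "level": "mudah",
    "input": "Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi?",
    "reason": "Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19.",
    "output": "19",
}
print(validate_record(sample))  # []
```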
---
## Operations Covered
| Operation | Example | Levels |
|-------------------|----------------------|-------------|
| Addition | 12 + 7 = ? | Easy-Hard |
| Subtraction | 85 - 34 = ? | Easy-Hard |
| Multiplication | 9 x 8 = ? | Easy-Hard |
| Division | 120 / 6 = ? | Easy-Hard |
| Mixed (add x mul) | (5 + 3) x 4 = ? | Medium-Hard |
| Mixed (sub x mul) | (10 - 3) x 5 = ? | Medium-Hard |
| Squares | 7^2 = ? | Medium-Hard |
| Square roots | sqrt(144) = ? | Hard |
| Percentages | 25% dari 400 = ? | Medium-Hard |
| Multi-step | 3 + 4 x 5 - 2 = ? | Hard |
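
For reference, the example expressions in the table evaluate as follows under standard operator precedence. This is only a worked check of the examples, not code from the generation script; `x` is read as multiplication and `25% dari 400` as 25 percent of 400.

```python
import math

# Each table example, evaluated with standard operator precedence.
results = {
    "12 + 7": 12 + 7,
    "85 - 34": 85 - 34,
    "9 x 8": 9 * 8,
    "120 / 6": 120 // 6,
    "(5 + 3) x 4": (5 + 3) * 4,
    "(10 - 3) x 5": (10 - 3) * 5,
    "7^2": 7 ** 2,
    "sqrt(144)": math.isqrt(144),
    "25% dari 400": 25 * 400 // 100,
    "3 + 4 x 5 - 2": 3 + 4 * 5 - 2,  # multiplication binds first: 3 + 20 - 2
}

for expression, value in results.items():
    print(f"{expression} = {value}")
```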
---
## Usage
### Load with Hugging Face Datasets
```python
from datasets import load_dataset
ds = load_dataset("Sandroeth/math-reasoning-50k-id")
print(ds["train"][0])
```
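Once loaded, samples can be grouped or filtered by the `level` field. The sketch below works on a plain list of dictionaries so it runs without downloading anything; the second record is made up for illustration and is not taken from the dataset. With a loaded split, `Dataset.filter` accepts a similar per-row predicate.

```python
from collections import Counter

# Records in the documented shape. The first is the card's own example;
# the second is a made-up illustration, not an actual dataset row.
records = [
    {"id": 1, "level": "mudah",
     "input": "Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi?",
     "reason": "Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19.", "output": "19"},
    {"id": 2, "level": "sedang",
     "input": "Berapa 25% dari 400?",
     "reason": "Operasi persentase. 25% dari 400 = 100. Jawaban: 100.", "output": "100"},
]

def filter_by_level(rows, level):
    """Keep only the rows whose `level` field equals the given label."""
    return [row for row in rows if row["level"] == level]

def level_counts(rows):
    """Count how many rows fall into each difficulty level."""
    return Counter(row["level"] for row in rows)

easy = filter_by_level(records, "mudah")
print(len(easy))              # 1
print(level_counts(records))  # Counter({'mudah': 1, 'sedang': 1})
```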
### Fine-tuning (Input -> Output, ignoring reason)
```python
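# `ds` is the dataset loaded in the previous snippet.
sample = ds["train"][0]
prompt = sample["input"]   # problem statement given to the model
label = sample["output"]   # final numeric answer used as the target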
```
### Fine-tuning with Chain-of-Thought (Input -> Reason + Output)
```python
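# `ds` is the dataset loaded in the first snippet.
sample = ds["train"][0]
prompt = sample["input"]
# The target is the reasoning trace followed by an explicit final-answer marker.
# Note: the example `reason` already ends with "Jawaban: ...", so the answer appears twice.
target = sample["reason"] + " Jawaban akhir: " + sample["output"]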
```
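When a model is trained on targets of this shape, its generations can be scored by reading the text after the `Jawaban akhir:` marker and comparing it with `output`. The functions below are a hypothetical sketch of such an exact-match check; they assume the model reproduces the marker verbatim.

```python
from typing import Optional

MARKER = "Jawaban akhir:"

def extract_final_answer(generated: str) -> Optional[str]:
    """Return the text after the last marker, or None if the marker is absent."""
    if MARKER not in generated:
        return None
    return generated.rsplit(MARKER, 1)[1].strip()

def is_correct(generated: str, expected: str) -> bool:
    """Exact string match between the extracted answer and the expected output."""
    answer = extract_final_answer(generated)
    return answer is not None and answer == expected.strip()

# Target built the same way as in the snippet above, using the card's example.
target = "Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19." + " Jawaban akhir: " + "19"
print(extract_final_answer(target))  # 19
print(is_correct(target, "19"))      # True
```

Exact string matching is strict: `"19.0"` would not match `"19"`, so a numeric comparison may be preferable for non-integer answers.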
---
## Recommended Training Format
```
### Input:
Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi?
### Reason:
Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19.
### Output:
19
```
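The layout above can be produced from a record with a small formatter. This is a minimal sketch; the function name `format_sample` and the `include_target` flag are illustrative choices, with the flag leaving the reasoning and answer open for inference-time prompts.

```python
def format_sample(sample: dict, include_target: bool = True) -> str:
    """Render a record in the '### Input / ### Reason / ### Output' layout."""
    text = f"### Input:\n{sample['input']}\n### Reason:\n"
    if include_target:
        # Training text: append the reasoning trace and the final answer.
        text += f"{sample['reason']}\n### Output:\n{sample['output']}"
    return text

sample = {
    "input": "Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi?",
    "reason": "Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19.",
    "output": "19",
}
print(format_sample(sample))
```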
---
## Dataset Generation
This dataset was synthetically generated using a Python script with:
- Template-based natural language question generation (10+ variants per operation)
- Contextual story problems (market, school, travel scenarios)
- Deterministic reasoning traces constructed programmatically
- Stratified sampling across difficulty levels (20% / 25% / 25% / 30%)
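
The generation script itself is not included in this card. As a rough illustration of the approach listed above, the sketch below builds one easy-level addition sample from a template and constructs the reasoning trace programmatically. The templates, names, and function are hypothetical stand-ins, not the author's actual code.

```python
import random

# Hypothetical templates; the real script's templates are not published here.
ADDITION_TEMPLATES = [
    "Berapa hasil dari {a} + {b}?",
    "{name} memiliki {a} kelereng, lalu diberi {b} kelereng lagi. Berapa total kelereng {name}?",
]
NAMES = ["Andi", "Budi", "Siti"]

def make_addition_sample(sample_id: int, rng: random.Random) -> dict:
    """Build one easy-level addition record with a deterministic reasoning trace."""
    a, b = rng.randint(1, 100), rng.randint(1, 100)  # easy level: numbers 1-100
    template = rng.choice(ADDITION_TEMPLATES)
    question = template.format(a=a, b=b, name=rng.choice(NAMES))
    total = a + b
    return {
        "id": sample_id,
        "level": "mudah",
        "input": question,
        # The trace is derived from the operands, so it always agrees with the answer.
        "reason": f"Operasi penjumlahan. {a} + {b} = {total}. Jawaban: {total}.",
        "output": str(total),
    }

# A seeded generator makes the output reproducible across runs.
sample = make_addition_sample(1, random.Random(0))
print(sample["input"])
print(sample["reason"])
```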
---
## License
This dataset is released under the **MIT License**.
Free to use for research, fine-tuning, and commercial applications.
---
## Citation
If you use this dataset in your research or project, please cite:
```bibtex
@dataset{sandroeth2025mathid,
author = {Sandroeth},
title = {Math Reasoning 50K Indonesian: A Synthetic Arithmetic Dataset with Chain-of-Thought for Bahasa Indonesia},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Sandroeth/math-reasoning-50k-id},
note = {Synthetic dataset covering arithmetic operations with internal reasoning traces in Bahasa Indonesia}
}
```
---
## Contact
Created by **Sandroeth** - Hugging Face: https://huggingface.co/Sandroeth
Contributions, issues, and feedback are welcome.
Provider:
Sandroeth



