five

Sandroeth/math-reasoning-50k-id

收藏
Hugging Face2026-04-04 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/Sandroeth/math-reasoning-50k-id
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: mit language: - id tags: - math - reasoning - chain-of-thought - indonesian - arithmetic - education pretty_name: Math Reasoning 10K Indonesian size_categories: - 10K<n<100K task_categories: - question-answering - text-generation --- # Math Reasoning 50K - Indonesian (math-reasoning-50k-id) A synthetic Indonesian-language math reasoning dataset with **50,000 samples** designed to train language models on arithmetic problem solving with explicit chain-of-thought reasoning. --- ## Dataset Summary This dataset contains math problems written in natural **Bahasa Indonesia**, covering basic to advanced arithmetic operations. Each sample includes an internal reasoning trace (`reason`) that models how a student would think through the problem step-by-step before producing the final answer. The dataset follows a **3-field structure**: | Field | Role | |----------|----------------------------------------------------------------------| | `input` | Math problem in Bahasa Indonesia (natural language or story problem) | | `reason` | Internal chain-of-thought reasoning (not shown to end user) | | `output` | Final numeric answer | --- ## Dataset Statistics | Level | Count | Description | |-----------|--------|----------------------------------------------------------| | Easy | ~10,000 | Basic operations, numbers 1-100 | | Medium | ~12,500 | Mid-range numbers, mixed ops, percentages, squares | | Hard | ~12,500 | Large numbers, square roots, multi-step, complex mixed | | Very-hard | ~15,000 | Sin/cos/tan, log, exponents, derivatives, integrals, etc.| **Total: 50,000 samples** --- ## Fields ```json { "id": 1, "level": "mudah", "input": "Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi?", "reason": "Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19.", "output": "19" } ``` - **id** `int` - Unique sample index (1-50000) - **level** `str` - Difficulty level: `mudah` / `sedang` / `susah` / `Sangat-Susah` - **input** `str` - Problem statement in Bahasa Indonesia - **reason** `str` - Step-by-step internal reasoning (chain-of-thought) - **output** `str` - Correct numeric answer --- ## Operations Covered | Operation | Example | Levels | |-------------------|----------------------|-------------| | Addition | 12 + 7 = ? | Easy-Hard | | Subtraction | 85 - 34 = ? | Easy-Hard | | Multiplication | 9 x 8 = ? | Easy-Hard | | Division | 120 / 6 = ? | Easy-Hard | | Mixed (add x mul) | (5 + 3) x 4 = ? | Medium-Hard | | Mixed (sub x mul) | (10 - 3) x 5 = ? | Medium-Hard | | Squares | 7^2 = ? | Medium-Hard | | Square roots | sqrt(144) = ? | Hard | | Percentages | 25% dari 400 = ? | Medium-Hard | | Multi-step | 3 + 4 x 5 - 2 = ? | Hard | --- ## Usage ### Load with Hugging Face Datasets ```python from datasets import load_dataset ds = load_dataset("Sandroeth/math-reasoning-50k-id") print(ds["train"][0]) ``` ### Fine-tuning (Input -> Output, ignoring reason) ```python sample = ds["train"][0] prompt = sample["input"] label = sample["output"] ``` ### Fine-tuning with Chain-of-Thought (Input -> Reason + Output) ```python sample = ds["train"][0] prompt = sample["input"] target = sample["reason"] + " Jawaban akhir: " + sample["output"] ``` --- ## Recommended Training Format ``` ### Input: Andi memiliki 12 kelereng, lalu diberi 7 kelereng lagi. Berapa total kelereng Andi? ### Reason: Operasi penjumlahan. 12 + 7 = 19. Jawaban: 19. ### Output: 19 ``` --- ## Dataset Generation This dataset was synthetically generated using a Python script with: - Template-based natural language question generation (10+ variants per operation) - Contextual story problems (market, school, travel scenarios) - Deterministic reasoning traces constructed programmatically - Stratified sampling across difficulty levels (20% / 25% / 25% / 30%) --- ## License This dataset is released under the **MIT License**. Free to use for research, fine-tuning, and commercial applications. --- ## Citation If you use this dataset in your research or project, please cite: ```bibtex @dataset{sandroeth2025mathid, author = {Sandroeth}, title = {Math Reasoning 50K Indonesian: A Synthetic Arithmetic Dataset with Chain-of-Thought for Bahasa Indonesia}, year = {2026}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/Sandroeth/math-reasoning-10k-id}, note = {Synthetic dataset covering arithmetic operations with internal reasoning traces in Bahasa Indonesia} } ``` --- ## Contact Created by **Sandroeth** - Hugging Face: https://huggingface.co/Sandroeth Contributions, issues, and feedback are welcome.
提供机构:
Sandroeth
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作