reelva/dayak-ngaju-sft
Source: Hugging Face | Updated: 2026-03-30 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/reelva/dayak-ngaju-sft
Description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: source
    dtype: string
  - name: is_synthetic
    dtype: bool
  - name: target_word
    dtype: string
  - name: pivot
    dtype: string
  splits:
  - name: train
    num_examples: 42813
task_categories:
- translation
language:
- nij
- id
- en
tags:
- dayak-ngaju
- low-resource-language
- instruction-tuning
- alpaca
license: apache-2.0
size_categories:
- 10K<n<100K
---
# Dataset Card: Dayak Ngaju (NIJ) Trilingual SFT Dataset: The SOTA Solution for Authentic Preservation
## 💡 Summary & Impact
Current general-purpose LLMs (e.g., DeepSeek, Gemini, LLaMA) often fail to translate Dayak Ngaju (NIJ) accurately. Because high-quality digital training data for the language is scarce, these models are prone to **linguistic hallucination**: they frequently blend Dayak Ngaju structure with that of neighboring languages (such as Banjar or Malay) or impose English or Indonesian grammatical rules on it.
**This dataset (NIJ-Translate SFT Dataset) is the essential, specialized solution to that systemic failure.**
It provides approximately 42,800 instruction-following translation pairs (42,813 examples in the `train` split), curated specifically to address the failures of existing general-purpose models. The dataset is intended to support translation agents that produce **authentic, grammatically accurate, and standardized Dayak Ngaju**.
## 🔄 Continuous Improvement
This dataset is an **actively maintained project**. We are committed to:
* **Periodic Updates:** Regularly adding new high-quality translation pairs.
* **Refinement:** Continuous cleaning and filtering of data to improve grammatical accuracy and eliminate noise/hallucinations.
## Dataset Structure
Each instance in the dataset represents one translation task drawn from the trilingual set (Dayak Ngaju, Indonesian, English), formatted in the Alpaca prompt style for compatibility with causal language models (SLMs and LLMs under 3B parameters).
- `instruction`: The task directive (e.g., "Terjemahkan kalimat berikut dari bahasa Indonesia ke bahasa Dayak Ngaju.").
- `input`: The authentic source sentence.
- `output`: The accurate, pure, and standardized translated sentence.
- `is_synthetic`: Boolean indicating whether the pair was synthetically augmented (`true`) or directly extracted from source texts (`false`).
- `pivot`: The language pair direction (e.g., "dayak-indo", "english-dayak").
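
The fields above map directly onto an Alpaca-style prompt. The following is a minimal sketch, assuming the widely used Stanford Alpaca template wording; the card does not state which exact template the authors had in mind, and the helper names `build_prompt` and `build_training_text`, the `eos_token` default, and the placeholder record are hypothetical.

```python
# Assumption: the standard Stanford Alpaca templates. The dataset card only
# says "Alpaca prompt style", so the exact wording may differ.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Render everything that precedes the model's answer."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def build_training_text(record: dict, eos_token: str = "</s>") -> str:
    """Prompt plus target translation plus an end-of-sequence marker.

    The eos_token default is a placeholder; use the tokenizer's own token.
    """
    return build_prompt(record) + record["output"] + eos_token


# Hypothetical record: the instruction is the example quoted in the card,
# while input and output are placeholders rather than real dataset rows.
example = {
    "instruction": "Terjemahkan kalimat berikut dari bahasa Indonesia ke bahasa Dayak Ngaju.",
    "input": "<Indonesian source sentence>",
    "output": "<Dayak Ngaju translation>",
}
print(build_training_text(example))
```

Keeping the prompt and the full training text as separate functions lets the same `build_prompt` be reused at inference time, where the response is left for the model to generate.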
## 🏆 Authentic Data Sources
To ensure the output of your fine-tuned model does not regress to the common mixed-language hallucinations found in large models, this dataset was meticulously curated from foundational, authoritative academic resources.
1. **Tata Bahasa Dayak Ngaju (Kemdikbud):** Source of foundational morphology, syntactic rules, and grammatical standards.
   * Source: [Institutional Repository of the Ministry of Education and Culture](https://repositori.kemendikdasmen.go.id/3697/1/Tata%20Bahasa%20Dayak%20Ngaju%20%20%20236h.pdf)
2. **Analisis Diftong Bahasa Dayak Ngaju:** Fundamental for phonological standardization (e.g., the correct usage of the [ei] diphthong, as in *sungei*).
   * Source: [Jurnal Bitnet UMPR](https://journal.umpr.ac.id/index.php/bitnet/article/download/9831/5536)
3. **Kamus Pelajar Dayak Ngaju - Indonesia (English Pivot):** Standardized daily vocabulary, core idioms, and English-pivot translation bridges.
   * Source: [Academia.edu](https://www.academia.edu/36399788/Kamus_Pelajar_Dayak_Ngaju_Indonesia_Indonesia_Dayak_Ngaju)
4. **Sastra Lisan Dayak Ngaju:** Contextual sentences, oral literature, and historical idioms to provide depth beyond simple dictionary lookups.
   * Source: [Academia.edu](https://www.academia.edu/34243215/Sastra_lisan_dayak_ngaju)
## Intended Use
This dataset is optimized for training cultural preservation agents, trilingual chatbots, and high-fidelity linguistic tools.
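
When preparing data for one of these uses, it may help to restrict training to a single translation direction or to pairs extracted directly from the source texts. The sketch below filters rows on the `pivot` and `is_synthetic` fields; the function name `select_pairs` and the sample rows are hypothetical, and only the two `pivot` values quoted in this card are used.

```python
from typing import Iterable, Iterator, Optional


def select_pairs(
    records: Iterable[dict],
    pivot: Optional[str] = None,
    include_synthetic: bool = True,
) -> Iterator[dict]:
    """Yield rows matching a direction and, optionally, only extracted pairs."""
    for record in records:
        # Skip rows from other translation directions when one is requested.
        if pivot is not None and record.get("pivot") != pivot:
            continue
        # Skip synthetically augmented rows when they are excluded.
        if not include_synthetic and record.get("is_synthetic"):
            continue
        yield record


# Placeholder rows for illustration only; these are not real dataset entries.
rows = [
    {"pivot": "dayak-indo", "is_synthetic": False, "output": "<row A>"},
    {"pivot": "dayak-indo", "is_synthetic": True, "output": "<row B>"},
    {"pivot": "english-dayak", "is_synthetic": False, "output": "<row C>"},
]

extracted_only = list(select_pairs(rows, pivot="dayak-indo", include_synthetic=False))
print(len(extracted_only))
```

Because the function accepts any iterable of dictionaries, rows obtained with the Hugging Face `datasets` library, for example from `load_dataset("reelva/dayak-ngaju-sft", split="train")`, could be passed in the same way.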
Provider:
reelva



