
reelva/dayak-ngaju-sft

Hugging Face, updated 2026-03-30, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/reelva/dayak-ngaju-sft
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: source
    dtype: string
  - name: is_synthetic
    dtype: bool
  - name: target_word
    dtype: string
  - name: pivot
    dtype: string
  splits:
  - name: train
    num_examples: 42813
task_categories:
- translation
language:
- nij
- id
- en
tags:
- dayak-ngaju
- low-resource-language
- instruction-tuning
- alpaca
license: apache-2.0
size_categories:
- 10K<n<100K
---

# Dataset Card: Dayak Ngaju (NIJ) Trilingual SFT Dataset: The SOTA Solution for Authentic Preservation

## 💡 Summary & Impact

Current general-purpose LLMs (e.g., DeepSeek, Gemini, LLaMA) often fail to accurately translate Dayak Ngaju (NIJ). Due to a significant lack of high-quality digital training data, these models are prone to **linguistic hallucination**, frequently mixing the authentic Dayak Ngaju structure with neighboring languages (such as Banjar or Malay) or English/Indonesian grammatical rules.

**This dataset (NIJ-Translate SFT Dataset) is the essential, specialized solution to that systemic failure.** It provides approximately 42,800 instruction-following translation pairs curated specifically to address the failures in existing generalized models. This dataset enables the creation of translation agents that speak the **authentic, grammatically accurate, and standardized Dayak Ngaju language**.

## 🔄 Continuous Improvement

This dataset is an **actively maintained project**. We are committed to:

* **Periodic Updates:** Regularly adding new high-quality translation pairs.
* **Refinement:** Continuous cleaning and filtering of data to improve grammatical accuracy and eliminate noise/hallucinations.

## Dataset Structure

Each instance in the dataset represents a trilingual translation task, formatted in the Alpaca prompt style for maximum compatibility with causal language models (SLMs/LLMs under 3B parameters).

- `instruction`: The task directive (e.g., "Terjemahkan kalimat berikut dari bahasa Indonesia ke bahasa Dayak Ngaju.").
- `input`: The authentic source sentence.
- `output`: The accurate, pure, and standardized translated sentence.
- `is_synthetic`: Boolean indicating if the pair was synthetically augmented or directly extracted from source texts.
- `pivot`: The language pair direction (e.g., "dayak-indo", "english-dayak").

## 🏆 Authentic Data Sources

To ensure the output of your fine-tuned model does not regress to the common mixed-language hallucinations found in large models, this dataset was meticulously curated from foundational, authoritative academic resources.

1. **Tata Bahasa Dayak Ngaju (Kemdikbud):** Source of foundational morphology, syntactical rules, and grammatical standards.
   * Source: [Institution Repository of Ministry of Education and Culture](https://repositori.kemendikdasmen.go.id/3697/1/Tata%20Bahasa%20Dayak%20Ngaju%20%20%20236h.pdf)
2. **Analisis Diftong Bahasa Dayak Ngaju:** Fundamental for phonological standardization (e.g., the correct usage of the [ei] diphthong, such as *sungei*).
   * Source: [Jurnal Bitnet UMPR](https://journal.umpr.ac.id/index.php/bitnet/article/download/9831/5536)
3. **Kamus Pelajar Dayak Ngaju - Indonesia (English Pivot):** Standardized daily vocabulary, idiom core, and English-pivot translation bridges.
   * Source: [Academia.edu](https://www.academia.edu/36399788/Kamus_Pelajar_Dayak_Ngaju_Indonesia_Indonesia_Dayak_Ngaju)
4. **Sastra Lisan Dayak Ngaju:** Contextual sentences, oral literature, and historical idioms to provide depth beyond simple dictionary lookups.
   * Source: [Academia.edu](https://www.academia.edu/34243215/Sastra_lisan_dayak_ngaju)

## Intended Use

This dataset is optimized for training cultural preservation agents, trilingual chatbots, and high-fidelity linguistic tools.
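As a rough illustration of how the fields described under Dataset Structure might be consumed, the sketch below renders a record into an Alpaca-style prompt and filters records by `pivot` and `is_synthetic`. The template text is the commonly used Stanford Alpaca wording, which this card does not confirm as the exact format; the helper names `format_alpaca` and `select_records` and the placeholder rows are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Iterable, List, Optional

# Assumption: the widely used Stanford Alpaca prompt wording. The dataset card
# only says "Alpaca prompt style", so the exact template may differ.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_alpaca(record: Dict[str, object]) -> str:
    """Render one record (instruction/input/output fields) as a training prompt."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"],
        input=record["input"],
        output=record["output"],
    )


def select_records(
    records: Iterable[Dict[str, object]],
    pivot: Optional[str] = None,
    include_synthetic: bool = True,
) -> List[Dict[str, object]]:
    """Keep records matching a pivot direction, optionally dropping synthetic pairs."""
    selected = []
    for record in records:
        if pivot is not None and record["pivot"] != pivot:
            continue
        if not include_synthetic and record["is_synthetic"]:
            continue
        selected.append(record)
    return selected


# Hypothetical placeholder rows: the text values are stand-ins, not real data.
# The pivot labels are the two examples quoted in the card.
sample_records = [
    {
        "instruction": "<task directive>",
        "input": "<source sentence>",
        "output": "<translated sentence>",
        "is_synthetic": False,
        "pivot": "dayak-indo",
    },
    {
        "instruction": "<task directive>",
        "input": "<source sentence>",
        "output": "<translated sentence>",
        "is_synthetic": True,
        "pivot": "english-dayak",
    },
]

if __name__ == "__main__":
    # Print prompts for the non-synthetic rows only.
    for row in select_records(sample_records, include_synthetic=False):
        print(format_alpaca(row))
```

With the Hugging Face `datasets` library installed, the real rows could presumably be obtained with `load_dataset("reelva/dayak-ngaju-sft", split="train")` and passed through the same helpers, since each row is a dictionary keyed by the feature names listed in the card's metadata.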
Provider:
reelva