AksaraLLM/aksara-sft-clean-v6

Name: AksaraLLM/aksara-sft-clean-v6
Creator: AksaraLLM
Published: 2026-04-23 21:37:17
License: Apache 2.0

Hugging Face, updated 2026-04-23, indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/AksaraLLM/aksara-sft-clean-v6

Description:

AksaraLLM SFT Clean v6 is an Indonesian supervised fine-tuning (SFT) dataset distilled from Gemini 2.5 Flash Lite via Google Vertex AI and filtered through strict quality gates. It contains 17,850 items, split into a training set (16,752 items) and a validation set (1,098 items). Task types include factual_qa, creative, cultural, reasoning, and how_to. Items are generated from curated Indonesian topic seeds and task templates, then passed through quality checks (for example valid JSON format, instruction length, and output character count) and deduplication. Each record has the fields instruction, output, task_type, topic, source, seed_id, and teacher_reasoning. Compared to v1, v6 improves quality control and distribution, and mixing v1 and v6 is recommended for better training results. The dataset is licensed under Apache 2.0, and its outputs are synthetic content generated by Gemini.
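The quality checks and deduplication mentioned in the description could look roughly like the sketch below. It is only an illustration, not the pipeline AksaraLLM actually ran: the description names the checks but not their values, so the thresholds, the helper names (parse_if_valid, dedup_key, filter_and_dedupe), and the exact-match deduplication strategy are all assumptions. Only the field names and task types come from the description.

```python
import json

# Task types as listed in the dataset description.
TASK_TYPES = {"factual_qa", "creative", "cultural", "reasoning", "how_to"}

# Hypothetical thresholds: the description names the checks but not their values.
MIN_INSTRUCTION_CHARS = 10
MAX_INSTRUCTION_CHARS = 2000
MIN_OUTPUT_CHARS = 50


def parse_if_valid(raw_line):
    """Return the parsed record if it passes every gate, else None."""
    try:
        record = json.loads(raw_line)  # gate 1: valid JSON
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    instruction = record.get("instruction")
    output = record.get("output")
    task_type = record.get("task_type")
    if not isinstance(instruction, str) or not isinstance(output, str):
        return None
    # gate 2: instruction length within the assumed bounds
    if not MIN_INSTRUCTION_CHARS <= len(instruction.strip()) <= MAX_INSTRUCTION_CHARS:
        return None
    # gate 3: output character count above the assumed minimum
    if len(output.strip()) < MIN_OUTPUT_CHARS:
        return None
    # gate 4: task type is one of the listed values
    if not isinstance(task_type, str) or task_type not in TASK_TYPES:
        return None
    return record


def dedup_key(instruction):
    """Case-insensitive, whitespace-normalised key for exact-match dedup."""
    return " ".join(instruction.lower().split())


def filter_and_dedupe(raw_lines):
    """Keep the first record per normalised instruction that passes the gates."""
    seen, kept = set(), []
    for raw_line in raw_lines:
        record = parse_if_valid(raw_line)
        if record is None:
            continue
        key = dedup_key(record["instruction"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

Here raw_lines would be the lines of a JSONL file of generated items, one JSON object per line. Exact-match deduplication on a normalised instruction is the simplest possible choice; whether the published dataset also used near-duplicate detection is not stated in the description.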

Provider:

AksaraLLM
