AksaraLLM/aksara-sft-clean-v1
Source: Hugging Face. Updated 2026-04-23. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/AksaraLLM/aksara-sft-clean-v1
Description:
AksaraLLM SFT Clean v1 is a cleaned version of the AksaraLLM/aksara-sft-id dataset, produced by deduplicating and filtering the original rows. It is intended primarily for text generation tasks and draws on a range of sources, including synthetic wiki QA, TyDiQA Indonesian QA, Indonesian knowledge QA, sentiment analysis in 11 Indonesian regional languages, general tasks, and culture/history. Rows containing hallucinated outputs or factual errors have been excluded. Each row has four fields: instruction, output, source, and task_type. The data is divided into training and validation sets using a deterministic hash-based split. Future versions are planned to add data for reasoning, math, code, multi-turn dialogue, summarization, translation, safety refusal, and creative writing.
Provider:
AksaraLLM
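
The description mentions a deterministic hash-based split into training and validation sets but does not say how it is computed. The following is a minimal sketch of one common way to do this, not the dataset authors' actual procedure. The choice of SHA-256, hashing the instruction field, the 5% validation share, the function name assign_split, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def assign_split(instruction: str, val_percent: int = 5) -> str:
    """Assign a row to 'train' or 'validation' from a stable hash of its text.

    Hypothetical helper: the key field and the 5% default are assumptions,
    not values taken from the dataset card.
    """
    # SHA-256 is stable across runs and machines, unlike Python's built-in hash().
    digest = hashlib.sha256(instruction.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < val_percent else "train"


# Made-up placeholder rows using the four fields named in the description.
rows = [
    {"instruction": f"placeholder instruction {i}", "output": "placeholder output",
     "source": "placeholder", "task_type": "placeholder"}
    for i in range(1000)
]

splits = {"train": [], "validation": []}
for row in rows:
    splits[assign_split(row["instruction"])].append(row)

print(len(splits["train"]), len(splits["validation"]))
```

Because the assignment depends only on the row's text, the same row always lands in the same split regardless of row order or later additions to the dataset, which is the usual reason to prefer this over a seeded random shuffle.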



