afrizalha/Tumpeng-1-Indonesian
Source: Hugging Face. Updated 2024-06-08; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/afrizalha/Tumpeng-1-Indonesian
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- id
size_categories:
- 10K<n<100K
---
# Synthetic Indonesian dataset with Llama 3 70B
Tumpeng contains 48.6K input-output pairs (14.8M words) of Indonesian question answering. It is intended for fine-tuning Llama 3 8B, which has limited Indonesian language capabilities, so that it responds properly in Indonesian.
This is a research preview dataset and has not been curated for factual accuracy or safety. Use it at your own discretion.
# Out of scope use
- Commercial use
- Fine-tuning non-Llama 3 models
# Supported tasks
- General QA across a variety of domains
- Contextual QA (copy-paste a text and ask a question about the contents in the text)
- Multi-turn conversation (alpha; optimized for personal advice)
- Writing (outlines and full articles)
# Conversational format
This dataset contains experimental multi-turn conversations, with `<|user|>` and `<|assistant|>` as the user and LLM headers respectively. For proper formatting, please use the following template:
```
<|user|>
{prompt}
<|assistant|>
{response}
```
Alternatively, you can modify the strings in the dataset according to your intended format.
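As a minimal sketch of both options, the Python below renders turns into the template above and parses a templated string back into a list of messages that can then be re-rendered in another format. The function names `render` and `parse` and the `role`/`content` dictionary shape are assumptions made for illustration; they are not part of the dataset. The sketch also assumes each header sits directly before its turn, as in the template.

```python
import re

# Header tokens as described in this dataset card.
USER, ASSISTANT = "<|user|>", "<|assistant|>"
_HEADER = re.compile(r"(<\|user\|>|<\|assistant\|>)")


def render(turns):
    """Render (role, text) pairs into the card's template (hypothetical helper)."""
    headers = {"user": USER, "assistant": ASSISTANT}
    return "\n".join(f"{headers[role]}\n{text}" for role, text in turns)


def parse(conversation):
    """Split a templated string into role/content dicts (hypothetical helper)."""
    # re.split with a capturing group keeps the headers:
    # [prefix, header, body, header, body, ...]
    parts = _HEADER.split(conversation)
    messages = []
    for header, body in zip(parts[1::2], parts[2::2]):
        role = "user" if header == USER else "assistant"
        messages.append({"role": role, "content": body.strip()})
    return messages


example = render([
    ("user", "Apa itu tumpeng?"),
    ("assistant", "Tumpeng adalah nasi berbentuk kerucut."),
])
messages = parse(example)
```

Once a conversation is in the `messages` form, it can be passed to whatever chat template the target model expects, instead of editing the raw strings by hand.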
Provider:
afrizalha
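Summary of original information
Dataset overview
Basic information
- Name: Synthetic Indonesian dataset with Llama 3 70B
- Language: Indonesian (id)
- License: CC-BY-NC-4.0
- Size: 10K<n<100K
Data content
- Contains: 14.8M words, 48.6K input-output pairs
- Purpose: fine-tuning Llama 3 8B to improve its ability to respond in Indonesian
Data characteristics
- Type: research preview dataset, not curated for factual accuracy or safety
- Usage restrictions: non-commercial use only; fine-tuning Llama 3 models only
Supported tasks
- General-domain QA
- Contextual QA
- Multi-turn conversation (optimized for personal advice)
- Writing (outlines and full articles)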
Format
- Conversation format: contains experimental multi-turn conversations, using <|user|> and <|assistant|> as the user and LLM headers respectively
- Template:
  <|user|> {prompt}
  <|assistant|> {response}



