
afrizalha/Tumpeng-1-Indonesian

Hugging Face · Updated 2024-06-08 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/afrizalha/Tumpeng-1-Indonesian
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- id
size_categories:
- 10K<n<100K
---

# Synthetic Indonesian dataset with Llama 3 70B

Tumpeng contains 48.6K input-output pairs (14.8M words) of Indonesian question answering. It is intended for fine-tuning Llama 3 8B, which has limited Indonesian language capabilities, so that it responds properly in Indonesian.

It is a research preview dataset and is not curated for factual accuracy or safety. Use this dataset at your discretion.

# Out of scope use

- Commercial use
- Fine-tuning non-Llama 3 models

# Supported tasks

- General QA across a variety of domains
- Contextual QA (copy-paste a text and ask a question about its contents)
- Multi-turn conversation (alpha; optimized for personal advice)
- Writing (outlines and full articles)

# Conversational format

This dataset contains experimental multi-turn conversations, with <|user|> and <|assistant|> as the user and LLM headers respectively. For proper formatting, please use the following template:

```
<|user|> {prompt}
<|assistant|> {response}
```

Alternatively, you can modify the strings in the dataset according to your intended format.
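As a rough illustration of modifying the strings for another format, the sketch below splits a conversation on the `<|user|>` and `<|assistant|>` headers and re-renders it with different headers. The function names, the sample text, and the replacement headers are hypothetical, and the exact whitespace around the headers in the dataset is an assumption; check a few rows before relying on it.

```python
import re

# The two headers named in the dataset card.
USER, ASSISTANT = "<|user|>", "<|assistant|>"
_HEADER = re.compile(r"(<\|user\|>|<\|assistant\|>)")


def parse_turns(text):
    """Split a conversation string into [{'role': ..., 'content': ...}, ...].

    Any text before the first header is ignored (an assumption of this sketch).
    """
    # With a capturing group, re.split returns
    # [preamble, header, body, header, body, ...].
    parts = _HEADER.split(text)
    messages = []
    for header, body in zip(parts[1::2], parts[2::2]):
        role = "user" if header == USER else "assistant"
        messages.append({"role": role, "content": body.strip()})
    return messages


def render(messages, headers):
    """Re-render messages with caller-supplied headers, one turn per block."""
    return "\n".join(f"{headers[m['role']]}\n{m['content']}" for m in messages)


# Hypothetical sample; real rows may be laid out differently.
sample = "<|user|>\nHalo\n<|assistant|>\nHalo juga\n<|user|>\nApa kabar?"
messages = parse_turns(sample)
converted = render(messages, {"user": "USER:", "assistant": "BOT:"})
```

The intermediate list of role/content dictionaries is a common shape for chat data, so it can also be passed to whatever templating step the target model expects instead of `render`.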
Provided by:
afrizalha
Summary of original information

Dataset overview

Basic information

  • Name: Synthetic Indonesian dataset with Llama 3 70B
  • Language: Indonesian (id)
  • License: CC-BY-NC-4.0
  • Size: 10K<n<100K

Data content

  • Contains: 14.8M words, 48.6K input-output pairs
  • Purpose: fine-tuning Llama 3 8B to improve its ability to respond in Indonesian

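Data characteristics

  • Type: research preview dataset, not curated for factual accuracy or safety
  • Usage restrictions: non-commercial use only; fine-tuning Llama 3 models only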

Supported tasks

  • General-domain QA
  • Contextual QA
  • Multi-turn conversation (optimized for personal advice)
  • Writing (outlines and full articles)

Format

  • Conversation format: contains experimental multi-turn conversations, with <|user|> and <|assistant|> as the user and LLM headers respectively

  • Template:

    <|user|> {prompt}

    <|assistant|> {response}
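
A minimal sketch of filling this template from prompt/response pairs is shown below. The helper names and sample strings are hypothetical, the placeholder names in the code are illustrative, and the single space after each header plus the newline between turns are assumptions rather than something the dataset card specifies.

```python
# Template layout assumed from the listing above: header, space, content.
TEMPLATE = "<|user|> {prompt}\n<|assistant|> {response}"


def format_pair(prompt, response):
    """Fill the template for a single prompt/response exchange."""
    return TEMPLATE.format(prompt=prompt.strip(), response=response.strip())


def format_conversation(pairs):
    """Join several (prompt, response) exchanges into one multi-turn string."""
    return "\n".join(format_pair(p, r) for p, r in pairs)


# Hypothetical two-exchange conversation.
text = format_conversation([
    ("Halo", "Halo juga"),
    ("Apa kabar?", "Baik, terima kasih."),
])
```

Stripping the inputs keeps stray whitespace from accumulating around the headers when exchanges are concatenated.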
