
Thaweewat/alpaca-cleaned-52k-th

Hugging Face, updated 2023-05-09, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Thaweewat/alpaca-cleaned-52k-th
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
- summarization
tags:
- instruction-finetuning
language:
- th
size_categories:
- 10K<n<100K
---

# Summary

This is a Thai 🇹🇭 instruction dataset, translated with Google Cloud Translation from the cleaned version of the original Alpaca dataset released by Stanford. It contains 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. The data can be used for instruction tuning, so that language models follow instructions better.

The following issues were identified in the original release and fixed in this dataset:

1. **Hallucinations:** Many instructions in the original dataset referenced data on the internet, which just caused GPT-3 to hallucinate an answer.
2. **Merged instructions:** Many instructions in the original dataset were merged together for some reason.
3. **Empty outputs:** Some entries in the original dataset had empty outputs.
4. **Empty code examples:** Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.
5. **Instructions to generate images:** Some descriptions in the original dataset included instructions to generate images, which is not possible for a text-only model.
6. **N/A outputs:** Some code snippets in the original dataset had N/A outputs.
7. **Inconsistent input field:** The original dataset used the input field inconsistently when it was supposed to be empty.
8. **Wrong answers:** Some instructions and questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.
9. **Nonsensical or unclear instructions:** Many instructions were unclear. Nonsensical instructions were clarified or rewritten; instructions that are only slightly unclear, but whose meaning can be deduced, were not altered.
10. **Extraneous escape and control characters:** The original dataset had several entries with extraneous escape and control characters.

### Original Alpaca Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used for instruction tuning, so that language models follow instructions better.

The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, the authors also found the 52K generated examples to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).

Supported Tasks:

- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: Thai

Version: 1.0

---
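To make the instruction-tuning use case above concrete, here is a minimal sketch of turning one record into a prompt/completion pair. It assumes each record carries `instruction`, `input`, and `output` fields (the card mentions an input field and outputs but does not list the schema), and the two templates are modeled on the Stanford Alpaca prompt format rather than taken from this card. `build_example` is a hypothetical helper name.

```python
from typing import Dict

# Templates modeled on the Stanford Alpaca prompt format. Treat the exact
# wording as an assumption and adapt it to your own training code.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(record: Dict[str, str]) -> Dict[str, str]:
    """Turn one record into a prompt/completion pair for instruction tuning.

    The field names "instruction", "input", and "output" are assumed here;
    they are not confirmed by the dataset card.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    if context:
        # Records with extra context use the three-section template.
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        # An empty input field falls back to the two-section template.
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return {"prompt": prompt, "completion": record["output"].strip()}
```

With the Hugging Face `datasets` library installed, records could probably be obtained with `load_dataset("Thaweewat/alpaca-cleaned-52k-th")` and passed through `build_example` one at a time; the available split names are not stated in the card and should be checked after loading.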
Provider:
Thaweewat
Summary of original information

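Dataset overview

Basic information

  • License: cc-by-sa-3.0
  • Task categories:
    • Question answering
    • Summarization
  • Tags: instruction fine-tuning
  • Language: Thai
  • Dataset size: 10K<n<100K

Dataset content

  • Source: the cleaned version of the original Alpaca dataset released by Stanford, translated into Thai with Google Cloud Translation.
  • Generation: 52,000 instructions and demonstrations generated with OpenAI's text-davinci-003 engine.
  • Use: instruction tuning of language models, so that they follow instructions better.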

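Dataset improvements

  • Issues fixed:
    • Hallucinations in the original dataset were resolved.
    • Merged instructions were split apart.
    • Empty outputs were filled in.
    • Descriptions that lacked code examples were completed.
    • Infeasible image-generation instructions were removed.
    • N/A outputs were handled.
    • Inconsistent use of the input field was corrected.
    • Wrong answers were corrected.
    • Nonsensical or unclear instructions were clarified or rewritten.
    • Extraneous escape and control characters were removed.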
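
Several of the issues listed above (empty outputs, N/A outputs, inconsistent input fields, control characters) can be checked mechanically. The sketch below is one way to flag them in a record. It is not the cleaning procedure used for this dataset, which the source does not describe. The field names, the `find_issues` helper, and the set of "no input" placeholder strings are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Placeholder strings treated as "the input should have been empty".
# This set is an assumption, not taken from the dataset card.
EMPTY_INPUT_MARKERS = {"n/a", "none", "<noinput>", "no input"}


def find_issues(record: Dict[str, str]) -> List[str]:
    """Return labels for the mechanical problems found in one record."""
    issues: List[str] = []
    output = (record.get("output") or "").strip()
    input_text = (record.get("input") or "").strip()

    if not output:
        issues.append("empty_output")
    elif output.lower() == "n/a":
        issues.append("na_output")

    # A placeholder in the input field instead of an empty string.
    if input_text.lower() in EMPTY_INPUT_MARKERS:
        issues.append("inconsistent_input")

    # Stray control characters in any of the three assumed fields.
    fields = ("instruction", "input", "output")
    if any(CONTROL_CHARS.search(record.get(name) or "") for name in fields):
        issues.append("control_chars")

    return issues
```

Issues such as hallucinations, wrong answers, or unclear instructions need human or model review and are outside what a check like this can detect.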

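Dataset generation details

  • Generation engine: OpenAI's text-davinci-003
  • Generation strategy: a new prompt and more aggressive batch decoding (20 instructions per request), a simplified data generation pipeline, and a single instance per instruction instead of 2 to 3.
  • Cost: generating the 52K dataset cost less than $500.
  • Diversity: more diverse than the data released by Self-Instruct.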
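
The cost and batching figures above imply some simple bounds, worked through in the short sketch below. The only inputs are numbers stated in the source (52,000 examples, less than $500, 20 instructions per request). The helper names are hypothetical, and the request count assumes every generated instruction was kept, which the source does not state.

```python
TOTAL_EXAMPLES = 52_000    # dataset size stated in the source
COST_CEILING_USD = 500.0   # "less than $500", so this is an upper bound
BATCH_SIZE = 20            # instructions generated per request


def per_example_cost_bound(total_cost: float, n_examples: int) -> float:
    """Upper bound on the average generation cost per example."""
    return total_cost / n_examples


def min_requests(n_examples: int, batch_size: int) -> int:
    """Fewest batched requests needed if no generated instruction is discarded."""
    return -(-n_examples // batch_size)  # ceiling division


bound = per_example_cost_bound(COST_CEILING_USD, TOTAL_EXAMPLES)
print(f"average cost per example: under ${bound:.4f}")
print(f"instruction-generation requests: at least {min_requests(TOTAL_EXAMPLES, BATCH_SIZE)}")
```

In other words, the average cost is under one US cent per example, and batching 20 instructions per request cuts the number of instruction-generation calls to a few thousand instead of tens of thousands.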

Supported tasks

  • Training LLMs
  • Synthetic data generation
  • Data augmentation

Version

  • Version: 1.0