
abhishekbisaria/Truth

Hugging Face | Updated 2024-03-29 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/abhishekbisaria/Truth
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- instruction-finetuning
pretty_name: Alpaca-Cleaned
task_categories:
- text-generation
---

# Dataset Card for Alpaca-Cleaned

- **Repository:** https://github.com/gururise/AlpacaDataCleaned

## Dataset Description

This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

1. **Hallucinations:** Many instructions in the original dataset referenced data on the internet, which caused GPT3 to hallucinate an answer.

   ```
   "instruction":"Summarize the given article in 200 Words.",
   "input": "https://www.bbc.com/news/world-51461830",
   "output": "The recent protests in Belarus are part of a growing wave of discontent against the rule of Alexander Lukashenko, who has led the country since 1994. Demonstrations against his government have been ongoing since August, and come in the wake of an election widely seen as rigged. As police and security forces have cracked down on protesters, the movement has only grown, with hundreds of thousands of people taking to the streets to call for new leadership. Human rights activists have accused the government of using extreme violence in order prosecute protesters, leading to an international outcry and sanctions from the European Union. As the situation in Belarus continues to unfold, the international community is monitoring the situation closely - hoping for a peaceful resolution, while also standing firm against the oppressive tactics employed by the Belarusian state."

   "instruction": "Classify the following shape according to simple geometry.",
   "input": "<Picture Attached>",
   "output": "Triangle"

   "instruction": "Given a recent news article, write a headline for it.",
   "input": "https://www.news.example.com/article-1",
   "output": "\"NASA Launches a Mars 2020 Mission to Search for Signs of Ancient Life\""
   ```

2. **Merged Instructions:** Many instructions were merged together in the original dataset for some reason.

   ```
   "instruction": "Should I invest in stocks?",
   "input": "",
   "output": "It depends on your individual situation. Investing in stocks carries a certain amount of risk but can also offer potential rewards. Make sure to research the different types of stocks, understand the risks and rewards, and develop a plan that works for you.\n##\n18. Instruction: Identify the noun in the following sentence:\n18. Input:\nThe large apple was delicious.\n18. Output:\nApple"
   ```

3. **Empty outputs:** Some entries in the original dataset had empty outputs.

   ```
   "instruction": "Calculate the area of the following shape in square centimeters.",
   "input": "rectangle of size 4 cm x 5 cm",
   "output": ""
   ```

4. **Empty code examples:** Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.

5. **Instructions to generate images:** Some descriptions in the original dataset included instructions to generate images, which is obviously not possible.

   ```
   "instruction": "Create a graphic or logo that visually represents the word \"courage\".",
   "input": "",
   "output": "<No Output>"
   ```

6. **N/A outputs:** Some code snippets in the original dataset had N/A outputs.

7. **Inconsistent input field:** The original dataset used the input field inconsistently when it was supposed to be empty.

   ```
   "input":"<no input>"
   "input":"No input"
   "input":"noinput"
   "input":"<noinput>"
   ```

8. **Wrong answers:** Some instructions/questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.

   ```
   "instruction": "Calculate the median of the following data set.",
   "input": "1, 2, 4, 5, 8, 9",
   "output": "5"

   "instruction": "Convert 25m to km.",
   "input": "",
   "output": "25km"
   ```

9. **Non-Sensical/Unclear instructions:** Many instructions are unclear; we try to clarify (or rewrite) instructions that are nonsensical. Instructions that are slightly unclear, but whose meaning can be deduced, are not altered.

   ```
   "instruction": "Freeze the following sample of yogurt for 10 minutes.",
   "input": "Yogurt sample",
   "output": "<noinput>"

   "instruction": "Increase the font size to 12 points.",
   "input": "",
   "output": "The font size has been increased to 12 points."
   ```

10. **Extraneous escape and control characters:** The original dataset had several entries with extraneous escape and control characters.

### Original Alpaca Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.

The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500).
In a preliminary study, the authors also found the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).

### Supported Tasks and Leaderboards

The Alpaca dataset is designed for instruction training of pretrained language models.

### Languages

The data in Alpaca are in English (BCP-47 en).

## Dataset Structure

### Data Instances

An example of "train" looks as follows:

```json
{
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
    "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 52K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
* `output`: the answer to the instruction as generated by `text-davinci-003`.
* `text`: the `instruction`, `input` and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors for fine-tuning their models.
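
The relationship between the `text` field and the other three fields can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the with-input template is copied from the example instance above, while the no-input template and the names `PROMPT_WITH_INPUT`, `PROMPT_NO_INPUT`, and `build_text` are assumptions made for this sketch.

```python
# Template copied from the `text` field of the example instance in this card.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

# Assumption: wording used when `input` is empty; it is not shown in this card.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_text(record: dict) -> str:
    """Format instruction, input, and output into a single `text` string."""
    has_input = bool(record.get("input", "").strip())
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
    )
    return prompt + record["output"]


example = {
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
}
print(build_text(example))
```

Keeping the templates as module-level constants makes it easy to compare the generated string with the stored `text` field when checking a record.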
### Data Splits

|               | train |
|---------------|------:|
| alpaca        | 52002 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

Excerpt from the [blog post](https://crfm.stanford.edu/2023/03/13/alpaca.html) accompanying the release of this dataset:

> We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models. At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release. Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions.
> Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies. Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement. We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The `alpaca` data is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is available under the [Creative Commons NonCommercial (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode).

### Citation Information

```
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

### Contributions

[More Information Needed]
Provided by:
abhishekbisaria
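Summary of original information

Dataset card: Alpaca-Cleaned

Dataset description

Alpaca-Cleaned is a cleaned version of the original Alpaca dataset released by Stanford. It addresses the following issues in the original release:

  1. Hallucinations: many instructions in the original dataset referenced data on the internet, which caused GPT3 to hallucinate answers.
  2. Merged instructions: many instructions in the original dataset were merged together.
  3. Empty outputs: some entries in the original dataset had empty outputs.
  4. Missing code examples: some descriptions in the original dataset lacked code examples, making the intended behavior of the code hard to understand.
  5. Instructions to generate images: the original dataset included instructions to generate images, which is obviously not possible.
  6. N/A outputs: some code snippets in the original dataset had N/A outputs.
  7. Inconsistent input field: the original dataset used the input field inconsistently when it was supposed to be empty.
  8. Wrong answers: some instructions/questions in the original dataset had incorrect answers; about 80% of the math problems are estimated to have incorrect answers.
  9. Nonsensical/unclear instructions: many instructions in the original dataset were unclear; nonsensical instructions were clarified or rewritten.
  10. Extraneous escape and control characters: the original dataset contained extraneous escape and control characters.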
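
Several of the issue categories above can be flagged mechanically. The sketch below is a hypothetical set of heuristics written for illustration; it is not the procedure used by the Alpaca-Cleaned maintainers, and the names `EMPTY_INPUT_MARKERS`, `normalize_input`, and `find_issues`, as well as the regular expressions, are assumptions. The placeholder spellings and the sample values come from the examples in the dataset card.

```python
import re
from statistics import median

# Placeholder spellings the card lists for an `input` that should be empty.
EMPTY_INPUT_MARKERS = {"<no input>", "no input", "noinput", "<noinput>"}
# Assumption: an input consisting only of a URL points at unavailable web data.
URL_ONLY = re.compile(r"^https?://\S+$")
# Assumption: merged records look like "\n18. Instruction: ..." inside an output.
MERGED_MARKER = re.compile(r"\n\d+\.\s*Instruction:")


def normalize_input(value: str) -> str:
    """Map the known placeholder spellings to an empty string."""
    return "" if value.strip().lower() in EMPTY_INPUT_MARKERS else value


def find_issues(record: dict) -> list:
    """Return labels for the heuristic issues detected in one record."""
    issues = []
    raw_input = record.get("input", "").strip()
    output = record.get("output", "")
    if raw_input.lower() in EMPTY_INPUT_MARKERS:
        issues.append("inconsistent_empty_input")
    if URL_ONLY.match(raw_input):
        issues.append("input_is_url")
    if not output.strip():
        issues.append("empty_output")
    if output.strip() in {"N/A", "<No Output>"}:
        issues.append("placeholder_output")
    if MERGED_MARKER.search(output):
        issues.append("merged_instruction")
    return issues


# The two "wrong answer" examples from the card, recomputed.
correct_median = median([1, 2, 4, 5, 8, 9])  # the original entry answered "5"
correct_km = 25 / 1000                       # 25 m in km; the entry answered "25km"
```

Heuristics like these only surface candidates. Wrong answers and nonsensical instructions still need a manual or model-assisted review, since no simple pattern identifies them.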

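Overview of the original Alpaca dataset

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used for instruction-tuning language models so that they follow instructions better.

The authors built on the data generation pipeline of the Self-Instruct framework and made the following modifications:

  • The text-davinci-003 engine was used to generate the instruction data instead of davinci.
  • A new prompt was written that explicitly asked text-davinci-003 to generate instructions.
  • Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
  • The data generation pipeline was simplified by no longer distinguishing between classification and non-classification instructions.
  • Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset of 52,000 examples at a much lower cost (less than $500). A preliminary study found the 52,000 generated examples to be more diverse than the data released by Self-Instruct.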

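Supported tasks and leaderboards

The Alpaca dataset is designed for instruction training of pretrained language models.

Languages

The data in Alpaca are in English (BCP-47 en).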

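Dataset structure

Data instances

An example from "train" looks as follows:

    {
      "instruction": "Create a classification task by clustering the given list of items.",
      "input": "Apples, oranges, bananas, strawberries, pineapples",
      "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
      "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
    }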

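Data fields

The data fields are as follows:

  • instruction: describes the task the model should perform. Each of the 52,000 instructions is unique.
  • input: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
  • output: the answer to the instruction as generated by text-davinci-003.
  • text: the instruction, input, and output formatted with the prompt template the authors used for fine-tuning their models.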
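
As a quick sanity check on the field descriptions above, the following sketch verifies that a record carries the four string fields and computes the share of records with a non-empty input. The helper names `REQUIRED_FIELDS`, `missing_fields`, and `share_with_input` and the toy records are hypothetical; on the full dataset the share would be expected to be roughly the 40% stated above.

```python
REQUIRED_FIELDS = ("instruction", "input", "output", "text")


def missing_fields(record: dict) -> list:
    """Return the required field names that are absent or not strings."""
    return [name for name in REQUIRED_FIELDS if not isinstance(record.get(name), str)]


def share_with_input(records) -> float:
    """Fraction of records whose `input` is non-empty after stripping whitespace."""
    records = list(records)
    if not records:
        return 0.0
    with_input = sum(1 for r in records if r.get("input", "").strip())
    return with_input / len(records)


# Toy records for illustration only; they are not taken from the dataset.
toy = [
    {"instruction": "a", "input": "x", "output": "1", "text": "t"},
    {"instruction": "b", "input": "", "output": "2", "text": "t"},
    {"instruction": "c", "input": "y", "output": "3", "text": "t"},
    {"instruction": "d", "input": " ", "output": "4", "text": "t"},
    {"instruction": "e", "input": "", "output": "5", "text": "t"},
]
print(share_with_input(toy))
```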

Data splits

alpaca (train split): 52002 examples

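Dataset creation

Social impact of the dataset

Excerpt from the blog post accompanying the release of this dataset:

We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models. At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release. Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions. Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies. Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement. We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.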

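Other known limitations

The alpaca data is generated by a language model (text-davinci-003) and inevitably contains some errors or biases. We encourage users to use this data with caution and to propose new methods to filter or improve the imperfections.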

Additional information

Licensing information

The dataset is available under Creative Commons NonCommercial (CC BY-NC 4.0).

Citation information

    @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }
