alpaca-cleaned

Name: alpaca-cleaned
Creator: maas
Published: 2026-05-16 08:18:32
License: No description provided

ModelScope community; updated 2026-05-16; indexed 2024-06-01

Download link:

https://modelscope.cn/datasets/AI-ModelScope/alpaca-cleaned


Resource description:

# Dataset Card for Alpaca-Cleaned

- **Repository:** https://github.com/gururise/AlpacaDataCleaned

## Dataset Description

This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

1. **Hallucinations:** Many instructions in the original dataset referenced data on the internet, which just caused GPT-3 to hallucinate an answer.
    ```
    "instruction": "Summarize the given article in 200 Words.",
    "input": "https://www.bbc.com/news/world-51461830",
    "output": "The recent protests in Belarus are part of a growing wave of discontent against the rule of Alexander Lukashenko, who has led the country since 1994. Demonstrations against his government have been ongoing since August, and come in the wake of an election widely seen as rigged. As police and security forces have cracked down on protesters, the movement has only grown, with hundreds of thousands of people taking to the streets to call for new leadership. Human rights activists have accused the government of using extreme violence in order prosecute protesters, leading to an international outcry and sanctions from the European Union. As the situation in Belarus continues to unfold, the international community is monitoring the situation closely - hoping for a peaceful resolution, while also standing firm against the oppressive tactics employed by the Belarusian state."

    "instruction": "Classify the following shape according to simple geometry.",
    "input": "<Picture Attached>",
    "output": "Triangle"

    "instruction": "Given a recent news article, write a headline for it.",
    "input": "https://www.news.example.com/article-1",
    "output": "\"NASA Launches a Mars 2020 Mission to Search for Signs of Ancient Life\""
    ```
2. **Merged Instructions:** Many instructions in the original dataset were merged together for some reason.
    ```
    "instruction": "Should I invest in stocks?",
    "input": "",
    "output": "It depends on your individual situation. Investing in stocks carries a certain amount of risk but can also offer potential rewards. Make sure to research the different types of stocks, understand the risks and rewards, and develop a plan that works for you.\n##\n18. Instruction: Identify the noun in the following sentence:\n18. Input:\nThe large apple was delicious.\n18. Output:\nApple"
    ```
3. **Empty outputs:** Some entries in the original dataset had empty outputs.
    ```
    "instruction": "Calculate the area of the following shape in square centimeters.",
    "input": "rectangle of size 4 cm x 5 cm",
    "output": ""
    ```
4. **Empty code examples:** Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.
5. **Instructions to generate images:** Some descriptions in the original dataset included instructions to generate images, which a text-only model cannot do.
    ```
    "instruction": "Create a graphic or logo that visually represents the word \"courage\".",
    "input": "",
    "output": "<No Output>"
    ```
6. **N/A outputs:** Some code snippets in the original dataset had N/A outputs.
7. **Inconsistent input field:** The original dataset used the input field inconsistently when it was supposed to be empty.
    ```
    "input":"<no input>"
    "input":"No input"
    "input":"noinput"
    "input":"<noinput>"
    ```
8. **Wrong answers:** Some instructions/questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.
    ```
    "instruction": "Calculate the median of the following data set.",
    "input": "1, 2, 4, 5, 8, 9",
    "output": "5"

    "instruction": "Convert 25m to km.",
    "input": "",
    "output": "25km"
    ```
9. **Nonsensical/Unclear instructions:** Many instructions are unclear; we try to clarify (or rewrite) instructions that are nonsensical.
Instructions that are slightly unclear, but whose meaning can be deduced, are not altered.
    ```
    "instruction": "Freeze the following sample of yogurt for 10 minutes.",
    "input": "Yogurt sample",
    "output": "<noinput>"

    "instruction": "Increase the font size to 12 points.",
    "input": "",
    "output": "The font size has been increased to 12 points."
    ```
10. **Extraneous escape and control characters:** The original dataset had several entries with extraneous escape and control characters.

### Original Alpaca Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.

The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, the authors also found the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).
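Several of the issues listed under Dataset Description (URL-only inputs, placeholder inputs, empty or non-answer outputs, merged instructions) can be flagged mechanically. The following is a minimal sketch of such a check, not the procedure the maintainers of the cleaned dataset actually used. The function names `flag_issues` and `normalize_input`, the issue labels, and the regular expressions are all hypothetical; the marker strings are the ones quoted in the examples above.

```python
import re

# Placeholder strings quoted in the card as inconsistent "empty input" markers.
EMPTY_INPUT_MARKERS = {"<no input>", "no input", "noinput", "<noinput>"}

# Output strings quoted in the card that stand in for a missing answer.
NON_ANSWER_OUTPUTS = {"n/a", "<no output>", "<noinput>"}

# Heuristic (an assumption): a numbered "Instruction:" line inside an output
# suggests a second task was fused into the record.
MERGED_PATTERN = re.compile(r"\n\s*\d+\.\s*Instruction:", re.IGNORECASE)

# Heuristic (an assumption): the whole input is a single URL.
URL_PATTERN = re.compile(r"^https?://\S+$")


def flag_issues(record: dict) -> list:
    """Return heuristic issue labels for one Alpaca-style record."""
    issues = []
    raw_input = record.get("input", "").strip()
    output = record.get("output", "").strip()

    if URL_PATTERN.match(raw_input):
        issues.append("url_input")
    if raw_input.lower() in EMPTY_INPUT_MARKERS:
        issues.append("placeholder_input")
    if not output:
        issues.append("empty_output")
    elif output.lower() in NON_ANSWER_OUTPUTS:
        issues.append("non_answer_output")
    if MERGED_PATTERN.search(output):
        issues.append("merged_instruction")
    return issues


def normalize_input(record: dict) -> dict:
    """Return a copy of the record with placeholder inputs set to ""."""
    cleaned = dict(record)
    if cleaned.get("input", "").strip().lower() in EMPTY_INPUT_MARKERS:
        cleaned["input"] = ""
    return cleaned


sample = {"instruction": "Convert 25m to km.", "input": "<noinput>", "output": ""}
print(flag_issues(sample))  # ['placeholder_input', 'empty_output']
```

Checks like these only catch surface patterns. Wrong answers, such as the median and unit-conversion examples in item 8, need the task to be recomputed or reviewed and are out of reach for string matching.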
### Supported Tasks and Leaderboards

The Alpaca dataset is designed for instruction-tuning pretrained language models.

### Languages

The data in Alpaca are in English (BCP-47 `en`).

## Dataset Structure

### Data Instances

An example of "train" looks as follows:

```json
{
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
    "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 52K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
* `output`: the answer to the instruction as generated by `text-davinci-003`.
* `text`: the `instruction`, `input` and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors for fine-tuning their models.

### Data Splits

|        | train |
|--------|------:|
| alpaca | 52002 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

Excerpt from the [blog post](https://crfm.stanford.edu/2023/03/13/alpaca.html) accompanying the release of this dataset:

> We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models.
>
> At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release.
>
> Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions.
>
> Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies.
> Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement.
>
> We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The `alpaca` data is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode).

### Citation Information

```
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

### Contributions

[More Information Needed]
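The `text` field described under Data Fields can be rebuilt from the other three fields. The sketch below is illustrative rather than the authors' code: the with-input template is copied from the Data Instances example in this card, while the no-input variant is an assumption based on the Stanford Alpaca prompt template and should be checked against that repository before use. The names `PROMPT_WITH_INPUT`, `PROMPT_NO_INPUT`, and `build_text` are hypothetical.

```python
# Template matching the "text" value shown in the Data Instances example.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

# Assumed variant for records whose input is empty (not shown in this card).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_text(record: dict) -> str:
    """Format one record into a single training string: prompt, then output."""
    has_input = bool(record.get("input", "").strip())
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=record["instruction"], input=record.get("input", "")
    )
    return prompt + record["output"]


example = {
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
}
print(build_text(example))
```

Selecting the template by whether `input` is non-empty is one reason the inconsistent placeholder values listed in item 7 matter: a record whose input is `"<noinput>"` would otherwise be formatted as if it carried real context.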


Provider:

maas

Created:

2024-05-09
