
pss8093/alpaca

Source: Hugging Face. Updated 2026-04-17. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/pss8093/alpaca
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- instruction-finetuning
pretty_name: Alpaca
task_categories:
- text-generation
---

# Dataset Card for Alpaca

## Dataset Description

- **Homepage:** https://crfm.stanford.edu/2023/03/13/alpaca.html
- **Repository:** https://github.com/tatsu-lab/stanford_alpaca
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** Rohan Taori

### Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.

The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, the authors also found the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).

### Supported Tasks and Leaderboards

The Alpaca dataset is designed for instruction-tuning pretrained language models.

### Languages

The data in Alpaca are in English (BCP-47 en).
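The cost figures in the Dataset Summary can be turned into rough per-example numbers. The following is only a back-of-the-envelope sketch using the values quoted in the summary (52K examples, less than $500, 20 instructions per batch). The variable names are illustrative. The request count is a lower bound that assumes every request returns a full batch and every generated instruction is kept, neither of which the card states.

```python
import math

# Figures quoted in the Dataset Summary.
NUM_EXAMPLES = 52_000        # "52K examples"
MAX_TOTAL_COST_USD = 500.0   # "less than $500"
BATCH_SIZE = 20              # "generating 20 instructions at once"

# Upper bound on the average cost of one example.
cost_per_example_upper = MAX_TOTAL_COST_USD / NUM_EXAMPLES

# Lower bound on the number of generation requests, ASSUMING full batches
# and no discarded generations (an assumption, not a fact from the card).
min_requests = math.ceil(NUM_EXAMPLES / BATCH_SIZE)

print(f"average cost per example: under ${cost_per_example_upper:.4f}")
print(f"generation requests: at least {min_requests}")
```

Under these assumptions the average cost stays below one cent per example, which is consistent with the card's claim that batch decoding made generation much cheaper.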
## Dataset Structure

### Data Instances

An example of "train" looks as follows:

```json
{
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
    "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 52K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
* `output`: the answer to the instruction as generated by `text-davinci-003`.
* `text`: the `instruction`, `input` and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors for fine-tuning their models.

### Data Splits

|        | train |
|--------|------:|
| alpaca | 52002 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?
[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

Excerpt from the [blog post](https://crfm.stanford.edu/2023/03/13/alpaca.html) accompanying the release of this dataset:

> We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models. At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release.
>
> Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions.
>
> Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies.
> Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement.
>
> We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The `alpaca` data is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases. We encourage users to use this data with caution and to propose new methods to filter or improve the imperfections.

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.

### Citation Information

```
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

### Contributions

[More Information Needed]
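The `text` field described under Data Fields is the other three fields rendered through a prompt template, so it can be rebuilt from them. The following is a minimal sketch, not the authors' code: `build_text`, `PROMPT_WITH_INPUT`, and `PROMPT_NO_INPUT` are hypothetical names introduced here. The with-input wording is copied from the example record in Data Instances. The no-input wording is an assumption modeled on the prompt in the Stanford Alpaca repository and should be checked against it before use.

```python
# Hypothetical helper: rebuilds the `text` field from `instruction`,
# `input`, and `output`. Not part of the dataset or its official tooling.

# Wording taken from the `text` value of the example record in the card.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# ASSUMPTION: wording for records whose `input` is empty. The card's example
# only shows the with-input case.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_text(record: dict) -> str:
    """Return the prompt followed by the reference output for one record."""
    has_input = bool(record.get("input"))
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
    )
    return prompt + record["output"]


# The example record from the card, without its `text` field.
example = {
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
}

print(build_text(example))
```

For the example record, the printed string should match the `text` value shown in Data Instances. Records loaded another way, for instance through the Hugging Face `datasets` library, would presumably carry the same four keys, but that is not verified here.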

Provider:
pss8093