five

Amirmarshal/PersianGPT

收藏
Hugging Face2023-09-04 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/Amirmarshal/PersianGPT
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-nc-4.0 language: - en tags: - instruction-finetuning pretty_name: Alpaca task_categories: - text-generation --- # Dataset Card for Alpaca ## Dataset Description - **Homepage:** https://crfm.stanford.edu/2023/03/13/alpaca.html - **Repository:** https://github.com/tatsu-lab/stanford_alpaca - **Paper:** - **Leaderboard:** - **Point of Contact:** Rohan Taori ### Dataset Summary Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better. The authors built on the data generation pipeline from [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications: - The `text-davinci-003` engine to generate the instruction data instead of `davinci`. - A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`. - Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation. - The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions. - Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct. This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, the authors also found that the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl). ### Supported Tasks and Leaderboards The Alpaca dataset designed for instruction training pretrained language models. ### Languages The data in Alpaca are in English (BCP-47 en). ## Dataset Structure ### Data Instances An example of "train" looks as follows: ```json { "instruction": "Create a classification task by clustering the given list of items.", "input": "Apples, oranges, bananas, strawberries, pineapples", "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples", "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples", } ``` ### Data Fields The data fields are as follows: * `instruction`: describes the task the model should perform. Each of the 52K instructions is unique. * `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input. * `output`: the answer to the instruction as generated by `text-davinci-003`. * `text`: the `instruction`, `input` and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors for fine-tuning their models. ### Data Splits | | train | |---------------|------:| | alpaca | 52002 | ## Dataset Creation ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization [More Information Needed] #### Who are the source language producers? [More Information Needed] ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? [More Information Needed] ### Personal and Sensitive Information [More Information Needed] ## Considerations for Using the Data ### Social Impact of Dataset Excerpt the [blog post](https://crfm.stanford.edu/2023/03/13/alpaca.html) accompanying the release of this dataset: > We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models. At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release. Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions. Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies. Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement. We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models. ### Discussion of Biases [More Information Needed] ### Other Known Limitations The `alpaca` data is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections. ## Additional Information ### Dataset Curators [More Information Needed] ### Licensing Information The dataset is available under the [Creative Commons NonCommercial (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode). ### Citation Information ``` @misc{alpaca, author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto }, title = {Stanford Alpaca: An Instruction-following LLaMA model}, year = {2023}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}}, } ``` ### Contributions [More Information Needed]
提供机构:
Amirmarshal
原始信息汇总

数据集概述:Alpaca

数据集描述

  • 数据集名称: Alpaca
  • 数据集概要: Alpaca是一个包含52,000条指令和演示的数据集,由OpenAI的text-davinci-003引擎生成。该数据集主要用于语言模型的指令微调,以提高模型遵循指令的能力。
  • 语言: 英语(BCP-47 en)
  • 许可: 创意共享非商业性(CC BY-NC 4.0)

数据集结构

数据实例

  • 示例结构: json { "instruction": "Create a classification task by clustering the given list of items.", "input": "Apples, oranges, bananas, strawberries, pineapples", "output": "Class 1: Apples, Oranges Class 2: Bananas, Strawberries Class 3: Pineapples", "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:

Create a classification task by clustering the given list of items.

Input:

Apples, oranges, bananas, strawberries, pineapples

Response:

Class 1: Apples, Oranges Class 2: Bananas, Strawberries Class 3: Pineapples", }

数据字段

  • instruction: 描述模型应执行的任务,每个指令都是唯一的。
  • input: 任务的额外上下文或输入,约40%的示例包含此字段。
  • output: text-davinci-003生成的指令答案。
  • text: 包含instructioninputoutput的格式化文本,用于模型微调。

数据分割

  • 训练集: 52002条记录

数据集创建

数据生成

  • 生成引擎: 使用text-davinci-003引擎生成指令数据。
  • 生成策略: 采用更积极的批量解码,一次生成20条指令,显著降低数据生成成本。
  • 数据简化: 简化了数据生成流程,去除了分类和非分类指令的区别。
  • 实例数量: 每个指令只生成一个实例。

数据集使用考虑

  • 社会影响: 数据集的发布旨在促进学术界对指令遵循语言模型的科学研究,但也存在风险,如模型能力的泄露可能被不良行为者利用。
  • 偏见与限制: 数据由语言模型生成,可能包含错误或偏见,建议用户谨慎使用并探索改进方法。

附加信息

  • 许可证: 数据集遵循创意共享非商业性许可证(CC BY-NC 4.0)。

  • 引用信息:

    @misc{alpaca, author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto }, title = {Stanford Alpaca: An Instruction-following LLaMA model}, year = {2023}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {url{https://github.com/tatsu-lab/stanford_alpaca}}, }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作