HiTZ/alpaca_mt

Name: HiTZ/alpaca_mt
Creator: HiTZ
Published: 2023-04-07 15:15:55
License: cc-by-nc-4.0

Source: Hugging Face. Updated 2023-04-07. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/HiTZ/alpaca_mt


Resource overview:

---
annotations_creators:
- no-annotation
language:
- en
- pt
- es
- ca
- eu
- gl
- at
language_creators:
- machine-generated
license: cc-by-nc-4.0
multilinguality:
- multilingual
- translation
pretty_name: Alpaca MT
size_categories:
- 10K<n<100K
source_datasets:
- tatsu-lab/alpaca
tags:
- instruction-finetuning
task_categories:
- text-generation
task_ids:
- dialogue-modeling
dataset_info:
- config_name: en
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 32088854
    num_examples: 51942
  download_size: 22764890
  dataset_size: 32088854
- config_name: pt
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 33600380
    num_examples: 51942
  download_size: 23513483
  dataset_size: 33600380
- config_name: es
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 35893136
    num_examples: 51942
  download_size: 24483751
  dataset_size: 35893136
- config_name: ca
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 33938638
    num_examples: 51942
  download_size: 23096222
  dataset_size: 33938638
- config_name: eu
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 29977672
    num_examples: 51942
  download_size: 20469814
  dataset_size: 29977672
- config_name: gl
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 32736710
    num_examples: 51942
  download_size: 22356802
  dataset_size: 32736710
- config_name: at
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 31487842
    num_examples: 51942
  download_size: 20688305
  dataset_size: 31487842
---

# Dataset Card for Alpaca MT

## Dataset Description

- **Homepage:** https://crfm.stanford.edu/2023/03/13/alpaca.html
- **Repository:** https://github.com/juletx/alpaca-lora-mt
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** Rohan Taori

### Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.

This dataset also includes machine-translated data for 6 Iberian languages: Portuguese, Spanish, Catalan, Basque, Galician and Asturian. Translation was done using the NLLB-200 3.3B model.

The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500).
In a preliminary study, the authors also found the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).

### Supported Tasks and Leaderboards

The Alpaca dataset is designed for instruction-tuning pretrained language models.

### Languages

The original data in Alpaca is in English (BCP-47 en). We also provide machine-translated data for 6 Iberian languages: Portuguese (BCP-47 pt), Spanish (BCP-47 es), Catalan (BCP-47 ca), Basque (BCP-47 eu), Galician (BCP-47 gl) and Asturian (config name `at`; the BCP-47 tag for Asturian is ast).

## Dataset Structure

### Data Instances

An example of "train" looks as follows:

```json
{
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
    "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 52K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
* `output`: the answer to the instruction as generated by `text-davinci-003`.
* `text`: the `instruction`, `input` and `output` formatted with the [prompt template](https://github.com/tatsu-lab/stanford_alpaca#data-release) used by the authors for fine-tuning their models.
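
The formatted field can be rebuilt from the other three fields. The sketch below is a minimal illustration, not the authors' code: the with-input template is copied from the example record above, while the no-input wording and the `build_text` helper are assumptions made here for illustration.

```python
# Template for records that have an input; wording copied from the example record.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

# Assumption: wording for records without an input (not shown in this card).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def build_text(example: dict) -> str:
    """Format one record with the prompt template (hypothetical helper)."""
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    # str.format ignores keyword arguments the chosen template does not use.
    return template.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
        output=example["output"],
    )


example = {
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
}
text = build_text(example)
```

For fine-tuning, the same helper could be called with an empty `output` to obtain only the prompt part, leaving the response as the training target.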
### Data Splits

|               | train |
|---------------|------:|
| en            | 52002 |
| pt            | 52002 |
| es            | 52002 |
| ca            | 52002 |
| eu            | 52002 |
| gl            | 52002 |
| at            | 52002 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

Excerpt from the [blog post](https://crfm.stanford.edu/2023/03/13/alpaca.html) accompanying the release of this dataset:

> We believe that releasing the above assets will enable the academic community to perform controlled scientific studies on instruction-following language models, resulting in better science and ultimately new techniques to address the existing deficiencies with these models.
>
> At the same time, any release carries some risk. First, we recognize that releasing our training recipe reveals the feasibility of certain capabilities. On one hand, this enables more people (including bad actors) to create models that could cause harm (either intentionally or not). On the other hand, this awareness might incentivize swift defensive action, especially from the academic community, now empowered by the means to perform deeper safety research on such models. Overall, we believe that the benefits for the research community outweigh the risks of this particular release. Given that we are releasing the training recipe, we believe that releasing the data, model weights, and training code incur minimal further risk, given the simplicity of the recipe. At the same time, releasing these assets has enormous benefits for reproducible science, so that the academic community can use standard datasets, models, and code to perform controlled comparisons and to explore extensions.
>
> Deploying an interactive demo for Alpaca also poses potential risks, such as more widely disseminating harmful content and lowering the barrier for spam, fraud, or disinformation. We have put into place two risk mitigation strategies. First, we have implemented a content filter using OpenAI’s content moderation API, which filters out harmful content as defined by OpenAI’s usage policies. Second, we watermark all the model outputs using the method described in Kirchenbauer et al. 2023, so that others can detect (with some probability) whether an output comes from Alpaca 7B. Finally, we have strict terms and conditions for using the demo; it is restricted to non-commercial uses and to uses that follow LLaMA’s license agreement.
>
> We understand that these mitigation measures can be circumvented once we release the model weights or if users train their own instruction-following models. However, by installing these mitigations, we hope to advance the best practices and ultimately develop community norms for the responsible deployment of foundation models.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The `alpaca` data is generated by a language model (`text-davinci-003`) and inevitably contains some errors or biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is available under the [Creative Commons NonCommercial (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode).
### Citation Information

```
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

### Contributions

[More Information Needed]

Provider:

HiTZ

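Original information summary

Dataset overview

Name: Alpaca MT

Languages: English (en), Portuguese (pt), Spanish (es), Catalan (ca), Basque (eu), Galician (gl), Asturian (at)

Language creation: machine-generated

License: Creative Commons NonCommercial (CC BY-NC 4.0)

Multilinguality: multilingual, translation

Size: 10K<n<100K

Source dataset: tatsu-lab/alpaca

Tags: instruction fine-tuning

Task category: text generation

Task ID: dialogue modeling

Dataset structure

Data instances

Fields:
- instruction: describes the task the model should perform.
- input: optional context or input for the task.
- output: the answer to the instruction, as generated by text-davinci-003.
- text: the instruction, input and output formatted with the prompt template the authors used to fine-tune their models.

Data splits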

Language	Train
en	52002
pt	52002
es	52002
ca	52002
eu	52002
gl	52002
at	52002

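Dataset creation

Data generation: the instruction data was generated with the text-davinci-003 engine.
Translation: the data was machine-translated into 6 Iberian languages with the NLLB-200 3.3B model.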
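
As a rough illustration of the translation step, the sketch below maps each dataset configuration to the FLORES-200 language code that NLLB-200 expects and translates a batch of English strings with the Hugging Face `transformers` API. It is an assumption-laden sketch rather than the authors' script: the `facebook/nllb-200-3.3B` checkpoint name, the `translate` helper and the generation settings are chosen here for illustration.

```python
# FLORES-200 codes used by NLLB-200, keyed by this dataset's config names.
NLLB_CODES = {
    "en": "eng_Latn",
    "pt": "por_Latn",
    "es": "spa_Latn",
    "ca": "cat_Latn",
    "eu": "eus_Latn",
    "gl": "glg_Latn",
    "at": "ast_Latn",
}

# The six translated configurations (everything except the English source).
TARGET_CONFIGS = [name for name in NLLB_CODES if name != "en"]


def translate(texts, target_config, model_name="facebook/nllb-200-3.3B"):
    """Translate English strings into one target config (hypothetical helper)."""
    # Imported lazily so the code table above is usable without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=NLLB_CODES["en"])
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # NLLB selects the output language through the first generated token.
    target_id = tokenizer.convert_tokens_to_ids(NLLB_CODES[target_config])
    generated = model.generate(**batch, forced_bos_token_id=target_id, max_new_tokens=512)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["Name three fruits."], "eu")` would download the 3.3B checkpoint, so a smaller NLLB checkpoint may be more practical for a quick trial.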

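Considerations for use

Social impact: the release is intended to support academic research on instruction-following language models while acknowledging the risks involved; for the Alpaca demo, mitigations such as content filtering and output watermarking were put in place.
Biases and limitations: the data was generated by a language model and may contain errors or biases; users are advised to use it with caution and to propose methods for improving it.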
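
One simple way to act on this advice is a rule-based filter applied before training. The sketch below is purely illustrative: both rules and the `keep_example` helper are assumptions made here, and the sample records are toy data written for this example, not taken from the dataset.

```python
from typing import Optional


def keep_example(example: dict, source: Optional[dict] = None) -> bool:
    """Illustrative quality filter; the rules are assumptions, not the authors'."""
    instruction = example.get("instruction", "").strip()
    output = example.get("output", "").strip()
    # Rule 1: a usable record needs both an instruction and an answer.
    if not instruction or not output:
        return False
    # Rule 2: a translated instruction identical to its English source
    # was probably left untranslated.
    if source is not None and instruction == source.get("instruction", "").strip():
        return False
    return True


# Toy records invented for this sketch.
english = {"instruction": "Name three fruits.", "input": "", "output": "Apple, pear, plum."}
candidates = [
    {"instruction": "Nombra tres frutas.", "input": "", "output": "Manzana, pera, ciruela."},
    {"instruction": "Name three fruits.", "input": "", "output": "Manzana, pera, ciruela."},
    {"instruction": "Nombra tres frutas.", "input": "", "output": "   "},
]
kept = [c for c in candidates if keep_example(c, source=english)]
```

Only the first candidate survives: the second was left untranslated and the third has an empty answer. Rules like these catch only surface-level problems; factual errors and biases in the generated text need other methods.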

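Collected summary

Dataset introduction

Construction method

Instruction-tuning datasets are central to improving how well language models follow instructions. The English portion of Alpaca MT was built on the Self-Instruct framework with several modifications. The authors generated the instruction data with OpenAI's text-davinci-003 engine, using a prompt that states the instruction-generation requirements explicitly. Generation used more aggressive batch decoding, producing 20 instructions at once, which significantly reduced the cost of data generation. The pipeline was also simplified by dropping the distinction between classification and non-classification instructions and by generating a single instance per instruction. The result is a diverse instruction-following dataset of 52,000 examples obtained for less than $500, which was then machine-translated into six Iberian languages with NLLB-200 3.3B.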

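Usage

The dataset is mainly used for instruction-tuning pretrained language models. Each language configuration can be loaded directly from the Hugging Face Hub and used for supervised fine-tuning through its uniform instruction-input-output structure. The pre-formatted prompt field can be used directly as model input, which simplifies the training pipeline. Because the data was generated by a language model, it may contain errors or biases, so filtering it before use is advisable. The dataset is released under CC BY-NC 4.0, which restricts it to non-commercial use, and it offers a common basis for comparing and extending multilingual instruction-following models.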
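
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed. The repository id and configuration names come from the dataset card; the `load_config` helper is a hypothetical wrapper written for this example.

```python
# Configuration names listed in the dataset card metadata.
CONFIGS = ("en", "pt", "es", "ca", "eu", "gl", "at")


def load_config(name: str, split: str = "train"):
    """Load one language configuration from the Hub (hypothetical helper)."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset

    return load_dataset("HiTZ/alpaca_mt", name, split=split)
```

For example, `load_config("eu")` would download the Basque configuration and return its train split, after which fields can be read as `dataset[0]["instruction"]`.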

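Background and challenges

Background overview

Instruction tuning aims to improve a language model's ability to follow human instructions and has become a key route to making models practically useful. Alpaca MT was published by HiTZ in 2023 and builds on the Alpaca dataset released by Stanford researchers the same year. The underlying research question is how to generate large-scale, diverse instruction-following data efficiently, lowering training cost and supporting research in multilingual settings. The English data consists of 52,000 instructions generated with text-davinci-003 through a modified Self-Instruct pipeline, and it is extended by machine translation to six Iberian languages, including Portuguese and Spanish. This gives the open-source community a basis for studying instruction tuning, lightweight model training and cross-lingual transfer.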

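Current challenges

The challenges for Alpaca MT fall into two groups: those of the task itself and those of the construction process. On the task side, instruction following requires models to handle complex, open-domain instructions without misreading or misexecuting them, and to balance coherence, safety and diversity in what they generate; the multilingual extension adds the problems of cultural context and semantic equivalence in low-resource languages. On the construction side, relying on a large language model to generate data risks propagating hidden errors and biases, and machine translation can degrade the precision of instructions and language-specific features. The simplified generation pipeline lowers cost but may also reduce the depth of task coverage and the variety of instances, which can limit how well downstream models generalize.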

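Common scenarios

Classic use cases

Instruction tuning has become a key paradigm for improving how well large language models follow human intent. With its multilingual coverage, Alpaca MT provides instruction-output pairs suited to training models to understand and carry out varied tasks. Because the data is machine-translated into six Iberian languages, it supports research on cross-lingual instruction following. Typical uses include building multilingual dialogue systems, running zero-shot or few-shot learning experiments, and evaluating how well models generalize across languages.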

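Academic problems addressed

Alpaca MT addresses the scarcity of multilingual data for instruction-tuning research. Its low-cost generation strategy yields large-scale, diverse instruction data without the expensive manual annotation or the homogeneous data that earlier approaches relied on. The dataset supports the development of instruction-following models for lower-resource languages and provides an empirical basis for studying multilingual alignment, reducing language bias and improving cross-cultural adaptability, which in turn supports open and reproducible research in natural language generation.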

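Practical applications

In deployment, Alpaca MT can support the development of assistants that understand and respond to user instructions in several languages. In educational technology, for example, it could be used to train teaching aids that offer students multilingual question answering and explanations. In customer service, models could handle queries in Portuguese, Spanish and the other covered languages, increasing automation across regions. The dataset can also help build content-creation tools that generate summaries, categorized lists or creative writing in multiple languages.

Recent research on the dataset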