
Bactrian-X

ModelScope community, updated 2025-12-05, indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/MBZUAI/Bactrian-X
Description:
# Dataset Card for "Bactrian-X"

## Table of Contents

- [Dataset Description](#a-dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#b-dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Instances](#data-instances)
  - [Data in 52 Languages](#data-in-52-languages)
- [Dataset Creation](#c-dataset-creation)
- [Considerations for Using the Data](#d-considerations-for-using-the-data)
- [Additional Information](#e-additional-information)

## A. Dataset Description

- **Homepage:** https://github.com/mbzuai-nlp/Bactrian-X
- **Repository:** https://huggingface.co/datasets/MBZUAI/Bactrian-X
- **Paper:** https://arxiv.org/abs/2305.15011

### Dataset Summary

<h3 align="center">
<img src="https://raw.githubusercontent.com/fajri91/eval_picts/master/BactrianX_dataset.jpg" width="950" align="center">
</h3>

The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages. It was built by translating 67K English instructions ([alpaca-52k](https://github.com/tatsu-lab/stanford_alpaca) + [dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data)) into 51 languages with the Google Translate API. The translated instructions were then fed to ChatGPT (`gpt-3.5-turbo`) to obtain natural responses, resulting in 3.4M instruction-response pairs (52 languages × 67K instances = 3.4M instances).

### Languages

We follow the 52 languages of [mBART-50](https://arxiv.org/abs/2008.00401); details can be found [below](#data-in-52-languages).

## B. Dataset Structure

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 67K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
* `output`: the answer to the instruction as generated by `gpt-3.5-turbo`.
* `id`: the unique index of each example within a specific language. The same `id` across splits corresponds to the same original English example.

### Data Instances

An example in `en` looks as follows.

```json
{
  "instruction": "What is the main idea of this text?",
  "input": "The robotics industry has become one of the fastest growing sectors in today’s economy. Robotics technology is being applied in various industries and applications, such as manufacturing, healthcare, and defense.",
  "output": "The main idea of the text is that the robotics industry is rapidly growing and its technology is being used in many different industries and applications.",
  "id": "alpaca-107"
}
```

An example in `ar` looks as follows.

```json
{
  "instruction": "قم بالوصول إلى إعدادات المتصفح الافتراضية على جهاز الكمبيوتر.",
  "input": "",
  "output": "لا يمكن لي كمساعد افتراضي القيام بهذه المهمة كونها تحتاج إلى الوصول إلى الجهاز المادي. يرجى الرجوع إلى دليل المستخدم للكمبيوتر الخاص بك أو البحث على الإنترنت لمعرفة كيفية الوصول إلى إعدادات المتصفح الافتراضية على جهاز الكمبيوتر الخاص بك.",
  "id": "alpaca-41997"
}
```

An example in `id` looks as follows.

```json
{
  "instruction": "Berapa banyak angka penting yang harus digunakan untuk nilai 0,000187?",
  "input": "",
  "output": "Ada tiga angka penting dalam nilai 0,000187.",
  "id": "alpaca-40382"
}
```

An example in `de` looks as follows.

```json
{
  "instruction": "Komponieren Sie ein Musikstück mit einem Titel, der Trauer bezeichnet.",
  "input": "",
  "output": "\"Schmerzendes Herz\"\n\nDas Stück beginnt mit einer langsamen, traurigen Melodie, gespielt von einem Solo-Violoncello. Die Harmonien, die von einem Klavier begleitet werden, sind durcheinander und dissonant, um ein Gefühl von Schmerz und Trauer zu vermitteln.\n\nIm Laufe des Stücks treten andere Instrumente ein, wie Streicher, Flöte und Oboe, um die Melodie zu unterstützen und zu intensivieren. Der Rhythmus bleibt langsam und schwerfällig, um die Schwere des Schmerzes darzustellen.\n\nIn der Mitte des Stücks gibt es einen Höhepunkt, wo die Melodie und Harmonien sich verdichten und ein Gefühl von Überwältigung erzeugen. Der Solo-Violoncello spielt eine traurige und schnelle Phrase, die die Verzweiflung und Isolation des Protagonisten ausdrückt.\n\nDas Stück endet mit einer Wiederholung der ersten Melodie, gespielt von einem Solo-Klavier, während die anderen Instrumente auslaufen. Die Harmonien werden jedoch weicher und sanfter, um einen Hauch von Hoffnung zu vermitteln, dass der Schmerz langsam vergeht.",
  "id": "alpaca-33619"
}
```

An example in `fr` looks as follows.

```json
{
  "instruction": "Les Ceanothus ont-ils des racines peu profondes ?",
  "input": "",
  "output": "En tant qu'IA, je ne peux pas répondre avec une certitude absolue, mais voici des informations générales sur les racines de Ceanothus : \n\nLes Ceanothus ont tendance à avoir un système racinaire étendu et peu profond. Les racines peuvent s'étaler jusqu'à 4 à 6 fois la taille de la plante et s'étendre à environ 60 à 90 cm de profondeur. Cependant, cela peut varier en fonction du type de sol et de la variété de Ceanothus. Certaines variétés de Ceanothus pourraient avoir des racines plus profondes si les conditions s'y prêtent.",
  "id": "dolly-10940"
}
```

### Data in 52 Languages

| No. | Language | Code | Train examples |
| --- | -------- | ---- | -------------- |
| 1 | [Afrikaans](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/af) | af_ZA | 67017 |
| 2 | [Arabic](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ar) | ar_AR | 67017 |
| 3 | [Azerbaijani](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/az) | az_AZ | 67017 |
| 4 | [Bengali](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/bn) | bn_IN | 67017 |
| 5 | [Czech](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/cs) | cs_CZ | 67017 |
| 6 | [German](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/de) | de_DE | 67017 |
| 7 | [English](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/en) | en_XX | 67017 |
| 8 | [Spanish](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/es) | es_XX | 67017 |
| 9 | [Estonian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/et) | et_EE | 67017 |
| 10 | [Persian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/fa) | fa_IR | 67017 |
| 11 | [Finnish](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/fi) | fi_FI | 67017 |
| 12 | [French](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/fr) | fr_XX | 67017 |
| 13 | [Galician](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/gl) | gl_ES | 67017 |
| 14 | [Gujarati](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/gu) | gu_IN | 67017 |
| 15 | [Hebrew](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/he) | he_IL | 67017 |
| 16 | [Hindi](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/hi) | hi_IN | 67017 |
| 17 | [Croatian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/hr) | hr_HR | 67017 |
| 18 | [Indonesian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/id) | id_ID | 67017 |
| 19 | [Italian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/it) | it_IT | 67017 |
| 20 | [Japanese](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ja) | ja_XX | 67017 |
| 21 | [Georgian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ka) | ka_GE | 67017 |
| 22 | [Kazakh](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/kk) | kk_KZ | 67017 |
| 23 | [Khmer](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/km) | km_KH | 67017 |
| 24 | [Korean](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ko) | ko_KR | 67017 |
| 25 | [Lithuanian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/lt) | lt_LT | 67017 |
| 26 | [Latvian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/lv) | lv_LV | 67017 |
| 27 | [Macedonian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/mk) | mk_MK | 67017 |
| 28 | [Malayalam](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ml) | ml_IN | 67017 |
| 29 | [Mongolian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/mn) | mn_MN | 67017 |
| 30 | [Marathi](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/mr) | mr_IN | 67017 |
| 31 | [Burmese](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/my) | my_MM | 67017 |
| 32 | [Nepali](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ne) | ne_NP | 67017 |
| 33 | [Dutch](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/nl) | nl_XX | 67017 |
| 34 | [Polish](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/pl) | pl_PL | 67017 |
| 35 | [Pashto](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ps) | ps_AF | 67017 |
| 36 | [Portuguese](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/pt) | pt_XX | 67017 |
| 37 | [Romanian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ro) | ro_RO | 67017 |
| 38 | [Russian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ru) | ru_RU | 67017 |
| 39 | [Sinhala](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/si) | si_LK | 67017 |
| 40 | [Slovene](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/sl) | sl_SI | 67017 |
| 41 | [Swedish](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/sv) | sv_SE | 67017 |
| 42 | [Swahili](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/sw) | sw_KE | 67017 |
| 43 | [Tamil](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ta) | ta_IN | 67017 |
| 44 | [Telugu](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/te) | te_IN | 67017 |
| 45 | [Thai](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/th) | th_TH | 67017 |
| 46 | [Tagalog](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/tl) | tl_XX | 67017 |
| 47 | [Turkish](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/tr) | tr_TR | 67017 |
| 48 | [Ukrainian](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/uk) | uk_UA | 67017 |
| 49 | [Urdu](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/ur) | ur_PK | 67017 |
| 50 | [Vietnamese](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/vi) | vi_VN | 67017 |
| 51 | [Xhosa](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/xh) | xh_ZA | 67017 |
| 52 | [Chinese](https://huggingface.co/datasets/MBZUAI/Bactrian-X/viewer/zh) | zh_CN | 67017 |

## C. Dataset Creation

1. English Instructions: The English instructions are obtained from [alpaca-52k](https://github.com/tatsu-lab/stanford_alpaca) and [dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data).
2. Instruction Translation: The instructions (and inputs) are translated into 51 languages using the Google Translate API (conducted in April 2023).
3. Output Generation: We generate output from `gpt-3.5-turbo` for each language (conducted in April 2023).

## D. Considerations for Using the Data

### Social Impact of Dataset

NLP for everyone: this dataset helps to democratize cutting-edge instruction-following models in 52 languages.
This dataset also enables the first experiments with a multilingual LoRA-based LLaMA model.

### Discussion of Biases

(1) Translation bias; (2) Potential English-culture bias in the translated dataset.

### Other Known Limitations

The `Bactrian-X` data is generated by a language model (`gpt-3.5-turbo`) and inevitably contains some errors or biases. We encourage users to use this data with caution and to propose new methods for filtering or correcting its imperfections.

## E. Additional Information

### Dataset Curators

[Haonan Li](https://haonan-li.github.io/) and [Fajri Koto](http://www.fajrikoto.com)

### Licensing Information

The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.

### Citation Information

```
@misc{li2023bactrianx,
      title={Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation},
      author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
      year={2023},
      eprint={2305.15011},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@haonan-li](https://github.com/haonan-li) and [@fajri91](https://github.com/fajri91) for adding this dataset.
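
### Example: Turning a Record into a Prompt

The fields listed under Data Fields (`instruction`, optional `input`, `output`, `id`) map naturally onto an instruction-tuning prompt. The sketch below is one possible way to do this; the section markers (`### Instruction:` and so on) and the helper name `build_prompt` are assumptions made for illustration and are not taken from the dataset card or the Bactrian-X code.

```python
from typing import Dict


def build_prompt(record: Dict[str, str]) -> str:
    """Format one record as a prompt that ends where the response should begin.

    The template is a hypothetical, Alpaca-like layout chosen for this sketch.
    """
    instruction = record["instruction"].strip()
    # `input` is optional context; per the card, many records leave it empty.
    context = record.get("input", "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# Record copied from the `id` (Indonesian) example in the card; it has no input.
record = {
    "instruction": "Berapa banyak angka penting yang harus digunakan untuk nilai 0,000187?",
    "input": "",
    "output": "Ada tiga angka penting dalam nilai 0,000187.",
    "id": "alpaca-40382",
}

# A training string is the prompt followed by the reference output.
training_text = build_prompt(record) + record["output"]
print(training_text)
```

Skipping the `### Input:` block when `input` is empty keeps prompts compact for the records that carry no extra context.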

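### Example: Dataset Size and Cross-Language Alignment

Two properties stated in the card lend themselves to a quick check: the total size (52 languages, 67017 training examples each) and the fact that the same `id` refers to the same original English example in every language. The following is a minimal sketch under those stated properties; the function names and the two toy records are hypothetical and are not real rows from the dataset.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def total_examples(num_languages: int, per_language: int) -> int:
    """Total instruction-response pairs across all language splits."""
    return num_languages * per_language


def align_by_id(source: List[Record], target: List[Record]) -> List[Tuple[Record, Record]]:
    """Pair records from two language splits that share the same `id`."""
    target_by_id = {rec["id"]: rec for rec in target}
    return [
        (rec, target_by_id[rec["id"]])
        for rec in source
        if rec["id"] in target_by_id
    ]


# 52 languages x 67017 examples, which the card rounds to "3.4M".
print(total_examples(52, 67017))

# Toy, made-up records for illustration only (not actual dataset rows).
en_split = [
    {"id": "alpaca-1", "instruction": "Name a primary color.", "input": "", "output": "Red."},
    {"id": "alpaca-2", "instruction": "Name a planet.", "input": "", "output": "Mars."},
]
de_split = [
    {"id": "alpaca-1", "instruction": "Nennen Sie eine Grundfarbe.", "input": "", "output": "Rot."},
]

for en_rec, de_rec in align_by_id(en_split, de_split):
    print(en_rec["id"], "|", en_rec["instruction"], "|", de_rec["instruction"])
```

Building a dictionary keyed by `id` keeps the alignment linear in the size of the two splits, which matters when each split holds tens of thousands of records.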
Provider:
maas
Created:
2025-03-17