five

M2Lingual

收藏
魔搭社区2025-12-05 更新2025-02-01 收录
下载链接:
https://modelscope.cn/datasets/ServiceNow-AI/M2Lingual
下载链接
链接失效反馈
官方服务:
资源简介:
# M2Lingual A _**M**ulti-turn_ _**M**ultilingual_ dataset for Instruction Fine-tuning LLMs - [Link](https://arxiv.org/pdf/2406.16783) ## Dataset Summary The M2Lingual dataset is a comprehensive multi-turn multilingual resource designed to facilitate research and development in conversational AI. It encompasses a wide range of conversation scenarios across multiple languages, making it an invaluable asset for training, evaluating, and benchmarking conversational models. The dataset includes diverse tasks such as question answering, task completion, summarization and more. Each entry is annotated with information about the conversation's evolution, including task Evol type, multi-turn Evol type, prompts, and the number of turns. The M2Lingual dataset aims to bridge the gap in multi-turn multilingual conversational data, providing a robust foundation for building more inclusive and effective AI systems that can understand and engage in human-like conversations across languages. Here is a walkthrough example conversation, the first step involves selecting seeds, the second step evolves the seed sample, and the third step sets a multi-turn evolution which leads to a diverse multi-turn multilingual dataset: ![Walkthrough for dataset creation](https://huggingface.co/datasets/ServiceNow-AI/M2Lingual/resolve/main/walkthrough-fig.jpg) ## Dataset Instances Here is an example conversation: ``` { "id": 8605, "conversation": [ { "role": "user", "content": "Formulate an answer to this elaborate question: Identify the base location of the regional airline for which Jim French served as the former chairman and CEO, and explain the significance of this location in the context of the airline's regional operations." }, { "role": "assistant", "content": "Jim French served as the former chairman and CEO of Flybe, a British regional airline. The base location of Flybe was Exeter International Airport in Devon, England. \n\nThe significance of Exeter International Airport as the base location for Flybe lies in its strategic positioning within the United Kingdom. Exeter is situated in the southwest region of England, which allowed Flybe to efficiently connect various regional destinations across the UK and Europe. This location provided a central hub for operations, facilitating shorter flight routes and quick turnaround times, which are essential for regional airlines aiming to offer frequent and reliable services. Additionally, having a base in Exeter helped Flybe tap into a market with fewer competing airlines, allowing it to establish a strong regional presence and foster economic growth within the southwest of England by improving connectivity and accessibility." }, { "role": "user", "content": "Please adjust the output to provide the answer in a concise text-only format." }, { "role": "assistant", "content": "Jim French served as the former chairman and CEO of Flybe, a British regional airline based at Exeter International Airport in Devon, England. The significance of this location lies in its strategic positioning in the southwest of England, enabling efficient regional connectivity across the UK and Europe and fostering economic growth in the region." }, { "role": "user", "content": "Now, rephrase your response as if you were an enthusiastic travel blogger sharing fascinating details with your readers." }, { "role": "assistant", "content": "Hey travel enthusiasts! Did you know that Jim French, the former chairman and CEO of Flybe, once led this fantastic British regional airline based at Exeter International Airport in beautiful Devon, England? It's such a cool spot! Nestled in the southwest of England, Exeter\u2019s prime location made it perfect for Flybe to connect countless regional destinations across the UK and Europe. This strategic hub wasn\u2019t just about convenience; it played a crucial role in boosting regional connectivity and economic growth. So, next time you\u2019re flying through Exeter, remember the incredible role it played in the world of regional air travel! \ud83c\udf0d\u2708\ufe0f" } ], "language": "English", "task": "hotpotqa", "no_of_turns": 2, "evolved_multiturn_prompt": [ "Please adjust the output to provide the answer in a concise text-only format.", "Now, rephrase your response as if you were an enthusiastic travel blogger sharing fascinating details with your readers." ], "multiturn_evol_type": [ "ADJUST_OUTPUT_FORMAT", "PERSONA_REWRITING_FOLLOW_UP" ], "multiturn_evol_prompt": [ { "ADJUST_OUTPUT_FORMAT": "- The follow-up instruction must ask to adjust the output format as users may request specific formats for the output, such as text-only, summarized results, or structured data formats." }, { "PERSONA_REWRITING_FOLLOW_UP": "- The follow-up instruction must request a change in the persona of the response to the previous instruction." } ], "evolved_user_prompt": "Formulate an answer to this elaborate question: Identify the base location of the regional airline for which Jim French served as the former chairman and CEO, and explain the significance of this location in the context of the airline's regional operations.", "output_assistant_reply": "Jim French served as the former chairman and CEO of Flybe, a British regional airline. The base location of Flybe was Exeter International Airport in Devon, England. \n\nThe significance of Exeter International Airport as the base location for Flybe lies in its strategic positioning within the United Kingdom. Exeter is situated in the southwest region of England, which allowed Flybe to efficiently connect various regional destinations across the UK and Europe. This location provided a central hub for operations, facilitating shorter flight routes and quick turnaround times, which are essential for regional airlines aiming to offer frequent and reliable services. Additionally, having a base in Exeter helped Flybe tap into a market with fewer competing airlines, allowing it to establish a strong regional presence and foster economic growth within the southwest of England by improving connectivity and accessibility.", "seed_prompt": "Formulate an answer to this elaborate question: Where is the regional airline based that Jim French is the former chairman and CEO of?", "task_evol_prompt": "Given a prompt #Given Prompt# based upon the #Given Prompt# create a #New Prompt# in the same language by combining multiple facts thus making the question more complex and requiring combining multiple facts to answer correctly.\n#Given Prompt#:\nFormulate an answer to this elaborate question: Where is the regional airline based that Jim French is the former chairman and CEO of?", "task_evol_type": "COMBINE_FACTS" } ``` ## Data Fields “id”: Sample ID. “conversation”: The full conversation with user and assistant turns. “evolved_multiturn_prompt”: The generated user turns from GPT-4 after using its corresponding multiturn _Evol_. “evolved_user_prompt”: The generated instruction from GPT-4 after using its corresponding evol. “language”: Language of the conversation. “multiturn_evol_type”: The type of multiturn _Evol_ used to prompt GPT-4. “multiturn_evol_prompt”: Multiturn _Evol_ prompt to GPT-4. “no_of_turns”: Number of turns in the conversation. “output_assistant_reply”: GPT-4 output “sample_type”: Reflects whether the row is a seed sample, _Evol_ sample or a conversation. “seed_prompt”: The seed prompt from which the conversation was generated. “task_evol_prompt”: _Evol_ prompt to GPT-4. “task_evol_type”: The type of _Evol_ used to prompt GPT-4. “task”: The NLP task category of the seed prompt. ## Statistics The total number of data points is 182K IR pairs. The table shows more details, with Avg Inst and Response showing the average no of tokens: | **Dataset** | **Seed** | **Evoled** | **Multi-turn** | |------------------|:---------:|:-----------:|:---------------:| | Aya Dataset | 7000 | 37803 | 36969 | | Aya Collection | 7140 | 57145 | 34426 | | Total IR pairs | 14140 | 94948 | 71395 | | Avg Instruction | 49.6 | 107.71 | 356.81 | | Avg Response | 56.79 | 62.16 | 87.6 | ## Load with Datasets To load this dataset with Datasets, you'll need to install Datasets as pip install datasets --upgrade and then use the following code: ```py from datasets import load_dataset dataset = load_dataset("ServiceNow-AI/M2Lingual", "full_data") ``` In the above code snippet, "full_data" refers to the complete dataset, including seeds, evols and multi-turn conversations. You can load other subsets by specifying the name ('seed_data', 'seed_evol_data' or 'full_data') at the time of loading the dataset. Please adhere to the licenses specified for this dataset. ## Additional Information ### Authorship Publishing Organization: ServiceNow AI Industry Type: Tech Contact Details: https://www.servicenow.com/now-platform/generative-ai.html ### Intended use and License Our dataset is licensed through CC-by-NC-SA-4.0 license. More details on the license terms can be found here: [CC BY-NC-SA 4.0 Deed](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) The dataset is intended to be used to improve model performance towards multilingual natural language understanding. Portions of the dataset pertaining to seeds from the Aya collection are licensed under the Apache License, Version 2.0. You may obtain a copy of the License at [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html). This portion contains some files that have been modified. You may not use this file except in compliance with the License. ### Dataset Version and Maintenance Maintenance Status: Actively Maintained Version Details:    Current version: 1.0    Last Update: 06/2024    First Release: 06/2024 ### Citation Info M2Lingual - [Paper link](https://arxiv.org/pdf/2406.16783) ``` @misc{maheshwary2024m2lingual, title={M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models}, author={Rishabh Maheshwary and Vikas Yadav and Hoang Nguyen and Khyati Mahajan and Sathwik Tejaswi Madhusudhan}, year={2024}, eprint={2406.16783}, archivePrefix={arXiv}, } ```

# M2Lingual 面向大语言模型(LLM)指令微调的多轮次多语言数据集 —— [论文链接](https://arxiv.org/pdf/2406.16783) ## 数据集概览 M2Lingual数据集是一款全面的多轮次多语言资源,专为推动对话式人工智能的研究与开发而打造。该数据集覆盖多语言场景下的各类对话范式,实为训练、评估与基准测试对话模型的宝贵资产。数据集涵盖问答、任务完成、摘要生成等多元自然语言处理任务,每条数据均标注了对话演化相关信息,包括任务演化类型、多轮演化类型、提示词及对话轮次。 M2Lingual数据集旨在填补多轮次多语言对话数据领域的研究空白,为构建更具包容性与高效性的人工智能系统提供坚实支撑,使此类系统能够跨语言理解并开展类人类对话。 以下为数据集构建流程示例:第一步选取种子样本,第二步对种子样本执行演化操作,第三步触发多轮演化流程,最终生成多样化的多轮次多语言数据集: ![数据集构建流程示意图](https://huggingface.co/datasets/ServiceNow-AI/M2Lingual/resolve/main/walkthrough-fig.jpg) ## 数据集实例 以下为一段对话示例: json { "id": 8605, "conversation": [ { "role": "user", "content": "Formulate an answer to this elaborate question: Identify the base location of the regional airline for which Jim French served as the former chairman and CEO, and explain the significance of this location in the context of the airline's regional operations." }, { "role": "assistant", "content": "Jim French served as the former chairman and CEO of Flybe, a British regional airline. The base location of Flybe was Exeter International Airport in Devon, England. The significance of Exeter International Airport as the base location for Flybe lies in its strategic positioning within the United Kingdom. Exeter is situated in the southwest region of England, which allowed Flybe to efficiently connect various regional destinations across the UK and Europe. This location provided a central hub for operations, facilitating shorter flight routes and quick turnaround times, which are essential for regional airlines aiming to offer frequent and reliable services. Additionally, having a base in Exeter helped Flybe tap into a market with fewer competing airlines, allowing it to establish a strong regional presence and foster economic growth within the southwest of England by improving connectivity and accessibility." }, { "role": "user", "content": "Please adjust the output to provide the answer in a concise text-only format." }, { "role": "assistant", "content": "Jim French served as the former chairman and CEO of Flybe, a British regional airline based at Exeter International Airport in Devon, England. The significance of this location lies in its strategic positioning in the southwest of England, enabling efficient regional connectivity across the UK and Europe and fostering economic growth in the region." }, { "role": "user", "content": "Now, rephrase your response as if you were an enthusiastic travel blogger sharing fascinating details with your readers." }, { "role": "assistant", "content": "Hey travel enthusiasts! Did you know that Jim French, the former chairman and CEO of Flybe, once led this fantastic British regional airline based at Exeter International Airport in beautiful Devon, England? It's such a cool spot! Nestled in the southwest of England, Exeter’s prime location made it perfect for Flybe to connect countless regional destinations across the UK and Europe. This strategic hub wasn’t just about convenience; it played a crucial role in boosting regional connectivity and economic growth. So, next time you’re flying through Exeter, remember the incredible role it played in the world of regional air travel! ud83cudf0du2708ufe0f" } ], "language": "English", "task": "hotpotqa", "no_of_turns": 2, "evolved_multiturn_prompt": [ "Please adjust the output to provide the answer in a concise text-only format.", "Now, rephrase your response as if you were an enthusiastic travel blogger sharing fascinating details with your readers." ], "multiturn_evol_type": [ "ADJUST_OUTPUT_FORMAT", "PERSONA_REWRITING_FOLLOW_UP" ], "multiturn_evol_prompt": [ { "ADJUST_OUTPUT_FORMAT": "- The follow-up instruction must ask to adjust the output format as users may request specific formats for the output, such as text-only, summarized results, or structured data formats." }, { "PERSONA_REWRITING_FOLLOW_UP": "- The follow-up instruction must request a change in the persona of the response to the previous instruction." } ], "evolved_user_prompt": "Formulate an answer to this elaborate question: Identify the base location of the regional airline for which Jim French served as the former chairman and CEO, and explain the significance of this location in the context of the airline's regional operations.", "output_assistant_reply": "Jim French served as the former chairman and CEO of Flybe, a British regional airline. The base location of Flybe was Exeter International Airport in Devon, England. The significance of Exeter International Airport as the base location for Flybe lies in its strategic positioning within the United Kingdom. Exeter is situated in the southwest region of England, which allowed Flybe to efficiently connect various regional destinations across the UK and Europe. This location provided a central hub for operations, facilitating shorter flight routes and quick turnaround times, which are essential for regional airlines aiming to offer frequent and reliable services. Additionally, having a base in Exeter helped Flybe tap into a market with fewer competing airlines, allowing it to establish a strong regional presence and foster economic growth within the southwest of England by improving connectivity and accessibility.", "seed_prompt": "Formulate an answer to this elaborate question: Where is the regional airline based that Jim French is the former chairman and CEO of?", "task_evol_prompt": "Given a prompt #Given Prompt# based upon the #Given Prompt# create a #New Prompt# in the same language by combining multiple facts thus making the question more complex and requiring combining multiple facts to answer correctly. #Given Prompt#: Formulate an answer to this elaborate question: Where is the regional airline based that Jim French is the former chairman and CEO of?", "task_evol_type": "COMBINE_FACTS" } ## 数据字段说明 - `id`: 样本唯一标识符 - `conversation`: 包含用户与助手轮次的完整对话序列 - `evolved_multiturn_prompt`: 经对应多轮演化操作后由GPT-4生成的用户指令序列 - `evolved_user_prompt`: 经对应演化操作后由GPT-4生成的用户指令 - `language`: 对话所使用的自然语言 - `multiturn_evol_type`: 用于引导GPT-4的多轮演化类型标签 - `multiturn_evol_prompt`: 用于引导GPT-4生成多轮对话的演化提示词 - `no_of_turns`: 对话总轮次 - `output_assistant_reply`: GPT-4生成的助手回复内容 - `sample_type`: 用于标识样本类型,包括种子样本、演化样本或对话样本 - `seed_prompt`: 生成该对话的初始种子提示词 - `task_evol_prompt`: 用于引导GPT-4生成演化任务的提示词 - `task_evol_type`: 用于引导GPT-4的任务演化类型标签 - `task`: 种子提示词对应的自然语言处理任务类别 ## 统计信息 本数据集总计包含182K个交互样本对。下表展示了细分维度的统计详情,其中Avg Instruction与Avg Response分别代表指令与回复的平均Token数: | **数据集** | **种子样本** | **演化样本** | **多轮对话样本** | |:---------------:|:----------:|:----------:|:--------------:| | Aya Dataset | 7000 | 37803 | 36969 | | Aya Collection | 7140 | 57145 | 34426 | | 总交互样本对 | 14140 | 94948 | 71395 | | 平均指令长度(Token) | 49.6 | 107.71 | 356.81 | | 平均回复长度(Token) | 56.79 | 62.16 | 87.6 | ## 使用Datasets库加载 若需使用Hugging Face Datasets库加载本数据集,请先通过以下命令安装并升级Datasets库: bash pip install datasets --upgrade 随后使用如下代码加载数据集: py from datasets import load_dataset dataset = load_dataset("ServiceNow-AI/M2Lingual", "full_data") > 注:上述代码中的`full_data`指代完整数据集,包含种子样本、演化样本及多轮对话样本。您可通过指定不同子集名称(`seed_data`、`seed_evol_data`或`full_data`)加载对应细分数据集。使用本数据集时请严格遵循其授权协议。 ## 附加信息 ### 出版方与联系方式 出版机构:ServiceNow AI 行业类别:科技 联系详情:[ServiceNow生成式AI平台页面](https://www.servicenow.com/now-platform/generative-ai.html) ### 使用意图与授权协议 本数据集采用CC BY-NC-SA 4.0授权协议,授权条款详情可查阅[CC BY-NC-SA 4.0协议说明](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)。 本数据集旨在用于提升大语言模型在多语言自然语言理解任务中的性能表现。 本数据集内源自Aya Collection的种子样本部分采用Apache License 2.0协议授权,您可至[Apache License 2.0协议页面](https://www.apache.org/licenses/LICENSE-2.0.html)获取协议副本。该部分包含部分经过修改的文件,仅在符合协议条款的前提下方可使用。 ### 数据集版本与维护状态 维护状态:持续维护 当前版本:1.0 最后更新时间:2024年6月 首次发布时间:2024年6月 ### 引用信息 M2Lingual —— [论文链接](https://arxiv.org/pdf/2406.16783) bibtex @misc{maheshwary2024m2lingual, title={M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models}, author={Rishabh Maheshwary and Vikas Yadav and Hoang Nguyen and Khyati Mahajan and Sathwik Tejaswi Madhusudhan}, year={2024}, eprint={2406.16783}, archivePrefix={arXiv}, }
提供机构:
maas
创建时间:
2025-01-29
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作