CC-100 monolingual dataset
arXiv | Updated 2024-07-02 | Indexed 2024-07-04
Download link:
https://huggingface.co/datasets/CohereForAI/aya
Overview:
The CC-100 monolingual dataset was created by Zoom Video Communications, Inc. It is designed to generate high-quality, diverse multilingual instruction fine-tuning datasets by combining English-centric large language models with monolingual corpora. The dataset comprises approximately 500,000 instruction-response pairs covering Telugu, Hindi, Japanese, and Spanish. During creation, high-quality text segments were selected and a scoring function was used to ensure data quality. The dataset is primarily used to improve the understanding and generation capabilities of large language models in non-English contexts, especially for multilingual summarization and machine translation tasks.
Provided by:
Zoom Video Communications
Created:
2024-07-02
Collected summary
Dataset introduction

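Construction method
The CC-100 monolingual dataset is built to address the language imbalance in multilingual instruction fine-tuning (IFT) datasets. It combines an English-centric large language model (LLM), monolingual corpora, and a scoring function to produce a high-quality, diverse multilingual IFT dataset. Construction proceeds in five stages: selecting responses, translating the responses into English, generating English instructions, scoring, and translating the instructions back into the original language. First, text segments are extracted from the monolingual corpus of each non-English language, and the English-centric LLM generates pseudo-instructions for them. A scoring function then filters the results and identifies high-quality examples, ensuring the quality and diversity of the resulting dataset.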
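The five stages can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: the function `build_ift_examples`, the `IFTExample` record, and all four callables (`to_english`, `generate_instruction`, `score`, `from_english`) are hypothetical names, and keeping only scores strictly above the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class IFTExample:
    instruction: str  # instruction in the original (non-English) language
    response: str     # text segment taken from the monolingual corpus
    score: float      # quality score assigned in English space


def build_ift_examples(
    segments: Iterable[str],
    to_english: Callable[[str], str],            # hypothetical translator
    generate_instruction: Callable[[str], str],  # hypothetical English LLM call
    score: Callable[[str, str], float],          # hypothetical scoring function
    from_english: Callable[[str], str],          # hypothetical back-translator
    threshold: float,
) -> List[IFTExample]:
    examples = []
    for response in segments:                               # 1. select a response
        response_en = to_english(response)                  # 2. translate it into English
        instruction_en = generate_instruction(response_en)  # 3. generate an English instruction
        quality = score(instruction_en, response_en)        # 4. score the pair
        if quality <= threshold:                            # assumed: keep strictly above
            continue
        instruction = from_english(instruction_en)          # 5. translate the instruction back
        examples.append(IFTExample(instruction, response, quality))
    return examples


# Toy stand-ins so the sketch runs without any model or translation service.
demo = build_ift_examples(
    segments=["hola mundo", "x"],
    to_english=lambda s: s.upper(),
    generate_instruction=lambda s: f"Write a text about: {s}",
    score=lambda instruction, response: float(len(response)),
    from_english=lambda s: s.lower(),
    threshold=3.0,
)
```

Passing the translator, generator, and scorer as arguments keeps the sketch independent of any particular model or service. Note that the original-language response is stored unchanged, so only the instruction passes through translation.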
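Features
The CC-100 monolingual dataset is characterized by high quality, diversity, and natural-sounding language. By drawing on an English-centric LLM and monolingual corpora, it captures the linguistic and cultural characteristics unique to each language, which improves performance and accuracy in multilingual applications. The scoring function additionally ensures the quality of the generated IFT examples, making the dataset more effective. Experimental results show that models fine-tuned on this dataset achieve significant improvements on both generative and discriminative tasks compared with models fine-tuned on translation-based and template-based datasets.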
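Usage
Using the CC-100 monolingual dataset involves the following steps. First, responses are selected from the monolingual corpus of each language. The responses are then translated into English, and an English-centric LLM generates instructions for them. Next, a scoring function evaluates the quality of the generated instructions, and only instructions scoring above a predefined threshold are kept. Finally, the English instructions are translated back into the original language to form training pairs. During fine-tuning, the LLM is trained on these pairs to generate natural, accurate text corresponding to the original responses. This approach helps improve the LLM's language understanding in non-English contexts and thereby its overall performance.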
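As a rough illustration of the last step, the training pairs could be serialized as JSON Lines before fine-tuning. The helper `to_jsonl` and the field names `language`, `instruction`, and `response` are assumptions made for this sketch; the source does not specify a file format or schema.

```python
import json
from typing import Iterable, List, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, str]], language: str) -> List[str]:
    """Turn (instruction, response) pairs into JSON Lines strings."""
    lines = []
    for instruction, response in pairs:
        record = {
            "language": language,        # assumed field name
            "instruction": instruction,  # back-translated instruction
            "response": response,        # original corpus segment
        }
        # ensure_ascii=False keeps non-Latin scripts readable in the output.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


lines = to_jsonl([("この記事を要約してください。", "本文")], language="ja")
```

Each line is a self-contained JSON object, which most fine-tuning toolchains can stream without loading the whole file.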
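Background and challenges
Background overview
In natural language processing, large language models (LLMs) have made significant progress in instruction following. However, existing instruction fine-tuning (IFT) datasets are concentrated in English, which limits model performance in other languages. To address this, the researchers proposed a new method that uses an English-centric LLM, monolingual corpora, and a scoring function to create high-quality, diverse multilingual IFT datasets. By combining the capabilities of the English-centric LLM with the linguistic and cultural characteristics unique to each monolingual corpus, the method improves model performance and accuracy in multilingual applications. Experiments show that LLMs fine-tuned on these IFT datasets achieve significant improvements on both generative and discriminative tasks, indicating better language understanding in non-English contexts.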
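Current challenges
The main challenge in building multilingual IFT datasets is preserving the naturalness of the language while keeping the instructions diverse. Traditional approaches, such as translating existing English IFT datasets or converting existing NLP datasets into IFT datasets through templates, struggle to capture linguistic nuance and to ensure instruction diversity. Translation can introduce errors that degrade performance when the LLM is fine-tuned. Template-based methods avoid translation errors but have difficulty producing highly diverse instructions. To address these problems, the researchers proposed a new method that uses an English-centric LLM and the monolingual corpus of each non-English language to create high-quality IFT datasets. A scoring function is also used to control the quality of the generated IFT examples, ensuring the diversity and accuracy of the dataset.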
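The diversity problem with templates can be made concrete with a toy example. Everything here is hypothetical: the template string, the `article` field, and the `distinct_prefix_ratio` measure are illustrative choices, not metrics or data from the source.

```python
from typing import Dict, List

# Hypothetical template of the kind used to convert an NLP dataset into IFT form.
TEMPLATE = "Summarize the following article: {article}"


def templated_instructions(records: List[Dict[str, str]]) -> List[str]:
    """Apply the single fixed template to every record."""
    return [TEMPLATE.format(article=r["article"]) for r in records]


def distinct_prefix_ratio(instructions: List[str], n_words: int = 4) -> float:
    """Fraction of distinct opening phrases (first n_words words)."""
    if not instructions:
        return 0.0
    prefixes = {" ".join(text.split()[:n_words]) for text in instructions}
    return len(prefixes) / len(instructions)


records = [{"article": "A"}, {"article": "B"}, {"article": "C"}]
ratio = distinct_prefix_ratio(templated_instructions(records))
```

Because every instruction starts with the same templated phrase, the ratio shrinks as the dataset grows, which is the lack of diversity described above.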
Common scenarios
Classic use cases
The CC-100 monolingual dataset is a multilingual dataset designed to enhance the instruction-following capabilities of large language models (LLMs) across various languages. It is particularly useful for researchers and developers who aim to fine-tune LLMs to understand and generate text in languages other than English. By leveraging English-focused LLMs and monolingual corpora, the dataset provides a diverse set of instruction-response pairs that capture the linguistic nuances of each language, thereby improving the performance of LLMs in non-English contexts.
Academic problems addressed
The CC-100 dataset addresses the significant disparity in Instruction Fine-Tuning (IFT) datasets, which are predominantly in English, limiting model performance in other languages. The dataset solves the problem of creating multilingual IFT datasets by preserving the nuances of languages and ensuring prompt diversity. This is achieved by leveraging English-focused LLMs and the availability of monolingual corpora in each non-English language. The dataset also employs a scoring function to control the quality of generated IFT examples. By relying on English-focused LLMs, the dataset taps into their extensive capabilities and transfers these abilities across diverse linguistic contexts. Utilizing monolingual corpora allows the dataset to capture the unique linguistic and cultural nuances of each language, enhancing performance and accuracy in multilingual applications. Additionally, the robust scoring function ensures that the knowledge and capabilities derived from English-centric LLMs are appropriately adapted and optimized for non-English languages.
Derived and related work
The CC-100 dataset has inspired several related works in the field of multilingual NLP. For example, the Aya model, which is an instruction-tuned open-access multilingual language model, was developed using the dataset. The dataset has also been used to fine-tune other non-English focused LLMs, such as the Japanese-focused model RakutenAI-7B-instruct and the state-of-the-art multilingual LLM Aya-23. These works demonstrate the effectiveness of the CC-100 dataset in improving the performance of LLMs in non-English contexts.
The content above was collected and summarized by 遇见数据集.



