
akoksal/muri-it

Hugging Face | Updated 2024-12-17 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/akoksal/muri-it
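Overview:
MURI-IT is a large-scale multilingual instruction-tuning dataset containing 2.2 million instruction-output pairs across 200 languages. Multilingual Reverse Instructions (MURI) ensure that the outputs are human-written, high-quality, and faithful to the cultural and linguistic characteristics of the source language. The main construction steps are: extract high-quality text from CulturaX and Wikipedia, translate it into English, generate reverse instructions with LLMs, and translate the instructions back into the source language. The dataset format comprises input, output, dataset name, subdataset name, language code, language name, and dataset split. The dataset also provides detailed statistics by language and by source.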

MURI-IT is a large-scale multilingual instruction-tuning dataset containing 2.2 million instruction-output pairs across 200 languages. It addresses the challenges of instruction tuning in low-resource languages through Multilingual Reverse Instructions (MURI), a method that ensures the output is human-written, high-quality, and authentic to the cultural and linguistic nuances of the source language. The dataset is constructed by extracting high-quality human-written raw text from CulturaX and Wikipedia, translating it into English, applying reverse instructions to generate instructions for the raw text via LLMs, and then translating the generated instructions back into the source language. Each entry consists of an input instruction, output text, dataset name, subdataset name, ISO 639-3 language code, language name, and dataset split (train, validation, or test). The dataset sources include Multilingual Reverse Instructions, Wikipedia, CulturaX, WikiHow, NLP tasks, etc., totaling 200 languages and 2,228,499 examples.
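
The construction steps and entry format described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names on `MuriRecord`, the function `reverse_instruction_record`, the `translate` and `generate_instruction` callables, and the default `dataset_name` and `subdataset_name` values are all hypothetical, and the stubs below stand in for a real machine-translation system and LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MuriRecord:
    """One entry, mirroring the fields listed in the description (names assumed)."""
    input: str            # instruction, in the source language
    output: str           # original human-written text, left unchanged
    dataset_name: str
    subdataset_name: str
    language: str         # ISO 639-3 code, e.g. "ind"
    language_name: str
    split: str            # "train", "validation", or "test"


def reverse_instruction_record(
    raw_text: str,
    language: str,
    language_name: str,
    translate: Callable[[str, str, str], str],    # (text, src, tgt) -> text
    generate_instruction: Callable[[str], str],   # English text -> English instruction
    dataset_name: str = "MURI",                   # hypothetical label
    subdataset_name: str = "wikipedia",           # hypothetical label
    split: str = "train",
) -> MuriRecord:
    # Step 1: translate the human-written source text into English.
    english_text = translate(raw_text, language, "eng")
    # Step 2: ask an LLM for an instruction that the text would answer.
    english_instruction = generate_instruction(english_text)
    # Step 3: translate the instruction back into the source language.
    instruction = translate(english_instruction, "eng", language)
    # The output is the original raw text, never a translated or generated one.
    return MuriRecord(
        input=instruction,
        output=raw_text,
        dataset_name=dataset_name,
        subdataset_name=subdataset_name,
        language=language,
        language_name=language_name,
        split=split,
    )


# Stubs standing in for a real translation system and LLM (assumptions).
def fake_translate(text: str, src: str, tgt: str) -> str:
    return f"[{src}->{tgt}] {text}"


def fake_llm(english_text: str) -> str:
    return "Write a short paragraph about this topic."


record = reverse_instruction_record(
    "Jakarta adalah ibu kota Indonesia.", "ind", "Indonesian",
    fake_translate, fake_llm,
)
```

The point of the sketch is that only the instruction passes through translation and generation; the `output` field keeps the original source-language text, which is what the description means by human-written output.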
Provided by:
akoksal