Anh

Name: Anh
Creator: maas
Published: 2025-12-05 16:54:47
License: No description available

Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/laion/Anh


Description:

## Anh multilingual chat dataset

This dataset contains about 24M multilingual synthetic instructions intended for continued pretraining and finetuning of a chatbot.

- cross_lingual.jsonl (~800000): contains both the multilingual and the cross-lingual version of the Anh data in the form `Human: instruction\nAssistant: response`, described here: https://github.com/LAION-AI/Anh/tree/main/data . The data is translated from a portion of the OIG dataset, which includes synthic_qa, prosocial and anthropic data. Read more about the data in LAION's OIG hf repo. Covers these languages: zh, vi, ru, ms, pt, ja, id, hi, fr, es, de.
- xp3_sample.jsonl (~650000): contains a portion of the xp3 dataset converted into the standard Human/Assistant format. See https://huggingface.co/datasets/bigscience/xP3 for the 43 languages covered by xp3.
- sungai_ul2_instructions.jsonl (~23000000): contains a UL2-like instruction set based on 140 languages from a subset of cc100, OSCAR and mc4. The individual datasets from which this UL2 version was created are available here: https://github.com/ontocord/sungai

## Disclaimer

- Translations may be inaccurate. The web text found in the UL2 file may contain inappropriate content, as it is based on web-scraped data.
- Translations were generated by M2M 12B, and the output generations were limited to 512 tokens because of the VRAM limit (40G).

## License

The Anh dataset that is authored by LAION volunteers is released under an Apache 2.0 license. However, the data also includes content licensed under other permissive licenses, or web-crawled data which is used under fair use principles.

## Acknowledgement

- Thanks to LAION's Anh multilingual chat team: @yp_yurilee, @cahya, @kevin ko, @lasse, @mattdf, @theblackcat102, @yongzx, @acul3, @logus2, @paulovn, and many others.
- Thanks to @rallio67 for the original English version of the cross_lingual dataset.
- Thanks to @theblackcat102 for his translations at https://huggingface.co/datasets/theblackcat102/instruction_translations, on which the cross-lingual data is based.
- Thanks to the authors of all the underlying datasets on which Anh is based, including the xp3, OSCAR, cc100 and mc4 authors.
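
The `Human: instruction\nAssistant: response` layout described for the `.jsonl` files can be split into turns with a few lines of Python. The following is a minimal sketch, not the authors' own loader. It assumes each JSON line is an object whose dialogue text sits under a key named `text`; that key name is hypothetical, so check a few records of the actual files and pass the right `field` value. The helper names `split_turns` and `iter_records` are likewise illustrative.

```python
import json
from typing import Iterable, Iterator, List, Tuple

SPEAKERS = ("Human", "Assistant")


def split_turns(text: str) -> List[Tuple[str, str]]:
    """Split 'Human: ...\\nAssistant: ...' text into (speaker, utterance) pairs."""
    turns: List[Tuple[str, str]] = []
    speaker = None
    buffer: List[str] = []
    for line in text.split("\n"):
        for tag in SPEAKERS:
            prefix = tag + ":"
            if line.startswith(prefix):
                # A new speaker tag closes the previous turn, if any.
                if speaker is not None:
                    turns.append((speaker, "\n".join(buffer).strip()))
                speaker = tag
                buffer = [line[len(prefix):]]
                break
        else:
            # Continuation line: belongs to the turn currently being read.
            if speaker is not None:
                buffer.append(line)
    if speaker is not None:
        turns.append((speaker, "\n".join(buffer).strip()))
    return turns


def iter_records(lines: Iterable[str], field: str = "text") -> Iterator[List[Tuple[str, str]]]:
    """Yield parsed turns for each non-empty JSON line.

    `field` is an assumed key name, not confirmed by the dataset card.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield split_turns(record[field])
```

To stream a file without loading it into memory, pass an open file handle, for example `iter_records(open("cross_lingual.jsonl", encoding="utf-8"))`. Reading line by line matters mostly for `sungai_ul2_instructions.jsonl`, which holds roughly 23 million records.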


Provider: maas

Created: 2025-10-15
