Anh
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/laion/Anh
## Anh multilingual chat dataset
This dataset contains about 24M multilingual synthetic instructions intended for continued pretraining and finetuning of a chatbot.
- cross_lingual.jsonl (~800000)
This dataset contains both the multilingual and cross-lingual versions of the Anh data in the form `Human: instruction\nAssistant: response`, described here:
https://github.com/LAION-AI/Anh/tree/main/data
The data is translated from a portion of the OIG dataset, which includes synthic_qa, prosocial and anthropic data. Read more about the data in LAION's OIG Hugging Face repo.
Languages covered: zh, vi, ru, ms, pt, ja, id, hi, fr, es, de.
- xp3_sample.jsonl (~650000)
This dataset also contains a portion of the xp3 dataset converted into the standard Human/Assistant format.
See https://huggingface.co/datasets/bigscience/xP3 for the 43 languages covered by xp3.
- sungai_ul2_instructions.jsonl (~23000000)
This dataset also contains a UL2-like instruction set covering 140 languages, built from a subset of cc100, OSCAR and mc4.
You can find the individual datasets from which this UL2 version was created here: https://github.com/ontocord/sungai
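The `Human: instruction\nAssistant: response` format used by these files can be split back into instruction/response pairs with a few lines of Python. The following is a minimal sketch, not the project's own loader. It assumes each JSONL line is a JSON object whose conversation text lives under a `text` key; that key name is a guess and should be checked against the actual files. The helper names `split_turns` and `iter_records` are hypothetical, and the splitting is naive: an instruction that itself contains the string `Human:` would be split incorrectly.

```python
import json
from typing import Iterator

HUMAN, ASSISTANT = "Human:", "Assistant:"


def split_turns(text: str) -> list[tuple[str, str]]:
    """Split 'Human: ...\\nAssistant: ...' text into (instruction, response) pairs."""
    pairs = []
    # Everything after each "Human:" marker holds one instruction and,
    # if the turn is complete, one "Assistant:" response.
    for chunk in text.split(HUMAN)[1:]:
        instruction, sep, response = chunk.partition(ASSISTANT)
        if sep:  # skip turns that have no Assistant reply
            pairs.append((instruction.strip(), response.strip()))
    return pairs


def iter_records(path: str, field: str = "text") -> Iterator[list[tuple[str, str]]]:
    """Yield the parsed turns of every non-empty line in a JSONL file.

    `field` is an assumed key name, not one confirmed by the dataset card.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield split_turns(json.loads(line)[field])
```

Reading line by line keeps memory use flat, which matters for a file the size of `sungai_ul2_instructions.jsonl` (~23000000 records).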
## Disclaimer
- Translations may be inaccurate. The web text found in the UL2 file may contain inappropriate content, as it is based on web-scraped data.
- Translations were generated by M2M 12B, and the generated output was limited to 512 tokens due to the VRAM limit (40 GB).
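Because generation was capped at 512 tokens, some translated responses may stop mid-sentence. One rough way to flag such cases is sketched below. It is only a heuristic under stated assumptions: it does not count M2M tokens at all, it simply checks whether a response ends in closing punctuation, and both the `looks_truncated` name and the punctuation list are illustrative choices rather than anything taken from the dataset. Responses that legitimately end without punctuation (lists, code, single words) will be flagged too.

```python
# Sentence-final or closing punctuation for the scripts used by the
# cross-lingual languages (Latin, Cyrillic, CJK, Devanagari).
# This list is an assumption and is not exhaustive.
TERMINATORS = ".!?。！？।…\"')）」』”"


def looks_truncated(response: str) -> bool:
    """Return True for a non-empty response that does not end in closing punctuation."""
    stripped = response.rstrip()
    return bool(stripped) and stripped[-1] not in TERMINATORS
```

A filter like this is best used to surface candidates for manual review rather than to drop records automatically.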
## License
The Anh data authored by LAION volunteers is released under the Apache 2.0 license. However, the dataset also includes content licensed under other permissive licenses, as well as web-crawled data used under fair use principles.
## Acknowledgement
- Thanks to LAION's Anh multilingual chat team: @yp_yurilee, @cahya, @kevin ko, @lasse, @mattdf, @theblackcat102, @yongzx, @acul3, @logus2, @paulovn, and many others.
- Thanks to @rallio67 for the original English version of the cross_lingual dataset.
- Thanks to @theblackcat102 for his translations at https://huggingface.co/datasets/theblackcat102/instruction_translations, on which the cross-lingual data is based.
- Thanks to the authors of all the underlying datasets on which Anh is based, including the xp3, OSCAR, cc100 and mc4 authors.
Provider: maas
Created: 2025-10-15



