Anh

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/laion/Anh
Description:
## Anh multilingual chat dataset

This dataset contains about 24M multilingual synthetic instructions intended for continued pretraining and finetuning of a chatbot.

- cross_lingual.jsonl (~800000)

  This file contains both the multilingual and the cross-lingual versions of the Anh data in the form `Human: instruction\nAssistant: response`, as described here: https://github.com/LAION-AI/Anh/tree/main/data. The data is translated from a portion of the OIG dataset, which includes synthic_qa, prosocial and anthropic data. Read more about the data in LAION's OIG hf repo. It covers these languages: zh, vi, ru, ms, pt, ja, id, hi, fr, es, de.

- xp3_sample.jsonl (~650000)

  This file contains a portion of the xp3 dataset converted into the standard Human/Assistant format. See https://huggingface.co/datasets/bigscience/xP3 for the 43 languages covered by xp3.

- sungai_ul2_instructions.jsonl (~23000000)

  This file contains a UL2-like instruction set covering 140 languages, built from a subset of cc100, OSCAR and mc4. The individual datasets from which this UL2 version was created are listed here: https://github.com/ontocord/sungai

## Disclaimer

- Translations may be inaccurate. The web text found in the UL2 file may contain inappropriate content, as it is based on web-scraped data.
- Translations were generated by M2M 12B, and the output generations were limited to 512 tokens because of a VRAM limit (40G).

## License

The Anh dataset authored by LAION volunteers is released under an Apache 2.0 license. However, the data also includes content licensed under other permissive licenses, as well as web-crawled data used under fair use principles.

## Acknowledgement

- Thanks to LAION's Anh multilingual chat team: @yp_yurilee, @cahya, @kevin ko, @lasse, @mattdf, @theblackcat102, @yongzx, @acul3, @logus2, @paulovn, and many others.
- Thanks to @rallio67 for the original English version of the cross_lingual dataset.
- Thanks to @theblackcat102 for his translations at https://huggingface.co/datasets/theblackcat102/instruction_translations, on which the cross-lingual data is based.
- Thanks to the authors of all the underlying datasets on which Anh is based, including the xp3, OSCAR, cc100 and mc4 authors.
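
As an illustration of the `Human: instruction\nAssistant: response` format used by the `.jsonl` files described above, here is a minimal Python sketch that splits such a string into speaker turns. The JSONL field name (`"text"`) and the helper names `split_turns` and `iter_dialogues` are assumptions made for this example; the description above does not state the record schema, so check an actual record before relying on it.

```python
import json
import re
from typing import Iterable, Iterator

# Assumption: each JSONL record stores the dialogue under a "text" key.
# The dataset card does not document the schema; adjust as needed.
TEXT_KEY = "text"

# A turn starts at the beginning of a line with "Human: " or "Assistant: ".
_TURN = re.compile(r"^(Human|Assistant): ", flags=re.MULTILINE)


def split_turns(dialogue: str) -> list[tuple[str, str]]:
    """Split 'Human: ...\\nAssistant: ...' text into (speaker, utterance) pairs."""
    parts = _TURN.split(dialogue)
    # With one capture group, re.split returns
    # [prefix, speaker, text, speaker, text, ...].
    speakers, texts = parts[1::2], parts[2::2]
    return [(speaker, text.strip()) for speaker, text in zip(speakers, texts)]


def iter_dialogues(
    lines: Iterable[str], text_key: str = TEXT_KEY
) -> Iterator[list[tuple[str, str]]]:
    """Yield the parsed turns for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield split_turns(record[text_key])
```

To read a file, an open file handle can be passed in directly, for example `with open("cross_lingual.jsonl", encoding="utf-8") as f: turns = list(iter_dialogues(f))`. Given the sizes listed above (roughly 23M lines for the UL2 file), streaming through the iterator rather than collecting into a list is likely the safer choice.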
Provider: maas
Created: 2025-10-15