OpenLLM-France/Lucie-Training-Dataset
Source: Hugging Face | Updated: 2025-05-27 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/OpenLLM-France/Lucie-Training-Dataset
Description:
The Lucie Training Dataset is designed for training Large Language Models (LLMs). It contains a diverse collection of multilingual text, with a large share of French data intended to reduce Anglo-centric cultural bias. The dataset also includes source code in various programming languages, intended to strengthen the reasoning capabilities of LLMs. It has been rigorously filtered and deduplicated to ensure data quality and diversity. All newspapers, monographs, magazines, and legislative documents, as well as most books, are in the public domain or under permissive licenses.
Provider:
OpenLLM-France



