aurora-m/aurora-m2
Source: Hugging Face. Updated 2025-10-11; indexed 2025-10-18.
Download link:
https://hf-mirror.com/datasets/aurora-m/aurora-m2
Description:
Aurora-M2 is a multilingual, synthetic, decontaminated pre-training dataset built on the MixtureVitae dataset. It draws on government websites, CC-BY licensed content, and synthetic data derived from permissively licensed sources. It is intended for pretraining foundation LLMs and covers topics such as business, politics, science, and technology. A significant share of the data is synthetic, including code, math, and science reasoning traces. The README discusses copyright and licensing, describing the potential risks and the measures taken to mitigate them. It notes that certain tokens are used to separate documents and suggests replacing them with the appropriate tokens from the target tokenizer. The current release is a working version, not the final one; an aligned version with more rigorous debiasing and anonymization is planned. The README closes by stressing the importance of consulting legal experts about any copyright risk in using the dataset.
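The separator-token replacement mentioned in the description could look roughly like the sketch below. The separator strings `<|doc_sep|>` and `<|doc_end|>` and the target token `</s>` are placeholders assumed for illustration only; the actual separators are the ones listed in the dataset README, and the replacement should be whatever end-of-document token the target tokenizer defines.

```python
# Hypothetical separator strings, used only for illustration.
# The real ones are documented in the dataset README.
SOURCE_SEPARATORS = ("<|doc_sep|>", "<|doc_end|>")


def replace_separators(text: str, target_token: str,
                       separators=SOURCE_SEPARATORS) -> str:
    """Replace every source separator in `text` with `target_token`."""
    for sep in separators:
        text = text.replace(sep, target_token)
    return text


# Example: swap the placeholder separators for an assumed EOS token "</s>".
sample = "first document<|doc_sep|>second document<|doc_end|>"
cleaned = replace_separators(sample, "</s>")
print(cleaned)
```

Doing the replacement on raw text before tokenization keeps the step independent of any particular tokenizer library; the only input that changes per model is `target_token`.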
Provider:
aurora-m



