tomg-group-umd/huginn-dataset
Hugging Face. Updated 2025-07-15. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/tomg-group-umd/huginn-dataset
Description:
The Huginn dataset is the data mixture used to train the `huginn-0125` model. It combines many source datasets, with an emphasis on code and mathematical reasoning data, and draws on both standard and synthetic sources as well as specialized datasets for instruction following and mathematical reasoning. The data is provided in a semi-prepared format: 4096 parquet files for training and validation, in which each row contains 4097 tokens. The mixture is designed to maximize the potential for reasoning behavior to emerge while still letting the model acquire standard language modeling abilities.
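The 4097-token row length suggests a 4096-token context plus one extra token for the shifted next-token targets. The listing does not state this, so treat it as an assumption. A minimal sketch of how such a row could be split for language-model training follows; `split_row` is a hypothetical helper, and the stand-in row replaces real token ids read from a parquet file.

```python
ROW_LENGTH = 4097  # tokens per row, as stated in the description


def split_row(token_ids):
    """Split one row into (inputs, targets) for next-token prediction.

    Assumption: the extra token exists so that inputs and targets,
    offset by one position, both have ROW_LENGTH - 1 entries.
    """
    if len(token_ids) != ROW_LENGTH:
        raise ValueError(f"expected {ROW_LENGTH} tokens, got {len(token_ids)}")
    inputs = token_ids[:-1]   # positions 0 .. 4095
    targets = token_ids[1:]   # positions 1 .. 4096
    return inputs, targets


# Stand-in for one row of real token ids.
row = list(range(ROW_LENGTH))
inputs, targets = split_row(row)
```

Under this assumption, each row yields one full 4096-token training sequence with no padding.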
Provided by:
tomg-group-umd



