The-Tome

Name: The-Tome
Creator: maas
Published: 2025-11-12 16:16:14
License: No description available

Source: ModelScope community. Updated: 2025-11-12. Indexed: 2024-08-31.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/The-Tome

Description:

<div align="center"> <img src="https://i.ibb.co/0jqCGH6/LEW5-CGBKRv-CWKNf-KYkf-k-Q.jpg" alt="The Tome" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 100%; height: auto;"> </div>

The Tome is a curated dataset designed for training large language models with a focus on instruction following. It was used to train our Arcee-Nova and Arcee-Spark models, which were later merged with Qwen2-72B-Instruct (or the 7B variant, in the case of the Spark model).

## Dataset Composition

- **Total Samples**: 1.75 million
- **Source**: Compiled from 9 publicly available datasets

The Tome comprises the following datasets:

```markdown
arcee-ai/infini-instruct-top-500k (BAAI/Infinity-Instruct)
TIGER-Lab/WebInstructSub (top-500k)
jondurbin/airoboros-3.2
gardner/glaive-function-calling-v2-sharegpt
arcee-ai/reasoning-sharegpt (SkunkworksAI/reasoning-0.01)
arcee-ai/self-instruct-sharegpt (bigcode/self-oss-instruct-sc2-exec-filter-50k)
cognitivecomputations/ultrainteract_trajectories_sharegpt
cognitivecomputations/SystemChat-2.0
arcee-ai/qwen2-72b-magpie-en
```

## Curation Process

The dataset underwent a curation process to ensure high-quality content:

1. **Reranker**: A custom reranker was applied to Infini-Instruct and WebInstruct to score instruction following.
2. **Educational Value Scoring**: The fineweb-edu classifier was applied to Infini-Instruct and WebInstruct.
3. **Composite Scoring**: Scores from the custom reranker and the fineweb-edu classifier were averaged.

## Usage in Model Training

The Tome was instrumental in the development of the Nova model, which was subsequently merged with Qwen2-72B-Instruct:

- **Merge Process**:
  - Lower layers primarily from Qwen2-72B-Instruct
  - Higher layers primarily from Nova-Premerge
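The composite scoring step of the curation process can be illustrated with a minimal sketch. The card only states that the reranker and fineweb-edu scores were averaged. Everything else here is an assumption made for illustration: the `ScoredSample` type, the field names, the premise that both scores are already on a common 0 to 1 scale, and the top-k selection (suggested by the "top-500k" subset names but not confirmed by the card).

```python
from dataclasses import dataclass


@dataclass
class ScoredSample:
    """Hypothetical container; the real pipeline's data layout is not documented."""
    text: str
    reranker_score: float  # assumed: instruction-following score, normalized to [0, 1]
    edu_score: float       # assumed: educational-value score, normalized to [0, 1]

    @property
    def composite(self) -> float:
        # Plain average of the two scores, as described in the curation steps.
        return (self.reranker_score + self.edu_score) / 2.0


def select_top_k(samples, k):
    """Keep the k samples with the highest composite score (assumed selection rule)."""
    return sorted(samples, key=lambda s: s.composite, reverse=True)[:k]


samples = [
    ScoredSample("a", reranker_score=0.9, edu_score=0.5),
    ScoredSample("b", reranker_score=0.4, edu_score=0.4),
    ScoredSample("c", reranker_score=0.8, edu_score=0.8),
]
top = select_top_k(samples, k=2)
print([s.text for s in top])
```

If the two scorers report on different scales, the scores would need to be normalized before averaging; otherwise the scorer with the wider range dominates the composite.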

Provider:

maas

Created:

2024-07-24
