The-Tome
Source: ModelScope community · Updated 2025-11-12 · Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/The-Tome
Resource overview:
<div align="center">
<img src="https://i.ibb.co/0jqCGH6/LEW5-CGBKRv-CWKNf-KYkf-k-Q.jpg" alt="The Tome" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 100%; height: auto;">
</div>
The Tome is a curated dataset designed for training large language models with a focus on instruction following. It was used to train our Arcee-Nova/Spark models, which were later merged with Qwen2-72B-Instruct (or with the 7B model, in the case of Spark).
## Dataset Composition
- **Total Samples**: 1.75 million
- **Source**: Compiled from 9 publicly available datasets
The Tome is composed of the following datasets:
```markdown
arcee-ai/infini-instruct-top-500k (BAAI/Infinity-Instruct)
TIGER-Lab/WebInstructSub (top-500k)
jondurbin/airoboros-3.2
gardner/glaive-function-calling-v2-sharegpt
arcee-ai/reasoning-sharegpt (SkunkworksAI/reasoning-0.01)
arcee-ai/self-instruct-sharegpt (bigcode/self-oss-instruct-sc2-exec-filter-50k)
cognitivecomputations/ultrainteract_trajectories_sharegpt
cognitivecomputations/SystemChat-2.0
arcee-ai/qwen2-72b-magpie-en
```
## Curation Process
The dataset underwent a curation process to ensure high-quality content:
1. **Reranker**: A custom reranker scored samples from Infini-Instruct and WebInstruct for instruction following.
2. **Educational Value Scoring**: The fineweb-edu classifier scored samples from Infini-Instruct and WebInstruct.
3. **Composite Scoring**: Scores from the custom reranker and fineweb-edu classifier were averaged.
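
The composite scoring step can be illustrated with a minimal sketch. It assumes both scores have already been normalised to a common range; the field names `reranker_score` and `edu_score`, the `ScoredSample` type, and the top-k selection rule are hypothetical and not taken from the actual curation pipeline.

```python
from dataclasses import dataclass


@dataclass
class ScoredSample:
    text: str
    reranker_score: float  # hypothetical: instruction-following score in [0, 1]
    edu_score: float       # hypothetical: fineweb-edu classifier score in [0, 1]


def composite_score(sample: ScoredSample) -> float:
    """Average the reranker and educational-value scores."""
    return (sample.reranker_score + sample.edu_score) / 2.0


def select_top_k(samples: list[ScoredSample], k: int) -> list[ScoredSample]:
    """Keep the k samples with the highest composite score (assumed selection rule)."""
    return sorted(samples, key=composite_score, reverse=True)[:k]
```

Under these assumptions, something like `select_top_k(samples, 500_000)` would yield a "top-500k" subset of the kind named in the dataset list.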
## Usage in Model Training
The Tome was instrumental in the development of the Nova model, which was subsequently merged with Qwen2-72B-Instruct:
- **Merge Process**:
  - Lower layers primarily from Qwen2-72B-Instruct
  - Higher layers primarily from Nova-Premerge
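
The card does not state the merge method or the per-layer ratios. As a rough illustration only, the sketch below assumes a simple linear interpolation whose weight ramps with depth, so that the lowest layer comes entirely from the base model and the highest entirely from the fine-tuned one. The function names and the linear ramp are hypothetical, and each layer is represented as a flat list of floats to keep the example self-contained.

```python
def tuned_weight(layer_idx: int, num_layers: int) -> float:
    """Fraction taken from the fine-tuned model: 0.0 at the bottom layer,
    1.0 at the top layer (assumed linear ramp, not the documented recipe)."""
    if num_layers < 2:
        return 0.5
    return layer_idx / (num_layers - 1)


def merge_layers(base_layers, tuned_layers):
    """Interpolate two models layer by layer.

    base_layers stands in for Qwen2-72B-Instruct and tuned_layers for
    Nova-Premerge; each is a list of layers, each layer a list of floats.
    """
    if len(base_layers) != len(tuned_layers):
        raise ValueError("both models must have the same number of layers")
    n = len(base_layers)
    merged = []
    for i, (base, tuned) in enumerate(zip(base_layers, tuned_layers)):
        t = tuned_weight(i, n)
        merged.append([(1.0 - t) * b + t * w for b, w in zip(base, tuned)])
    return merged
```

In practice a merge like this would operate on full weight tensors through a dedicated merging tool rather than on Python lists; the sketch only shows the depth-dependent weighting idea.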
Provider:
maas
Created:
2024-07-24



