
The-Tome

ModelScope community. Updated 2025-11-12; indexed 2024-08-31.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/The-Tome
Description:
<div align="center"> <img src="https://i.ibb.co/0jqCGH6/LEW5-CGBKRv-CWKNf-KYkf-k-Q.jpg" alt="The Tome" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 100%; height: auto;"> </div>

The Tome is a curated dataset for training large language models, with a focus on instruction following. It was used to train our Arcee-Nova and Arcee-Spark models, which were later merged with Qwen2-72B-Instruct (or, for Spark, with the 7B model).

## Dataset Composition

- **Total Samples**: 1.75 million
- **Source**: Compiled from 9 publicly available datasets

The Tome is composed of the following datasets:

```markdown
arcee-ai/infini-instruct-top-500k (BAAI/Infinity-Instruct)
TIGER-Lab/WebInstructSub (top-500k)
jondurbin/airoboros-3.2
gardner/glaive-function-calling-v2-sharegpt
arcee-ai/reasoning-sharegpt (SkunkworksAI/reasoning-0.01)
arcee-ai/self-instruct-sharegpt (bigcode/self-oss-instruct-sc2-exec-filter-50k)
cognitivecomputations/ultrainteract_trajectories_sharegpt
cognitivecomputations/SystemChat-2.0
arcee-ai/qwen2-72b-magpie-en
```

## Curation Process

The dataset underwent a curation process to ensure high-quality content:

1. **Reranker**: A custom reranker scored instruction following on Infini-Instruct and WebInstruct.
2. **Educational Value Scoring**: The fineweb-edu classifier scored Infini-Instruct and WebInstruct.
3. **Composite Scoring**: Scores from the custom reranker and the fineweb-edu classifier were averaged.

## Usage in Model Training

The Tome was instrumental in the development of the Nova model, which was subsequently merged with Qwen2-72B-Instruct:

- **Merge Process**:
  - Lower layers primarily from Qwen2-72B-Instruct
  - Higher layers primarily from Nova-Premerge
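The composite scoring step described under Curation Process can be illustrated with a minimal sketch. The card says only that the two scores were averaged; it does not say how they were normalized or how the cut-off was chosen. The sketch therefore assumes both scores are already on a common [0, 1] scale and that the highest-scoring samples are kept. The `ScoredSample` class, its field names, and the `top_k` helper are hypothetical and are not part of any released tooling.

```python
from dataclasses import dataclass


@dataclass
class ScoredSample:
    text: str
    reranker_score: float  # hypothetical: instruction-following score, assumed in [0, 1]
    edu_score: float       # hypothetical: educational-value score, assumed rescaled to [0, 1]

    @property
    def composite(self) -> float:
        # Plain arithmetic mean of the two scores, as the card describes.
        return (self.reranker_score + self.edu_score) / 2.0


def top_k(samples: list[ScoredSample], k: int) -> list[ScoredSample]:
    """Return the k samples with the highest composite score (assumed selection rule)."""
    return sorted(samples, key=lambda s: s.composite, reverse=True)[:k]


# Toy data for illustration only; these are not real scores.
samples = [
    ScoredSample("a", 0.9, 0.5),
    ScoredSample("b", 0.4, 0.4),
    ScoredSample("c", 0.8, 0.8),
]
selected = top_k(samples, k=2)
```

Under these assumptions, a sample with a strong reranker score but weak educational value can still rank below one that is moderately strong on both, which is the practical effect of averaging rather than taking the maximum.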

Provider:
maas
Created:
2024-07-24