SYNTH
ModelScope community, updated 2026-01-02, indexed 2025-11-15
Download link:
https://modelscope.cn/datasets/PleIAs/SYNTH
# SYNTH
<div align="center">
<img src="figures/pleias.png" width="60%" alt="Pleias" />
</div>
<p align="center">
<a href="https://pleias.fr/blog/blogsynth-the-new-data-frontier"><b>Blog announcement</b></a>
</p>
**SYNTH** is the first open generalist synthetic dataset for training small reasoning models end-to-end, jointly released by Pleias and the AI Alliance.
SYNTH includes 79,648,272 individual text samples, comprising over 41 billion words (about 75 billion tokens with the Pleias tokenizer). It is based on the amplification of 58,698 articles from Wikipedia and was made possible by the *Structured Wikipedia* dataset from Wikimedia Enterprise.
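To make these headline figures easier to relate to each other, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above; the word count is treated as a lower bound ("over 41 billion"), and the per-seed ratio divides by the Wikipedia articles alone, so all three results are rough approximations rather than official statistics.

```python
# Figures quoted in the dataset card above.
N_SAMPLES = 79_648_272        # individual text samples
N_WORDS = 41e9                # "over 41 billion words", used as a lower bound
N_TOKENS = 75e9               # "about 75 billion tokens" with the Pleias tokenizer
N_WIKIPEDIA_SEEDS = 58_698    # amplified Wikipedia articles

# Derived ratios (approximate, since the inputs are rounded).
words_per_sample = N_WORDS / N_SAMPLES
tokens_per_word = N_TOKENS / N_WORDS
samples_per_seed = N_SAMPLES / N_WIKIPEDIA_SEEDS

print(f"words per sample : {words_per_sample:.0f}")
print(f"tokens per word  : {tokens_per_word:.2f}")
print(f"samples per seed : {samples_per_seed:.0f}")
```

This works out to roughly 515 words per sample, about 1.83 tokens per word, and on the order of 1,357 samples per Wikipedia seed article.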
SYNTH differs from existing open synthetic datasets in being:
* **fully open**: based on seed text under an open license (CC-BY-SA) and generated with models that allow output reuse. This means that SYNTH can be universally released and serve as a basis for further reproducible synthetic pipelines.
* **state of the art** for small models below 350 million parameters. We release two models trained on SYNTH that achieve the current best results for their size range on MMLU and other standard evaluation metrics.
* **data efficient**: the best results are attained with only 100-200 billion tokens trained on SYNTH.
* **reasoning by design**: all generated answers are accompanied by intermediate reasoning traces in an entirely new syntax.
* **diverse**: it comprises a wide range of exercises that cover many use cases of small models: retrieval-augmented generation, creative writing, arithmetic, information extraction, etc.
* **multilingual**: about 20% of all texts are in languages other than English, for now limited to European languages (German, French, Spanish, Italian, Polish, Dutch, Latin).
SYNTH is not only the name of a dataset but also an initiative for open synthetic data and open environments, led by the AI Alliance and Pleias. It aims to address a critical gap in open-source AI development by creating a cutting-edge, open-source data corpus for training sovereign AI models and advanced AI agents.
## Dataset Design
### Amplified knowledge
At its core, SYNTH is a fully synthetic and engineered corpus derived from a sample of 50,000 pages curated by the Wikipedia community. Over the past two decades, thousands of contributors have selected a collection of core topics that every encyclopedia should cover: Wikipedia:Vital articles. It is a concentric selection ranging from level 1 (10 articles) to level 5 (50,000 articles). SYNTH takes as its starting point all articles featured in level 5.
SYNTH further expands on this core nucleus with three additional seed collections:
* **specialized articles**: following an intermediate evaluation, we added 8,698 articles to reinforce coverage of specific fields such as law, medicine, and chemistry. Selection was based on category tree search analysis and aimed to fill the remaining gaps in knowledge coverage from Wikipedia:Vital articles.
* **textbooks**: Wikipedia articles focus on encyclopedic knowledge but lag on *practical* knowledge and *how-to* content, which happens to be the focus of another Wikimedia project, Wikibooks. For now we have included 3,727 pages on cooking from Wikibooks, and we look forward to expanding to additional forms of experiential knowledge (gardening, language acquisition, etc.).
* **recent/self knowledge**: we incorporated a small sample of 130 texts hand-crafted internally to expand model familiarity with recent events, self-awareness about training conditions, and general research information on AI. This collection has been highly amplified.
This content acts as the SYNTH memory base and has been amplified at least 100 times (about 10,000 times for recent/self knowledge). Our amplification strategy relies on a new synthetic pipeline, partly inspired by RAG applications:
* Selection of individual consistent **sections** from the original articles (about 250,000 for the core sample of 50,000 pages).
* Generation of queries with randomized constraints to vary style and query outcomes. Including enough negative queries proved especially important for reinforcing world knowledge and limiting hallucinations.
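The query-generation step above can be illustrated with a minimal sketch. This is not the Pleias pipeline: the `QuerySpec` type, the `STYLES` pool, the `negative_rate` parameter, and the example section title are all hypothetical names and values introduced here, since the card does not list the actual constraints. The sketch only shows the general idea of drawing randomized constraints per seed section, with a share of queries flagged as negative.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySpec:
    """Constraints for one query to generate from a seed section (hypothetical)."""
    section_title: str
    style: str
    answerable: bool  # False marks a "negative" query


# Hypothetical style pool; the real constraint vocabulary is not documented here.
STYLES = ["formal", "casual", "terse", "verbose"]


def sample_query_specs(section_title, n, negative_rate=0.2, seed=0):
    """Draw n randomized query specs for one section, reproducibly."""
    rng = random.Random(seed)
    return [
        QuerySpec(
            section_title=section_title,
            style=rng.choice(STYLES),
            # rng.random() is in [0, 1), so a rate of 0.0 yields no negatives
            # and a rate of 1.0 yields only negatives.
            answerable=rng.random() >= negative_rate,
        )
        for _ in range(n)
    ]


# 100 specs per section mirrors the minimum amplification factor quoted above.
specs = sample_query_specs("Example section title", n=100)
negatives = sum(not spec.answerable for spec in specs)
print(f"{len(specs)} specs, {negatives} negative")
```

Seeding the generator keeps a run reproducible, which matters if the constraints are later stored alongside each sample for auditing.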
### Synthetic exercises
The approach was originally explored by Pleias for retrieval-augmented generation. It has since been extended to most of the expected use cases of small reasoning models:
* **arithmetic**
* **creative writing**: we injected randomized constraints.
## Dataset Details
### Dataset Description
- **Curated by:** Wikipedia community (Wikipedia:Vital articles) and Pleias.
- **Funded by:** Pleias
- **Shared by:** Pleias
- **Language(s) (NLP):** English (80%), French, German, Italian, Spanish, Polish, Dutch and Latin.
- **License:**
### Dataset Sources
While the final training data is fully synthetic, it relied on seeds collected from three data sources:
- **[Structured Wikipedia](https://huggingface.co/datasets/wikimedia/structured-wikipedia):** we directly used the dumps made available by the Wikimedia Foundation.
- **Wikibooks:** extracted through the official Wikimedia API.
- **Internal documents from Pleias:** mostly model self-documentation and a few pieces of updated information.
## Uses
The dataset aims to support data-efficient training of small reasoning models. It provides a generalist, self-sufficient collection of multilingual amplified encyclopedic texts along with synthetic reasoning traces, as well as synthetic tasks that reinforce most of the expected capacities of small models.
In contrast with organic pretraining datasets, SYNTH allows for fast convergence to the existing SOTA (about 100 billion tokens). Furthermore, SYNTH is fully releasable, since it only uses source text under free licenses.
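To put the token budgets in perspective, the short calculation below converts them into passes over the corpus. It assumes the whole budget is drawn from SYNTH itself, uses the corpus size of about 75 billion tokens (Pleias tokenizer) and the 100-200 billion token range quoted earlier in this card; `passes_over_corpus` is a helper name introduced here for illustration.

```python
CORPUS_TOKENS = 75e9  # corpus size quoted in the card (Pleias tokenizer)


def passes_over_corpus(training_tokens: float,
                       corpus_tokens: float = CORPUS_TOKENS) -> float:
    """Number of full passes over the corpus implied by a token budget."""
    return training_tokens / corpus_tokens


for budget in (100e9, 200e9):
    print(f"{budget / 1e9:.0f}B tokens -> {passes_over_corpus(budget):.2f} passes")
```

Under these assumptions, the quoted budgets correspond to between one and three passes over the full dataset.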
Overall, SYNTH aims to support an emerging ecosystem of small-model training by providing a reusable, generalist foundational dataset.
### Direct Use
Direct uses include:
- **Pretraining of small reasoning models**: the dataset is sufficient to elicit most expected capacities in small models.
- **Mid-training/fine-tuning of existing models**: we have already run successful experiments with Pleias-350m.
- **Research/explainability experiments**: with its openness and data efficiency, SYNTH should be an ideal resource for research on model memorization or skill acquisition.
### Out-of-Scope Use
Current out-of-scope uses include:
- **Code generation**: we intentionally excluded code data from SYNTH, as this would require the development of a dedicated synthetic pipeline.
- **Global multilingual support**: SYNTH only claims support for our current list of eight languages.
- **Training of large models**: the difficulty of synthetic exercises has been calibrated for models smaller than a few billion parameters.
Yet SYNTH is a live resource, and we intend to cover some of these use cases in future releases.
## Dataset Structure
| Field | Type | Description |
| ----------------------- | -------- | ------------------------------------------------------------------------------------------------------------------- |
| **synth_id** | `string` | Unique synthetic identifier for each generated sample. |
| **language** | `string` | Language of the text sample (e.g., `"en"`, `"fr"`, `"it"`, `"es"`, `"de"`, `"pl"`, `"nl"`, `"la"`). |
| **exercise** | `string` | Type of synthetic exercise (e.g., reasoning, writing, retrieval, arithmetic). Describes the synthetic task context. |
| **model** | `string` | Fine-tuned model used to generate the synthetic sample. |
| **query** | `string` | Backtranslated query. |
| **query_seed_url** | `string` | URL of the Wikipedia or Wikibooks section that served as the seed for query generation. |
| **query_seed_text** | `string` | Extended text used as seed for query generation. |
| **additional_seed_url** | `string` | Optional additional URL(s) used as supplementary seed. |
| **seed_license** | `string` | License of the seed text (most of the time `"CC-BY-SA 4.0"`). |
| **constraints** | `string` | Generation constraints applied to answer generation. Varies depending on the exercise. |
| **script** | `string` | Internal template or script identifier defining the structure of the synthetic exercise. |
| **synthetic_reasoning** | `string` | Generated reasoning draft. |
| **synthetic_answer** | `string` | Final generated answer or output corresponding to the query. |
| **words** | `int64` | Word count of the full generated text sample (query + draft + answer). |
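As a reading aid for the table above, the following sketch mirrors the fields as a Python `TypedDict` and shows how rows could be filtered by language and exercise. The helper names (`SynthRecord`, `select`, `words_by_language`, `toy_record`) are introduced here for illustration, and the rows are invented placeholders, not real samples; in particular, the actual strings stored in `exercise` may differ from the ones used below.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, TypedDict


class SynthRecord(TypedDict):
    """One row, mirroring the field table above."""
    synth_id: str
    language: str
    exercise: str
    model: str
    query: str
    query_seed_url: str
    query_seed_text: str
    additional_seed_url: str
    seed_license: str
    constraints: str
    script: str
    synthetic_reasoning: str
    synthetic_answer: str
    words: int


def select(records: Iterable[SynthRecord],
           language: Optional[str] = None,
           exercise: Optional[str] = None) -> Iterator[SynthRecord]:
    """Lazily yield rows matching the given language and/or exercise."""
    for rec in records:
        if language is not None and rec["language"] != language:
            continue
        if exercise is not None and rec["exercise"] != exercise:
            continue
        yield rec


def words_by_language(records: Iterable[SynthRecord]) -> Counter:
    """Sum the `words` field per language code."""
    totals: Counter = Counter()
    for rec in records:
        totals[rec["language"]] += rec["words"]
    return totals


def toy_record(synth_id: str, language: str, exercise: str, words: int) -> SynthRecord:
    """Build a placeholder row; every value here is invented for illustration."""
    return SynthRecord(
        synth_id=synth_id, language=language, exercise=exercise,
        model="placeholder-model", query="placeholder query",
        query_seed_url="https://example.org/placeholder",
        query_seed_text="placeholder seed text", additional_seed_url="",
        seed_license="CC-BY-SA 4.0", constraints="", script="placeholder-script",
        synthetic_reasoning="placeholder draft",
        synthetic_answer="placeholder answer", words=words,
    )


rows = [
    toy_record("toy-1", "en", "retrieval", 420),
    toy_record("toy-2", "fr", "arithmetic", 180),
    toy_record("toy-3", "en", "arithmetic", 300),
]
french = list(select(rows, language="fr"))
totals = words_by_language(rows)
print(len(french), dict(totals))
```

Because `select` accepts any iterable of dict-like rows, it should also work on a streamed dataset, for example one opened with the Hugging Face `datasets` library via `load_dataset(..., streaming=True)`; the repository identifier and split name to pass are assumptions to check against the hosting page.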
## Dataset Creation
### Curation Rationale
SYNTH is structured around a “memory core”: the Wikipedia vital articles. Over the past two decades, thousands of contributors have selected a collection of core topics that every encyclopedia should cover. It is a concentric selection ranging from level 1 (10 articles) to level 5 (50,000 articles). SYNTH takes as its starting point all articles featured in level 5. It further expands on this selection by increasing coverage of more specialized domains (physics, chemistry, law…) through targeted expansion of Wikidata knowledge graphs.
### Source Data
The 58,698 Wikipedia articles were collected thanks to *Structured Wikipedia*, a project from Wikimedia Enterprise that directly parses rendered Wikipedia articles in HTML. Structured Wikipedia fixes most of the formatting issues linked to the MediaWiki syntax and provides a clean, section-based version of all Wikipedia pages.
We additionally extracted 3,000 cooking recipes from Wikibooks using the standard API method from Wikimedia.
#### Data Collection and Processing
#### Who are the source data producers?
The main source dataset used for synthetic amplification was curated by the English Wikipedia community over nearly two decades. The rationale for each selection is available on the relevant talk pages of Wikipedia:Vital articles.
The selection reflects a bias toward "canon" general knowledge in English-speaking countries, similar to that of major LLM benchmarks like MMLU (drawn from high school exams).
#### Personal and Sensitive Information
The dataset only contains encyclopedic information on well-known historical people. No PII curation was needed.
## Bias, Risks, and Limitations
The dataset was created from a collection of 50,000 Wikipedia articles curated by the community (Wikipedia:Vital articles).
On top of the well-documented structural bias in Wikipedia contribution and editing, the selection has been intentionally made from the perspective of Western (US/European) culture.
Due to systematic Wikipedia grounding, the data presents a very low risk of toxic or problematic content, as well as of low-quality or heavily hallucinated information.
Provider:
maas
Created:
2025-11-11



