SynthCodeNet
Source: ModelScope community (updated 2026-01-06, indexed 2025-08-02)
Download link: https://modelscope.cn/datasets/ds4sd/SynthCodeNet
# SynthCodeNet
<div style="display: flex; justify-content: center; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/663e1254887b6f5645a0399f/whc8Bpip5P8uuzZOS0MQJ.png" alt="Code Example" style="width: 500px; height: auto">
</div>
**SynthCodeNet** is a multimodal dataset created for training the **SmolDocling** model. It consists of over **9.3 million** synthetically generated image-text pairs covering code snippets in **56** programming languages. The code text comes from permissively licensed sources, and the images were rendered synthetically at 120 DPI using LaTeX and Pygments to ensure visual diversity.

---
## Dataset Statistics
* **Total samples**: 9,334,257
* **Training set**: 8,400,838
* **Validation set**: 466,703
* **Test set**: 466,716
* **Modalities**: Image, Text
* **Image Generation**: Synthetic (LaTeX, Pygments)
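
As a quick consistency check on the figures above, the following minimal sketch confirms that the three splits add up to the stated total and reports each split's share. The names `SPLITS`, `TOTAL`, and `split_fractions` are illustrative only and are not part of the dataset or any library.

```python
# Split sizes and total copied from the statistics listed above.
SPLITS = {"train": 8_400_838, "validation": 466_703, "test": 466_716}
TOTAL = 9_334_257


def split_fractions(splits):
    """Return each split's share of the combined sample count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


fractions = split_fractions(SPLITS)

# The three splits should account for every sample exactly once.
assert sum(SPLITS.values()) == TOTAL

for name, frac in fractions.items():
    print(f"{name:>10}: {frac:.2%}")
```

The shares work out to roughly 90% training, 5% validation, and 5% test.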
### Programming Languages & Sample Counts
| Language | Samples | Language | Samples | Language | Samples |
| -------- | ------- | ---------- | ------- | ----------- | --------- |
| Ada | 20,094 | Dart | 20,415 | Matlab | 1,170 |
| Awk | 22,334 | Dockerfile | 99,459 | MoonScript | 6,237 |
| Bash | 98,950 | Elixir | 20,387 | Nim | 37,236 |
| C | 599,096 | Erlang | 20,039 | OCaml | 32,297 |
| C# | 303,720 | FORTRAN | 34,023 | ObjectiveC | 158,398 |
| C++ | 698,870 | Forth | 5,548 | Octave | 2,537 |
| CMake | 19,910 | Go | 333,722 | PHP | 249,566 |
| COBOL | 5,153 | HTML | 245,228 | Pascal | 28,254 |
| CSS | 236,596 | Haskell | 39,848 | Perl | 33,938 |
| Ceylon | 8,369 | Haxe | 20,070 | Prolog | 2,058 |
| Clojure | 20,765 | Java | 698,421 | Python | 1,797,063 |
| Crystal | 24,720 | JavaScript | 530,899 | Racket | 4,340 |
| Cuda | 142,344 | Julia | 29,681 | Ruby | 348,976 |
| Cython | 22,136 | Kotlin | 292,986 | Rust | 344,491 |
| D | 20,338 | Lisp | 29,749 | SML | 19,333 |
| Lua | 25,328 | SQL | 493,412 | YAML | 249,011 |
| Scala | 273,825 | Scheme | 23,242 | VisualBasic | 13,908 |
| Swift | 25,374 | TypeScript | 255,475 | XML | 246,209 |
| bc | 249 | dc | 1,713 | | |
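
The counts above are heavily skewed (Python has about 1.8 million samples, `bc` has 249), so it can be useful to have them in a dictionary, for example to derive sampling weights. The sketch below parses rows laid out like the table above, where cells alternate between language name and count. The helper name `parse_language_rows` is hypothetical, and only three rows are included for brevity.

```python
def parse_language_rows(rows):
    """Turn '| Lang | N | Lang | N | Lang | N |' rows into {language: count}."""
    counts = {}
    for row in rows:
        # Drop the outer pipes, then split into trimmed cells.
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        # Cells alternate language, count; empty padding pairs are skipped.
        for lang, num in zip(cells[0::2], cells[1::2]):
            if lang and num:
                counts[lang] = int(num.replace(",", ""))
    return counts


# Three rows copied from the table above, including the padded last row.
ROWS = [
    "| Ada | 20,094 | Dart | 20,415 | Matlab | 1,170 |",
    "| C++ | 698,870 | Forth | 5,548 | Octave | 2,537 |",
    "| bc | 249 | dc | 1,713 | | |",
]

counts = parse_language_rows(ROWS)
print(counts)
```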
---
## Data Format
Each dataset entry is structured as follows:
```json
{
  "images": [PIL Image],
  "texts": [
    {
      "assistant": "<loc_x0><loc_y0><loc_x1><loc_y1><_Language_>CODE_SNIPPET</code>",
      "source": "SynthCodeNetNoImageTag",
      "user": "<code>"
    }
  ]
}
```
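
The `assistant` string packs a bounding box, a language tag, and the code into one field. The sketch below shows one way to split it apart. It assumes the placeholders resolve to integer tokens such as `<loc_12>` and to a tag such as `<_Python_>`; the card shows only the placeholder form, so these concrete shapes, the function name `parse_assistant`, and the sample string are assumptions for illustration.

```python
import re

# Assumed token shapes: <loc_N> with integer N for the four coordinates,
# <_Name_> for the language tag, and a closing </code> after the snippet.
ASSISTANT_RE = re.compile(
    r"^<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
    r"<_(.+?)_>"
    r"(.*)</code>$",
    re.DOTALL,  # code snippets usually span several lines
)


def parse_assistant(text):
    """Split an assistant string into bbox, language, and code, or return None."""
    match = ASSISTANT_RE.match(text)
    if match is None:
        return None
    x0, y0, x1, y1 = (int(value) for value in match.groups()[:4])
    return {
        "bbox": (x0, y0, x1, y1),
        "language": match.group(5),
        "code": match.group(6),
    }


# Hypothetical sample that follows the template shown above.
sample = "<loc_12><loc_30><loc_480><loc_210><_Python_>print('hi')\n</code>"
parsed = parse_assistant(sample)
print(parsed)
```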
---
## Intended Use
* Training multimodal models for **document understanding**, specifically:
  * Code snippet extraction and transcription
---
## Citation
If you use SynthCodeNet, please cite:
```bibtex
@article{nassar2025smoldocling,
  title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
  author={Nassar, Ahmed and Marafioti, Andres and Omenetti, Matteo and Lysak, Maksym and Livathinos, Nikolaos and Auer, Christoph and Morin, Lucas and de Lima, Rafael Teixeira and Kim, Yusik and Gurbuz, A Said and others},
  journal={arXiv preprint arXiv:2503.11576},
  year={2025}
}
```
Provider: maas
Created: 2025-08-01



