SynthCodeNet

Name: SynthCodeNet
Creator: maas
Published: 2026-01-06 16:40:47
License: No description available

ModelScope community; updated 2026-01-06, indexed 2025-08-02

Download link:

https://modelscope.cn/datasets/ds4sd/SynthCodeNet


Resource description:

# SynthCodeNet

<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/663e1254887b6f5645a0399f/whc8Bpip5P8uuzZOS0MQJ.png" alt="Code Example" style="width: 500px; height: auto">
</div>

**SynthCodeNet** is a multimodal dataset created for training the **SmolDocling** model. It consists of over **9.3 million** synthetically generated image-text pairs, covering code snippets from **56** different programming languages. Text data was sourced from permissively licensed sources, while images were synthetically generated at 120 DPI using LaTeX and Pygments to ensure visual diversity.

---

## Dataset Statistics

* **Total samples**: 9,334,257
  * **Training set**: 8,400,838
  * **Validation set**: 466,703
  * **Test set**: 466,716
* **Modalities**: Image, Text
* **Image Generation**: Synthetic (LaTeX, Pygments)

### Programming Languages & Sample Counts

| Language | Samples | Language   | Samples | Language    | Samples   |
| -------- | ------- | ---------- | ------- | ----------- | --------- |
| Ada      | 20,094  | Dart       | 20,415  | Matlab      | 1,170     |
| Awk      | 22,334  | Dockerfile | 99,459  | MoonScript  | 6,237     |
| Bash     | 98,950  | Elixir     | 20,387  | Nim         | 37,236    |
| C        | 599,096 | Erlang     | 20,039  | OCaml       | 32,297    |
| C#       | 303,720 | FORTRAN    | 34,023  | ObjectiveC  | 158,398   |
| C++      | 698,870 | Forth      | 5,548   | Octave      | 2,537     |
| CMake    | 19,910  | Go         | 333,722 | PHP         | 249,566   |
| COBOL    | 5,153   | HTML       | 245,228 | Pascal      | 28,254    |
| CSS      | 236,596 | Haskell    | 39,848  | Perl        | 33,938    |
| Ceylon   | 8,369   | Haxe       | 20,070  | Prolog      | 2,058     |
| Clojure  | 20,765  | Java       | 698,421 | Python      | 1,797,063 |
| Crystal  | 24,720  | JavaScript | 530,899 | Racket      | 4,340     |
| Cuda     | 142,344 | Julia      | 29,681  | Ruby        | 348,976   |
| Cython   | 22,136  | Kotlin     | 292,986 | Rust        | 344,491   |
| D        | 20,338  | Lisp       | 29,749  | SML         | 19,333    |
| Lua      | 25,328  | SQL        | 493,412 | YAML        | 249,011   |
| Scala    | 273,825 | Scheme     | 23,242  | VisualBasic | 13,908    |
| Swift    | 25,374  | TypeScript | 255,475 | XML         | 246,209   |
| bc       | 249     | dc         | 1,713   |             |           |

---

## Data Format

Each dataset entry is structured as follows:

```json
{
  "images": [PIL Image],
  "texts": [
    {
      "assistant": "<loc_x0><loc_y0><loc_x1><loc_y1><_Language_>CODE_SNIPPET</code>",
      "source": "SynthCodeNetNoImageTag",
      "user": "<code>"
    }
  ]
}
```

---

## Intended Use

* Training multimodal models for **document understanding**, specifically:
  * Code snippet extraction and transcription

---

## Citation

If you use SynthCodeNet, please cite:

```bibtex
@article{nassar2025smoldocling,
  title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
  author={Nassar, Ahmed and Marafioti, Andres and Omenetti, Matteo and Lysak, Maksym and Livathinos, Nikolaos and Auer, Christoph and Morin, Lucas and de Lima, Rafael Teixeira and Kim, Yusik and Gurbuz, A Said and others},
  journal={arXiv preprint arXiv:2503.11576},
  year={2025}
}
```
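
The `assistant` string described under Data Format packs three things into one field: a bounding box, a language tag, and the code itself. The sketch below shows one way to split it apart. It assumes that the `x0`, `y0`, `x1`, `y1` placeholders are filled with non-negative integers in real samples and that the string always ends with `</code>`; neither detail is confirmed by the dataset card. The names `CodeAnnotation` and `parse_assistant` are hypothetical helpers, not part of any SynthCodeNet or SmolDocling API.

```python
import re
from typing import NamedTuple, Optional, Tuple


class CodeAnnotation(NamedTuple):
    """Parsed view of one `assistant` string (hypothetical helper type)."""
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) as given by the loc tags
    language: str                    # e.g. "Python", "C++"
    code: str                        # the transcribed snippet, unchanged


# Assumption: each <loc_*> tag carries an integer; the card only shows placeholders.
_ASSISTANT_RE = re.compile(
    r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"  # four location tags
    r"<_([^<>]+?)_>"                                 # language tag, e.g. <_C++_>
    r"(.*)</code>",                                  # snippet up to the closing tag
    re.DOTALL,                                       # snippets span multiple lines
)


def parse_assistant(text: str) -> Optional[CodeAnnotation]:
    """Return the parsed annotation, or None if the string does not fit the layout."""
    match = _ASSISTANT_RE.fullmatch(text.strip())
    if match is None:
        return None
    x0, y0, x1, y1 = (int(value) for value in match.groups()[:4])
    return CodeAnnotation((x0, y0, x1, y1), match.group(5), match.group(6))


if __name__ == "__main__":
    # Made-up sample that follows the documented layout; not taken from the dataset.
    sample = "<loc_10><loc_20><loc_300><loc_400><_C++_>int main() {\n  return 0;\n}</code>"
    print(parse_assistant(sample))
```

Returning `None` for strings that do not fit lets a caller count and inspect malformed entries rather than fail partway through a split of several million samples.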


Provider: maas

Created: 2025-08-01
