SynthCodeNet

ModelScope community · Updated 2026-01-06 · Indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/ds4sd/SynthCodeNet
Resource overview:
# SynthCodeNet

<div style="display: flex; justify-content: center; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/663e1254887b6f5645a0399f/whc8Bpip5P8uuzZOS0MQJ.png" alt="Code Example" style="width: 500px; height: auto">
</div>

**SynthCodeNet** is a multimodal dataset created for training the **SmolDocling** model. It consists of over **9.3 million** synthetically generated image-text pairs, covering code snippets from **56** different programming languages. Text data was sourced from permissively licensed sources, while images were synthetically generated at 120 DPI using LaTeX and Pygments to ensure visual diversity.

---

## Dataset Statistics

* **Total samples**: 9,334,257
  * **Training set**: 8,400,838
  * **Validation set**: 466,703
  * **Test set**: 466,716
* **Modalities**: Image, Text
* **Image Generation**: Synthetic (LaTeX, Pygments)

### Programming Languages & Sample Counts

| Language | Samples | Language   | Samples | Language    | Samples   |
| -------- | ------- | ---------- | ------- | ----------- | --------- |
| Ada      | 20,094  | Dart       | 20,415  | Matlab      | 1,170     |
| Awk      | 22,334  | Dockerfile | 99,459  | MoonScript  | 6,237     |
| Bash     | 98,950  | Elixir     | 20,387  | Nim         | 37,236    |
| C        | 599,096 | Erlang     | 20,039  | OCaml       | 32,297    |
| C#       | 303,720 | FORTRAN    | 34,023  | ObjectiveC  | 158,398   |
| C++      | 698,870 | Forth      | 5,548   | Octave      | 2,537     |
| CMake    | 19,910  | Go         | 333,722 | PHP         | 249,566   |
| COBOL    | 5,153   | HTML       | 245,228 | Pascal      | 28,254    |
| CSS      | 236,596 | Haskell    | 39,848  | Perl        | 33,938    |
| Ceylon   | 8,369   | Haxe       | 20,070  | Prolog      | 2,058     |
| Clojure  | 20,765  | Java       | 698,421 | Python      | 1,797,063 |
| Crystal  | 24,720  | JavaScript | 530,899 | Racket      | 4,340     |
| Cuda     | 142,344 | Julia      | 29,681  | Ruby        | 348,976   |
| Cython   | 22,136  | Kotlin     | 292,986 | Rust        | 344,491   |
| D        | 20,338  | Lisp       | 29,749  | SML         | 19,333    |
| Lua      | 25,328  | SQL        | 493,412 | YAML        | 249,011   |
| Scala    | 273,825 | Scheme     | 23,242  | VisualBasic | 13,908    |
| Swift    | 25,374  | TypeScript | 255,475 | XML         | 246,209   |
| bc       | 249     | dc         | 1,713   |             |           |

---

## Data Format

Each dataset entry is structured as follows:

```json
{
  "images": [PIL Image],
  "texts": [
    {
      "assistant": "<loc_x0><loc_y0><loc_x1><loc_y1><_Language_>CODE_SNIPPET</code>",
      "source": "SynthCodeNetNoImageTag",
      "user": "<code>"
    }
  ]
}
```

---

## Intended Use

* Training multimodal models for **document understanding**, specifically:
  * Code snippet extraction and transcription

---

## Citation

If you use SynthCodeNet, please cite:

```bibtex
@article{nassar2025smoldocling,
  title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
  author={Nassar, Ahmed and Marafioti, Andres and Omenetti, Matteo and Lysak, Maksym and Livathinos, Nikolaos and Auer, Christoph and Morin, Lucas and de Lima, Rafael Teixeira and Kim, Yusik and Gurbuz, A Said and others},
  journal={arXiv preprint arXiv:2503.11576},
  year={2025}
}
```
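
---

The `assistant` string shown under Data Format packs a bounding box, a language tag, and the code text into one value. The following is a minimal sketch of how such a string might be split back into its parts. It assumes the concrete token shapes are `<loc_N>` with integer `N` for the four coordinates and `<_Name_>` for the language; the dataset card only shows the template, so these shapes, along with the helper names `CodeAnnotation` and `parse_assistant`, are illustrative assumptions rather than part of the dataset's tooling.

```python
import re
from typing import NamedTuple, Tuple


class CodeAnnotation(NamedTuple):
    """Hypothetical container for one parsed `assistant` string."""
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) exactly as given in the loc tokens
    language: str                    # text between "<_" and "_>"
    code: str                        # everything between the language tag and "</code>"


# Assumed token shapes: <loc_123> for coordinates, <_Python_> for the language.
# Adjust this pattern if real entries use a different encoding.
_ASSISTANT_RE = re.compile(
    r"((?:<loc_\d+>){4})<_([^<>]+?)_>(.*)</code>",
    re.DOTALL,  # code snippets span multiple lines
)


def parse_assistant(text: str) -> CodeAnnotation:
    """Split an `assistant` string into bounding box, language, and code."""
    match = _ASSISTANT_RE.fullmatch(text)
    if match is None:
        raise ValueError("assistant string does not match the assumed template")
    locs, language, code = match.groups()
    x0, y0, x1, y1 = (int(n) for n in re.findall(r"<loc_(\d+)>", locs))
    return CodeAnnotation((x0, y0, x1, y1), language, code)


# Example with a made-up string that follows the template.
sample = "<loc_10><loc_20><loc_300><loc_400><_Python_>print('hi')\n</code>"
annotation = parse_assistant(sample)
print(annotation.language, annotation.bbox)
```

For a loaded entry, the string to parse would be `entry["texts"][0]["assistant"]`. The sketch raises `ValueError` on anything that does not fit the assumed pattern, so a mismatch between the assumption and the real data surfaces immediately instead of producing silently wrong boxes.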

Provider:
maas
Created:
2025-08-01