ds4sd/SynthCodeNet

Name: ds4sd/SynthCodeNet
Creator: ds4sd
Published: 2025-07-16 07:15:17
License: No description available

Hugging Face | Updated 2025-07-16 | Indexed 2025-10-25

Download link:

https://hf-mirror.com/datasets/ds4sd/SynthCodeNet

Description:

SynthCodeNet is a multimodal dataset of more than 9.3 million synthetic image-text pairs, built for training the SmolDocling model. The pairs cover code snippets in 56 programming languages. The images are rendered synthetically with LaTeX and Pygments to provide visual diversity. The dataset is divided into training, validation, and test sets containing 8,400,838, 466,703, and 466,716 samples, respectively.
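As a quick sanity check on the figures in the description, the Python sketch below sums the three quoted split sizes and computes each split's share of the total. The variable names are illustrative only and are not part of any dataset tooling; the counts are copied from the description.

```python
# Split sizes as quoted in the dataset description.
split_sizes = {
    "train": 8_400_838,
    "validation": 466_703,
    "test": 466_716,
}

# Total number of image-text pairs across all splits.
total = sum(split_sizes.values())

# Fraction of the whole dataset that each split represents.
fractions = {name: count / total for name, count in split_sizes.items()}

print(f"total samples: {total:,}")
for name, frac in fractions.items():
    print(f"{name:>10}: {split_sizes[name]:>9,} ({frac:.2%})")
```

The total comes to 9,334,257, which agrees with the "more than 9.3 million" figure, and the shares work out to roughly 90% training, 5% validation, and 5% test.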

Provided by:

ds4sd
