docling-project/SynthCodeNet
Source: Hugging Face. Updated 2025-07-16. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/docling-project/SynthCodeNet
Description:
SynthCodeNet is a multimodal dataset of more than 9.3 million synthetic image-text pairs, built for training the SmolDocling model. Each pair holds a code snippet in one of 56 programming languages. The images are rendered with LaTeX and Pygments to provide visual diversity. The dataset is split into training, validation, and test sets, and every entry contains both an image and text.
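With over 9.3 million pairs, it may be more practical to stream a few records than to download the whole dataset. The following is a minimal sketch, not the maintainers' documented workflow. It assumes the Hugging Face `datasets` library is installed, that the splits are named `train`, `validation`, and `test`, and that no configuration name is required. The helpers `stream_split` and `describe_records` are hypothetical names introduced here. Because the listing does not give the column names, the sketch reports whatever fields it finds rather than assuming them.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Repository id taken from the listing above.
REPO_ID = "docling-project/SynthCodeNet"


def stream_split(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Lazily iterate over one split without downloading the full dataset.

    Assumption: the split is named "train", "validation", or "test".
    """
    # Imported lazily so describe_records() works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


def describe_records(
    records: Iterable[Dict[str, Any]], limit: int = 3
) -> List[Dict[str, str]]:
    """Map each field of the first `limit` records to its Python type name."""
    return [
        {field: type(value).__name__ for field, value in record.items()}
        for record in islice(records, limit)
    ]


# Usage (needs network access and `pip install datasets`):
#   for summary in describe_records(stream_split("train")):
#       print(summary)
```

Inspecting field names and types first is a cautious way to confirm the actual schema before writing code that depends on specific columns.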
Provided by:
docling-project



