im2latex-100k

Name: im2latex-100k
Creator: OpenDataLab
Published: 2026-07-12 03:30:07
License: No description available

OpenDataLab, updated 2026-07-12, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/im2latex-100k

Resource description:

A pre-built dataset for OpenAI's image-2-latex system task. It contains roughly 100,000 formulas and images in total, split into training, validation, and test sets. The formulas were parsed from the LaTeX sources provided at http://www.cs.cornell.edu/projects/kddcup/datasets.html (originally from arXiv). Each image is a fixed-size PNG: the formula is black and the rest of the image is transparent. For related tools (e.g., a tokenizer), see this repository: https://github.com/Miffyli/im2latex-dataset. For pre-made evaluation scripts and a built im2latex system, see this repository: https://github.com/harvardnlp/im2markup.

The line breaks used in formulas_im2latex.lst are UNIX-style (\n). Reading the file with any other line-break convention yields a slightly wrong line count (104563 instead of 103558), which breaks the structure this dataset relies on. By default, Python 3.x opens text files in universal-newlines mode, which treats \r and \r\n as line breaks in addition to \n; to avoid this, the file must be opened with newline="\n" (e.g., open("formulas_im2latex.lst", newline="\n")).
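The line-ending caveat can be illustrated with a minimal Python sketch. The helper names `read_formulas` and `demo`, the three-line sample text, and the `latin-1` encoding are assumptions made for illustration and are not part of the dataset documentation; only the `newline="\n"` argument comes from the description above.

```python
import os
import tempfile


def read_formulas(path, encoding="latin-1"):
    """Return one formula per UNIX-style line, without the trailing "\\n".

    The encoding default is an assumption; adjust it to match the file.
    """
    # newline="\n" makes "\n" the only line terminator, so a stray "\r"
    # inside a formula stays part of that formula.
    with open(path, newline="\n", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def demo():
    """Compare strict and default line counts on a tiny hypothetical sample."""
    # Three formulas; the second one contains a stray carriage return.
    sample = "a + b\n\\alpha \r \\beta\nx^2\n"
    with tempfile.NamedTemporaryFile(
        "w", newline="", encoding="latin-1", suffix=".lst", delete=False
    ) as tmp:
        tmp.write(sample)  # newline="" writes the characters untranslated
        path = tmp.name
    try:
        strict_count = len(read_formulas(path))
        # Default mode (newline=None) also splits on "\r" and "\r\n".
        with open(path, encoding="latin-1") as handle:
            universal_count = sum(1 for _ in handle)
    finally:
        os.remove(path)
    return strict_count, universal_count


if __name__ == "__main__":
    print(demo())
```

On the real file, the same comparison should correspond to the two line counts quoted in the description, with the strict reading giving the count that matches the dataset's structure.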

Provider:

OpenDataLab

Created:

2022-05-23

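Collected summary

Dataset introduction

Background and challenges

Background overview

im2latex-100k is a pre-built dataset for the image-to-LaTeX formula conversion task. It contains roughly 100,000 formulas and their corresponding PNG images, split into training, validation, and test sets. The dataset is designed for optical character recognition and computer vision pretraining. The formulas come from arXiv LaTeX sources, and in each image the formula is black on a transparent background, which makes the dataset suitable for developing image-2-latex systems. The dataset was released in 2016 by Harvard University and the University of Eastern Finland under the CC0 1.0 license, and it links to repositories with related tools and evaluation scripts.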

The above content was collected and summarized by 遇见数据集.