
lt-asset/tab2latex

Source: Hugging Face. Updated 2025-09-21; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/lt-asset/tab2latex
Description:
Tab2Latex is a LaTeX table recognition dataset with 87,513 training instances, 5,000 validation instances, and 5,000 test instances. The LaTeX source code comes from arXiv papers published between 2018 and 2023 in six sub-fields of computer science: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Cryptography and Security, Programming Languages, and Software Engineering. Tables are extracted from the LaTeX source by matching \begin{tabular} and \end{tabular} and removing comments; each extracted table script is then rendered to PDF and converted to a PNG image at 160 dpi. Each instance has three fields: an instance ID, the image rendered from the LaTeX source, and the LaTeX source code of the table.
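
The extraction step described above (match the tabular environment delimiters, drop comments) can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' actual pipeline: the function names `strip_comments` and `extract_tabulars` are hypothetical, stripping comments before matching is an assumed ordering, and the non-greedy regex does not handle nested tabular environments or a `%` that follows an escaped backslash (`\\%`).

```python
import re

# A '%' that is not escaped as '\%', through the end of the line.
COMMENT_RE = re.compile(r"(?<!\\)%.*")
# One tabular environment, matched non-greedily across lines.
TABULAR_RE = re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.DOTALL)


def strip_comments(latex: str) -> str:
    """Remove LaTeX comments line by line, keeping escaped percent signs."""
    return "\n".join(COMMENT_RE.sub("", line) for line in latex.splitlines())


def extract_tabulars(latex: str) -> list[str]:
    """Return every tabular environment found in the comment-free source."""
    # Assumption: comments are stripped first so commented-out tables are skipped.
    return TABULAR_RE.findall(strip_comments(latex))


sample = r"""
% \begin{tabular}{c} old \end{tabular}
\begin{table}
\begin{tabular}{lr} % column spec
Model & Acc. (\%) \\
A & 91.2 \\
\end{tabular}
\end{table}
"""

tables = extract_tabulars(sample)
print(len(tables))
print(tables[0])
```

Rendering each extracted table to PDF and rasterizing it at 160 dpi is outside this sketch; it would require a LaTeX toolchain and a PDF-to-image converter, neither of which the listing names.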
Provider:
lt-asset