Latex-KIE
Source: ModelScope community. Updated: 2025-12-03. Indexed: 2025-04-26.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Latex-KIE
Description:
# Latex-KIE Dataset
The **Latex-KIE** dataset is a large-scale collection of paired LaTeX formula images and their corresponding LaTeX code. It is specifically designed for training and evaluating models for **Image-to-LaTeX**, **Key Information Extraction (KIE)**, and **Optical Character Recognition (OCR)** tasks in scientific domains.
---
## 📊 Dataset Summary
- **Images**: Rendered LaTeX math formulas (black text on white background)
- **Text**: Corresponding raw LaTeX code for each image
- **Split**: `train`
- **Total Samples**: 92,057
- **Format**: Parquet (`.parquet`)
- **Size**: ~439 MB
---
## 🧾 Data Fields
Each data sample consists of:
| Column | Type | Description |
|----------------|------------|------------------------------------------|
| `image` | Image | Rendered image of the LaTeX formula |
| `latex_formula`| `string` | Corresponding LaTeX string representation|
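
The two-column schema above can be checked per row with a small validator. This is a minimal sketch under stated assumptions: each row has already been decoded into a Python `dict`, and the `image` value is held as encoded bytes (the actual in-memory image type depends on the loader you use). `Sample` and `validate_sample` are hypothetical names introduced for illustration, not part of the dataset or any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One row: image bytes plus its LaTeX source (hypothetical container)."""
    image: bytes
    latex_formula: str


def validate_sample(row: dict) -> Sample:
    """Check that a decoded row carries both documented columns with usable values."""
    missing = {"image", "latex_formula"} - set(row)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")

    image, formula = row["image"], row["latex_formula"]
    # Assumption of this sketch: the image arrives as non-empty encoded bytes.
    if not isinstance(image, (bytes, bytearray)) or len(image) == 0:
        raise ValueError("image must be non-empty bytes")
    if not isinstance(formula, str) or not formula.strip():
        raise ValueError("latex_formula must be a non-empty string")
    return Sample(bytes(image), formula)
```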
---
## 📂 Example
```json
{
"image": "<Rendered Image of LaTeX>",
"latex_formula": "\\begin{align*} L_{N,M,N} = \\frac{1}{N^d} \\sum ... \\end{align*}"
}
```
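The doubled backslashes in the example are JSON escaping, not part of the formula: after parsing, each `\\` becomes a single backslash. The sketch below uses only Python's standard `json` module to show this, along with what goes wrong if the backslashes are left single. In that case `\b` and `\f` are valid JSON escapes (backspace and form feed), so `\begin` and `\frac` are silently corrupted, while sequences such as `\s` are rejected as invalid escapes. The variable names are illustrative only.

```python
import json

# The example record as JSON text; the raw string keeps the doubled backslashes intact.
record_json = r'''
{
  "image": "<Rendered Image of LaTeX>",
  "latex_formula": "\\begin{align*} L_{N,M,N} = \\frac{1}{N^d} \\sum ... \\end{align*}"
}
'''

record = json.loads(record_json)
formula = record["latex_formula"]
print(formula)  # each "\\" in the JSON text decodes to one backslash

# Single backslashes: "\b" and "\f" parse as control characters, corrupting the LaTeX.
corrupted = json.loads(r'{"latex_formula": "\begin \frac"}')["latex_formula"]
print(repr(corrupted))
```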
---
## 🧠 Use Cases
This dataset is intended for:
- Training models for **Image-to-LaTeX generation**
- Key Information Extraction (KIE) from scientific formulas
- Benchmarking OCR models on scientific/math notation
- Pretraining/fine-tuning Transformer or CNN-based encoders for math-to-text generation
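
For the benchmarking use case, predictions need to be compared against the `latex_formula` reference. The dataset card does not prescribe a metric; the sketch below assumes two common choices, token-level exact match and a normalized token edit distance, computed after a coarse regex tokenization that ignores whitespace. `tokenize`, `edit_distance`, and `score` are hypothetical helper names. For example, `score(r"\frac{1} {N^d}", r"\frac{1}{N^d}")` counts as an exact match because only the spacing differs.

```python
import re

# A command (\frac), an escaped single character (\{), or any other non-space character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(latex: str) -> list[str]:
    """Split a LaTeX string into coarse tokens, ignoring whitespace."""
    return TOKEN_RE.findall(latex)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def score(prediction: str, reference: str) -> dict:
    """Return exact match and edit distance normalized by the longer sequence."""
    pred, ref = tokenize(prediction), tokenize(reference)
    denom = max(len(pred), len(ref), 1)  # avoid division by zero on empty input
    return {
        "exact_match": pred == ref,
        "token_error_rate": edit_distance(pred, ref) / denom,
    }
```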
---
Provider:
maas
Created:
2025-04-22



