Mustafaege/qwen3.5-vision-ocr-v2
Source: Hugging Face. Updated 2026-03-07; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Mustafaege/qwen3.5-vision-ocr-v2
---
language:
- en
license: apache-2.0
pretty_name: Qwen3.5 Vision OCR Dataset v2
size_categories:
- 100K<n<1M
task_categories:
- image-to-text
tags:
- ocr
- latex
- mathematics
- vision
- multimodal
- image-to-text
- formula-recognition
- handwritten
- printed
- sft
- qwen3
- qwen3.5
- qwen-vl
- qwen2.5-vl
- fine-tuning
- open-source
- expanded-dataset
modality:
- image
- text
annotations_creators:
- machine-generated
language_creators:
- found
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
# Qwen3.5 Vision OCR Dataset v2
An expanded LaTeX OCR dataset for Qwen3.5-VL fine-tuning that combines **unsloth/LaTeX_OCR** (1% sample) with the **full linxy/LaTeX_OCR** dataset. It provides roughly twice as many samples as v1 (~145K vs. 68,686), covering printed and handwritten formulas, all in the Qwen3-VL multimodal messages format.
## Dataset Summary
| Property | Value |
|----------|-------|
| **Total Samples** | ~145K |
| **Train Split** | ~130K |
| **Test Split** | ~15K |
| **Sources** | unsloth/LaTeX_OCR + linxy/LaTeX_OCR (full) |
| **Format** | Qwen3-VL multimodal messages |
| **Task** | Image → LaTeX formula |
| **License** | Apache 2.0 |
## v1 vs v2 Comparison
| Version | Samples | Coverage | Sources |
|---------|---------|----------|---------|
| **v1** | 68,686 | Printed formulas (1% sample) | unsloth/LaTeX_OCR |
| **v2** (this) | ~145K | Printed formulas (full dataset) | unsloth/LaTeX_OCR + linxy/LaTeX_OCR (full) |
## What's New in v2?
- **About 2x More Data**: ~145K vs. ~68K samples
- **Full Coverage**: Uses the complete linxy/LaTeX_OCR dataset (not a subset)
- **More Diversity**: Broader range of formula types and complexities
- **Better Generalization**: More unique examples should reduce the risk of overfitting
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `messages` | `list[dict]` | Multimodal conversation: user (image + instruction) + assistant (LaTeX) |
### Message Schema
```
messages[0] = {"role": "user", "content": [
    {"type": "text", "text": "Write the LaTeX representation for this image."},
    {"type": "image", "image": <PIL.Image>}
]}
messages[1] = {"role": "assistant", "content": [
    {"type": "text", "text": "<latex_formula>"}
]}
```
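The schema above can be checked mechanically before training. The following is a minimal sketch and not part of the dataset's tooling: `validate_sample` is a hypothetical helper name, and it assumes only the two-message layout described above (a user turn with one text part and one image part, followed by an assistant turn with one text part).

```python
from typing import Any


def validate_sample(sample: dict[str, Any]) -> bool:
    """Return True if `sample` follows the two-message schema shown above.

    Hypothetical helper for illustration. The image payload is not inspected;
    only its presence on the image part is checked.
    """
    messages = sample.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return False

    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False

    # The user turn carries exactly one text part and one image part.
    user_parts = user.get("content", [])
    if len(user_parts) != 2 or {p.get("type") for p in user_parts} != {"text", "image"}:
        return False
    image_part = next(p for p in user_parts if p.get("type") == "image")
    if image_part.get("image") is None:
        return False

    # The assistant turn carries a single non-empty LaTeX string.
    answer_parts = assistant.get("content", [])
    if len(answer_parts) != 1 or answer_parts[0].get("type") != "text":
        return False
    text = answer_parts[0].get("text")
    return isinstance(text, str) and bool(text.strip())


# A hand-built sample; `object()` stands in for a PIL image.
example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Write the LaTeX representation for this image."},
            {"type": "image", "image": object()},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": r"\frac{a}{b}"},
        ]},
    ]
}
print(validate_sample(example))
```

Running a check like this over a handful of rows is a cheap way to catch a malformed conversion before starting a long fine-tuning job.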
## Sources
| Dataset | Config | Samples | Formula Types | Notes |
|---------|--------|---------|---------------|-------|
| [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) | default | 68,686 | Printed | 1% sample of linxy full |
| [linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR) | full | ~76,318 | Printed | Full printed text dataset |
> Source: [LinXueyuanStdio/LaTeX_OCR](https://github.com/LinXueyuanStdio/LaTeX_OCR) — data from Zenodo, CROHME, and custom-built datasets. Validated with LaTeX AST parsing.
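The per-source counts in the table can be cross-checked against the totals quoted elsewhere in this card. The snippet below is plain arithmetic on those numbers, treating the approximate linxy/LaTeX_OCR count as exact for the purpose of the check; the variable names are illustrative.

```python
# Sample counts as listed in the Sources table.
unsloth_samples = 68_686  # unsloth/LaTeX_OCR (also the v1 size)
linxy_samples = 76_318    # linxy/LaTeX_OCR, listed as approximate

total = unsloth_samples + linxy_samples
growth = total / unsloth_samples  # v2 size relative to v1

print(f"combined: {total:,}")      # combined: 145,004
print(f"v2 / v1:  {growth:.2f}x")  # v2 / v1:  2.11x
```

The sum agrees with the ~145K total in the summary table, and the ratio agrees with the "2x" figure quoted for v2.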
## Format
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Write the LaTeX representation for this image."
        },
        {
          "type": "image",
          "image": "<PIL.PngImagePlugin.PngImageFile image mode=RGB size=320x64>"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "\\int_{0}^{\\infty} \\frac{x^{s-1}}{e^{x}-1} dx = \\Gamma(s) \\zeta(s)"
        }
      ]
    }
  ]
}
```
## Sample Formula Examples
| Image Description | LaTeX Output |
|------------------|--------------|
| Simple fraction | `\frac{a}{b}` |
| Summation | `\sum_{i=1}^{n} x_i` |
| Integral with limits | `\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}` |
| Matrix | `\begin{pmatrix} a & b \\ c & d \end{pmatrix}` |
| Derivative of a quotient | `\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right)` |
## Usage
```python
from datasets import load_dataset

dataset = load_dataset("Mustafaege/qwen3.5-vision-ocr-v2")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['messages'], num_rows: ~130000}),
#     test: Dataset({features: ['messages'], num_rows: ~15000})
# })

# Access image and LaTeX
sample = dataset['train'][0]
image = sample['messages'][0]['content'][1]['image']  # PIL.Image
latex = sample['messages'][1]['content'][0]['text']   # LaTeX string
print(f"LaTeX: {latex}")
```
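Indexing by position, as in the usage example above, relies on the text part coming before the image part. A slightly more defensive sketch looks parts up by their `type` instead. `extract_pair` is a hypothetical helper, not something shipped with the dataset, and it assumes the message schema documented in this card.

```python
from typing import Any


def extract_pair(sample: dict[str, Any]) -> tuple[Any, str]:
    """Return (image, latex) from one sample, locating parts by type.

    Hypothetical helper; assumes a user message containing an image part
    and an assistant message containing a text part.
    """
    image, latex = None, None
    for message in sample["messages"]:
        for part in message["content"]:
            if message["role"] == "user" and part["type"] == "image":
                image = part["image"]
            elif message["role"] == "assistant" and part["type"] == "text":
                latex = part["text"]
    if image is None or latex is None:
        raise ValueError("sample does not contain both an image and a LaTeX answer")
    return image, latex
```

With the dataset loaded as shown earlier, `image, latex = extract_pair(dataset['train'][0])` would stand in for the two positional lookups.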
## Training with Unsloth (VL)
```python
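from datasets import load_dataset
from unsloth import FastVisionModel, is_bfloat16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("Mustafaege/qwen3.5-vision-ocr-v2")

# Load the base vision-language model in 4-bit to reduce memory use.
# This is the checkpoint from the original example; swap in the Qwen-VL
# checkpoint you intend to fine-tune.
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "unsloth/Qwen2-VL-7B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
    use_gradient_checkpointing = "unsloth",  # model-loading option, not an SFTConfig field
)

# Attach LoRA adapters to both the vision and language layers.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,
    finetune_language_layers = True,
    r = 16, lora_alpha = 16,
)
FastVisionModel.for_training(model)  # switch the model to training mode

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    # Vision samples need Unsloth's collator to batch images together with text.
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    train_dataset = dataset['train'],
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        max_seq_length = 2048,
        # Settings used in Unsloth's vision examples so the trainer leaves
        # the multimodal `messages` column untouched.
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
    ),
)
trainer.train()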
```
## Related Datasets
| Version | Samples | Link |
|---------|---------|------|
| **v1** | 68,686 | [Mustafaege/qwen3.5-vision-ocr-v1](https://huggingface.co/datasets/Mustafaege/qwen3.5-vision-ocr-v1) |
| **v2** (this) | ~145K | [Mustafaege/qwen3.5-vision-ocr-v2](https://huggingface.co/datasets/Mustafaege/qwen3.5-vision-ocr-v2) |
## License
Apache 2.0 — see [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.
---
Built for Qwen3.5-VL fine-tuning. Part of the [Mustafaege](https://huggingface.co/Mustafaege) model series.
Provider: Mustafaege



