
Mustafaege/qwen3.5-vision-ocr-v2

Hugging Face · Updated 2026-03-07 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Mustafaege/qwen3.5-vision-ocr-v2
Description:
---
language:
- en
license: apache-2.0
pretty_name: Qwen3.5 Vision OCR Dataset v2
size_categories:
- 100K<n<1M
task_categories:
- image-to-text
tags:
- ocr
- latex
- mathematics
- vision
- multimodal
- image-to-text
- formula-recognition
- handwritten
- printed
- sft
- qwen3
- qwen3.5
- qwen-vl
- qwen2.5-vl
- fine-tuning
- open-source
- expanded-dataset
modality:
- image
- text
annotations_creators:
- machine-generated
language_creators:
- found
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Qwen3.5 Vision OCR Dataset v2

An expanded LaTeX OCR dataset for Qwen3.5-VL fine-tuning, combining **unsloth/LaTeX_OCR** (1% sample) and the **full linxy/LaTeX_OCR** dataset. It provides roughly twice as many samples as v1, covering printed and handwritten formulas, all in the Qwen3-VL multimodal messages format.

## Dataset Summary

| Property | Value |
|----------|-------|
| **Total Samples** | ~145K |
| **Train Split** | ~130K |
| **Test Split** | ~15K |
| **Sources** | unsloth/LaTeX_OCR + linxy/LaTeX_OCR (full) |
| **Format** | Qwen3-VL multimodal messages |
| **Task** | Image → LaTeX formula |
| **License** | Apache 2.0 |

## v1 vs v2 Comparison

| Version | Samples | Coverage | Sources |
|---------|---------|----------|---------|
| **v1** | 68,686 | Printed formulas (1% sample) | unsloth/LaTeX_OCR |
| **v2** (this) | ~145K | Printed + full coverage | + linxy/LaTeX_OCR full dataset |

## What's New in v2?

- **2x More Data**: ~145K samples vs 68,686 in v1
- **Full Coverage**: Uses the complete linxy/LaTeX_OCR dataset (not a subset)
- **More Diversity**: Broader range of formula types and complexities
- **Better Generalization**: More unique examples should reduce the risk of overfitting

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `messages` | `list[dict]` | Multimodal conversation: user (image + instruction) + assistant (LaTeX) |

### Message Schema

```
messages[0] = {"role": "user", "content": [
    {"type": "text", "text": "Write the LaTeX representation for this image."},
    {"type": "image", "image": <PIL.Image>}
]}
messages[1] = {"role": "assistant", "content": [
    {"type": "text", "text": "<latex_formula>"}
]}
```

## Sources

| Dataset | Config | Samples | Formula Types | Notes |
|---------|--------|---------|---------------|-------|
| [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) | default | 68,686 | Printed | 1% sample of linxy full |
| [linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR) | full | ~76,318 | Printed | Full printed text dataset |

> Source: [LinXueyuanStdio/LaTeX_OCR](https://github.com/LinXueyuanStdio/LaTeX_OCR). Data comes from Zenodo, CROHME, and custom-built datasets, and is validated with LaTeX AST parsing.

## Format

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Write the LaTeX representation for this image."
        },
        {
          "type": "image",
          "image": "<PIL.PngImagePlugin.PngImageFile image mode=RGB size=320x64>"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "\\int_{0}^{\\infty} \\frac{x^{s-1}}{e^{x}-1} dx = \\Gamma(s) \\zeta(s)"
        }
      ]
    }
  ]
}
```

## Sample Formula Examples

| Image Description | LaTeX Output |
|------------------|--------------|
| Simple fraction | `\frac{a}{b}` |
| Summation | `\sum_{i=1}^{n} x_i` |
| Integral with limits | `\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}` |
| Matrix | `\begin{pmatrix} a & b \\ c & d \end{pmatrix}` |
| Nested fraction | `\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right)` |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Mustafaege/qwen3.5-vision-ocr-v2")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['messages'], num_rows: ~130000}),
#     test: Dataset({features: ['messages'], num_rows: ~15000})
# })

# Access image and LaTeX
sample = dataset['train'][0]
image = sample['messages'][0]['content'][1]['image']  # PIL.Image
latex = sample['messages'][1]['content'][0]['text']   # LaTeX string
print(f"LaTeX: {latex}")
```

## Training with Unsloth (VL)

```python
# Import unsloth first so its patches apply before trl is loaded.
from unsloth import FastVisionModel, is_bfloat16_supported
from unsloth.trainer import UnslothVisionDataCollator
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("Mustafaege/qwen3.5-vision-ocr-v2")

# Load the base vision-language model in 4-bit to reduce memory use.
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "unsloth/Qwen2-VL-7B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
    # Gradient checkpointing is a model-loading option, not an SFTConfig field.
    use_gradient_checkpointing = "unsloth",
)

# Attach LoRA adapters to both the vision and the language layers.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,
    finetune_language_layers = True,
    r = 16,
    lora_alpha = 16,
)

FastVisionModel.for_training(model)  # switch the model to training mode

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    # The vision collator converts the `messages` records into model inputs.
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    train_dataset = dataset['train'],
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        max_seq_length = 2048,
        # Settings used for vision fine-tuning with a custom collator.
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
    ),
)

trainer.train()
```

## Related Datasets

| Version | Samples | Link |
|---------|---------|------|
| **v1** | 68,686 | [Mustafaege/qwen3.5-vision-ocr-v1](https://huggingface.co/datasets/Mustafaege/qwen3.5-vision-ocr-v1) |
| **v2** (this) | ~145K | [Mustafaege/qwen3.5-vision-ocr-v2](https://huggingface.co/datasets/Mustafaege/qwen3.5-vision-ocr-v2) |

## License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.

---

Built for Qwen3.5-VL fine-tuning. Part of the [Mustafaege](https://huggingface.co/Mustafaege) model series.
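The message schema in the card places the image in the user turn and the LaTeX string in the assistant turn. The snippet below is a minimal sketch of reading one record by role and part type instead of fixed list positions. The helper name `extract_pair` is hypothetical (it is not part of the dataset or of any library), and the sketch assumes each content part carries a `type` key as shown in the schema.

```python
from typing import Any, Dict, Tuple


def extract_pair(sample: Dict[str, Any]) -> Tuple[Any, str]:
    """Return (image, latex) from one record in the `messages` format.

    Hypothetical helper: it looks parts up by role and type, so it does not
    depend on the image being the second part of the user message.
    """
    image = None
    latex = None
    for message in sample["messages"]:
        for part in message["content"]:
            if message["role"] == "user" and part.get("type") == "image":
                image = part["image"]
            elif message["role"] == "assistant" and part.get("type") == "text":
                latex = part["text"]
    if image is None or latex is None:
        raise ValueError("record does not follow the expected messages schema")
    return image, latex


# Usage with a stand-in record; a real record holds a PIL image, not a string.
record = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Write the LaTeX representation for this image."},
            {"type": "image", "image": "<image placeholder>"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": r"\frac{a}{b}"},
        ]},
    ]
}
image, latex = extract_pair(record)
print(latex)
```

Looking parts up by `type` keeps the helper working if the order of the text and image parts ever differs between records, which the card does not guarantee either way.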
Provider:
Mustafaege