
dreeseaw/SGOCR

Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.

Download link: https://hf-mirror.com/datasets/dreeseaw/SGOCR
---
license: other
task_categories:
- visual-question-answering
- image-to-text
tags:
- ocr
- visual-question-answering
- text-recognition
- document-understanding
- scene-text
- synthetic-data
- chartqa
- textocr
- coco-text
- metadata-only
size_categories:
- 1K<n<10K
pretty_name: SGOCR
---

# SGOCR

SGOCR is a spatially-grounded OCR visual question answering dataset for training and evaluating models that must read, localize, and reason about text in images.

The dataset contains grounded question-answer pairs over ChartQA, TextOCR, and COCO/COCO-Text source images. It is designed for OCR-aware VQA, text grounding, region-conditioned QA, and data-centric experiments around scene text understanding.

**Project repository:** https://github.com/cothogonal/sgocr-dataset-pipeline

**Release:** `v1.0.0`

## What This Dataset Is For

| Use case | Why SGOCR helps |
|---|---|
| OCR visual question answering | Questions require reading visible text from image regions, not just global image classification. |
| Text grounding and localization | Rows include source image IDs, text polygons, text boxes, anchor labels, and anchor boxes. |
| Multimodal data augmentation | QA pairs are generated over existing image corpora and can be joined to local image copies. |
| Dataset-agent workflows | The schema is flat JSONL with explicit join keys, source IDs, and column-level provenance. |
| Error analysis | Quality/provenance fields expose OCR confidence, resolvability, and verification signals. |

## Dataset Snapshot

| Split | QA rows | Referenced images | Image policy |
|---|---:|---:|---|
| `train` | 5,737 | 1,958 | Metadata-only; users join against upstream images |

## Source Mix

| Source | QA rows | Images | Join key |
|---|---:|---:|---|
| ChartQA | 1,964 | 754 | `source_image_id` is the ChartQA image stem |
| TextOCR | 1,786 | 634 | `source_image_id` is the TextOCR image id |
| COCO / COCO-Text | 1,987 | 570 | `source_image_id` is the COCO train2014 numeric id |

## Question Types

| Type | Rows | Task |
|---|---:|---|
| `DIRECT_READ` | 2,778 | Read text from a grounded region. |
| `YES_NO` | 1,635 | Verify whether a region contains a proposed text/value. |
| `TEXT_PROPERTY` | 915 | Answer questions about text properties such as count or case. |
| `REVERSE_GROUND` | 409 | Identify an anchor or location from text-region context. |

## Example Samples

### Direct Text Reading

![Direct-read example](assets/examples/example_1.jpg)

| Field | Value |
|---|---|
| Source | `coco_text/train/282357` |
| Question | What does the sign say, specifically the upper-left text around the top-center area of the image? |
| Answer | `BHAR` |
| Grounding | anchor=`sign`, relation=`above` |

### Region Verification

![Yes-no example](assets/examples/example_2.png)

| Field | Value |
|---|---|
| Source | `chartqa/shared/ce1423c017bd7c98ff8c9414f03ac33080d0036910639323ec09b7a413ee5b8c` |
| Question | Does the black label's upper-left text near the upper-left area of the image say '25-34'? |
| Answer | `No` |
| Grounding | anchor=`blue bar`, relation=`on` |

## Files

| Path | Description |
|---|---|
| `data/train.jsonl` | Main QA table. One row per question-answer sample. |
| `data/source_images.jsonl` | Image-level join table. One row per referenced source image. |
| `metadata/summary.json` | Aggregate generation and quality summary. |
| `metadata/source_manifest.json` | Source-image manifest for this release. |
| `metadata/run_report.json` | Release provenance metadata. |
| `metadata/*.jsonl.gz` | Compressed audit artifacts for deeper inspection. |
| `assets/examples/*` | Two small illustrative images used only by this README. |

## Main Schema

| Column group | Columns |
|---|---|
| Identity | `sample_id`, `image_id` |
| Source join | `source_dataset`, `source_split`, `source_image_id`, `source_file_name`, `upstream_dataset`, `join_hint` |
| QA | `question`, `answer`, `question_type`, `answer_level`, `answer_type`, `answer_source` |
| Grounding | `anchor_label`, `ref_label`, `relation`, `text_node_ids`, `text_polygon`, `text_bbox_xywh_*`, `anchor_xyxy_*`, `ref_xyxy_*` |
| Image metadata | `image_width`, `image_height` |
| Quality/provenance | `ocr_confidence`, `resolvable`, `unique`, `quality_tier`, `inline_frontier_correct`, `teacher_provider`, `teacher_model`, `prompt_variant` |

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("dreeseaw/SGOCR", data_files="data/train.jsonl", split="train")
print(ds[0]["question"])
print(ds[0]["answer"])
```

To join images, load `data/source_images.jsonl` or use each row's source columns directly.

```python
row = ds[0]
print(row["source_dataset"], row["source_split"], row["source_image_id"])
print(row["join_hint"])
```

## Joining Source Images

SGOCR does not redistribute the full source image corpora. Download the upstream datasets under their own terms, then join with the columns below.

| Source | Download | Join procedure |
|---|---|---|
| ChartQA | Official ChartQA distribution | Match `source_image_id` to the ChartQA image stem, typically `<source_image_id>.png`. |
| TextOCR | TextOCR v0.1 train data | Use `TextOCR_0.1_train.json["imgs"][source_image_id]["file_name"]`. |
| COCO / COCO-Text | COCO train2014 images and COCO-Text metadata | Format `source_image_id` as `COCO_train2014_<12-digit-id>.jpg`. |

## Recommended Citation

If you use SGOCR, please cite or link the project repository:

```text
SGOCR dataset pipeline.
https://github.com/cothogonal/sgocr-dataset-pipeline
```

## License And Data Terms

The SGOCR annotation files in this repository are provided for research and dataset-development use. The full underlying images are not redistributed here; users must obtain ChartQA, TextOCR, COCO, and COCO-Text from their official sources and comply with each upstream dataset's license and terms of use.

The two images under `assets/examples/` are included only as small illustrative README examples. They do not replace the upstream datasets and should be treated as subject to the same upstream source terms as the original images.

Downstream users are responsible for verifying that their intended use, redistribution, and joined image storage comply with all applicable upstream licenses.
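
The three join procedures listed under "Joining Source Images" can be sketched as one small helper. This is a minimal sketch, not part of the SGOCR release: the function name `resolve_image_name` is hypothetical, and the `source_dataset` values it matches on (`chartqa`, `textocr`, `coco_text`) are assumed from the example source strings in this README. Inspect the real column values and each row's `join_hint` before relying on it.

```python
from typing import Any, Mapping, Optional


def resolve_image_name(
    row: Mapping[str, Any],
    textocr_imgs: Optional[Mapping[str, Any]] = None,
) -> str:
    """Return the expected upstream file name for one SGOCR row.

    Hypothetical helper. `textocr_imgs` is assumed to be the "imgs"
    mapping loaded from TextOCR_0.1_train.json.
    """
    # Assumption: source_dataset holds names such as "chartqa",
    # "textocr", or "coco_text".
    source = str(row["source_dataset"]).lower()
    image_id = str(row["source_image_id"])

    if "chartqa" in source:
        # ChartQA: the id is the image stem, typically "<stem>.png".
        return f"{image_id}.png"
    if "textocr" in source:
        if textocr_imgs is None:
            raise ValueError("TextOCR rows need the 'imgs' mapping")
        return textocr_imgs[image_id]["file_name"]
    if "coco" in source:
        # COCO train2014: zero-pad the numeric id to 12 digits.
        return f"COCO_train2014_{int(image_id):012d}.jpg"
    raise ValueError(f"Unrecognized source_dataset: {source!r}")


coco_row = {"source_dataset": "coco_text", "source_image_id": "282357"}
coco_name = resolve_image_name(coco_row)
print(coco_name)
```

The COCO row above reuses the id from the direct-read example sample. For TextOCR, pass `json.load(...)["imgs"]` from the upstream annotation file as `textocr_imgs`.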

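
The published row counts give a quick integrity check for a local copy of `data/train.jsonl`. The sketch below uses only the Python standard library. The helper names `read_jsonl` and `tally` are hypothetical, and the in-memory `sample_rows` are made-up stand-ins for real rows. The per-type numbers are copied from the "Question Types" table.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator, Mapping


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def tally(rows: Iterable[Mapping[str, Any]], column: str) -> Counter:
    """Count rows per distinct value of `column`."""
    return Counter(row[column] for row in rows)


# Row counts as published in the "Question Types" table.
PUBLISHED_TYPE_COUNTS = {
    "DIRECT_READ": 2778,
    "YES_NO": 1635,
    "TEXT_PROPERTY": 915,
    "REVERSE_GROUND": 409,
}
# The per-type counts add up to the 5,737 rows of the train split.
published_total = sum(PUBLISHED_TYPE_COUNTS.values())

# Made-up rows, used here only to show the tally on a small input.
sample_rows = [
    {"question_type": "DIRECT_READ"},
    {"question_type": "YES_NO"},
    {"question_type": "DIRECT_READ"},
]
sample_counts = tally(sample_rows, "question_type")
print(sample_counts)
```

On a real download, `tally(read_jsonl("data/train.jsonl"), "question_type")` can be compared against `PUBLISHED_TYPE_COUNTS`. The same call with `"source_dataset"` gives the per-source breakdown.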
Provider: dreeseaw