shreyansh1347/GNHK-Synthetic-OCR-Dataset

Name: shreyansh1347/GNHK-Synthetic-OCR-Dataset
Creator: shreyansh1347
Published: 2024-02-01 12:55:57
License: No description available

Hugging Face · Updated 2024-02-01 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/shreyansh1347/GNHK-Synthetic-OCR-Dataset

Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: Image
    dtype: Image
  - name: ocr_text
    dtype: string
  - name: bbox_data
    dtype: string
  - name: conversation
    list:
    - name: Question
      dtype: string
    - name: Answer
      dtype: string
  - name: description
    dtype: string
  - name: complex_reasoning
    struct:
    - name: Question
      dtype: string
    - name: Answer
      dtype: string
configs:
- config_name: default
  data_files:
  - split: test
    path: dataset.parquet
---

# GNHK Synthetic OCR Dataset

## Overview

Welcome to the GNHK Synthetic OCR Dataset repository. It contains synthetic data that I generated from the [GNHK Dataset](https://github.com/GoodNotes/GNHK-dataset) using open-source LLMs such as Mixtral. The dataset contains queries about the images and their answers.

## What's Inside?

- **Dataset Folder:** Contains the images and, for each image, a JSON file that carries the OCR information of that image.
- **Parquet File:** For easy handling and analysis, the processed dataset is saved as a Parquet file (`dataset.parquet`). This file contains images, their OCR text, one probable question per image, and its likely answer.

# Methodology for Generation

## ParseJSON.ipynb

This Python notebook works with the dataset provided by GNHK, stored on Google Drive. The dataset consists of images, each accompanied by a JSON file containing OCR information for that image. The purpose of ParseJSON is to extract information from these JSON files, convert it into text files, and store these files in a folder named `parsed_dataset` on the same Google Drive.

### What does it parse to?

- **ocr_data**: The parser extracts the OCR text of each word, groups the words by their 'line_index', and arranges them to represent the OCR text of the given image.
- **bbox_data**: The parser generates a second text file that structures the bounding-box information in this format: `word: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]]` (where x0, y0, etc. are the coordinates of the bounding box)

### Why do we need a parser?
The parser is necessary because the models take OCR data and bounding boxes as input. If this information stays in JSON format, building a prompt for the models becomes complex and may confuse them, resulting in undesirable outputs. The parser simplifies the process by converting the data into easily readable text files.

## DatasetGeneration.ipynb

This notebook is the central tool for creating the dataset. In summary, it uses OCR data and bounding boxes to prompt open-source LLMs, generating query-output tuples. The methodology draws inspiration from the paper on [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485), which outlines the creation of three types of query-output tuples:

1. **Conversation Based:** Simple question-answer pairs related to the given image, covering a broad range of straightforward inquiries. Multiple conversation-based query-output tuples are generated for a single image to ensure comprehensiveness.
2. **Description:** This is not a typical question-answer pair. In this category the model generates a detailed description of the text depicted in the image.
3. **Complex Reasoning Based:** These questions go deeper and require thoughtful consideration. Answering them involves understanding the visual content, then applying background knowledge or reasoning to provide a detailed response. Only one question-answer tuple of this kind is generated for each image.

## Output Parsing and Cleaning Functions

Various parsers are implemented to process the model-generated output. Because LLM outputs are unpredictable, these parsers are not flawless. However, by relying on few-shot prompting and on common patterns identified in the LLM outputs, they can handle a significant number of cases. Their primary function is to convert the raw output into a structured format for inclusion in the final dataset.
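As a rough illustration of the kind of parser described above, the sketch below extracts question-answer pairs from raw model text. The `Question:`/`Answer:` layout of the raw output and the name `parse_qa_pairs` are assumptions made for this example; the notebook's actual parsers and prompt format are not shown in this README.

```python
import re

# Assumed raw layout: "Question: ... Answer: ..." repeated any number of times.
# The real LLM output format used in the notebook may differ.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<q>.+?)\s*Answer:\s*(?P<a>.+?)(?=\s*Question:|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def parse_qa_pairs(raw_output: str) -> list[dict[str, str]]:
    """Turn raw model text into a list of {"Question", "Answer"} dicts."""
    pairs = []
    for match in QA_PATTERN.finditer(raw_output):
        # Collapse internal whitespace and newlines into single spaces.
        question = " ".join(match.group("q").split())
        answer = " ".join(match.group("a").split())
        if question and answer:
            pairs.append({"Question": question, "Answer": answer})
    return pairs
```

Text before the first `Question:` marker (for example a model preamble) is ignored, and input with no recognizable pair yields an empty list, so a caller can decide whether to retry or drop the sample.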
Finally, the generated dataset has the following format:

```
[{
  "id": id,
  "Image": Image,
  "ocr_text": data,
  "bbox_data": string,
  "conversation": [
    {
      "Question": question,
      "Answer": answer
    }
  ],
  "description": string,
  "complex_reasoning": {
    "Question": question,
    "Answer": answer
  }
}]
```

### Model Used

After multiple experiments, the most promising results were achieved with the [Mixtral_8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) model. It performed better than Llama-2 70b on this specific task. The open-source models were run in the cloud through the services offered by Together.ai.

## Post Processing

In this experiment, two large language models (LLMs) were used to post-process the generated output in an attempt to improve dataset quality. The LLMs used were [Platypus2](https://huggingface.co/garage-bAInd/Platypus2-70B-instruct) and [Mixtral_8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1). The process involved the following steps:

### Step 1:

1. **Generation and Evaluation:** Mixtral_8x7b generated the initial dataset, which was then evaluated and modified by Platypus2. Subsequently, the output from Platypus2 was further evaluated and modified by Mixtral_8x7b.

### Step 2:

2. **Judgment and Selection:** The outputs from both Mixtral_8x7b (final output of step 1) and Platypus2 (intermediate output of step 1) were assessed by [Mixtral_8x7b_Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). The best output was selected, and the dataset was updated accordingly.

The pipeline can be summarized as follows:

```
Step 1: Mixtral_8x7b generates dataset --> Platypus2 evaluates and makes changes --> Mixtral_8x7b evaluates its changes

Step 2: Mixtral_8x7b output (from Step 1's evaluation stage) --> Mixtral_8x7b_Instruct
                                                                          |
                                                        Platypus2 output (from Step 1)
```

The resulting dataset, after this process, is named `post_processed_dataset.parquet`.
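The two-step flow can be sketched as plain function composition. This is only a minimal sketch: the function name `post_process`, the callable signatures, and the convention that the judge returns an index are assumptions for illustration, and the remote LLM calls are replaced by ordinary Python callables.

```python
from typing import Callable

# Each model is represented by a plain callable. In the original experiment
# these would be remote LLM calls, which are not reproduced here.
Reviser = Callable[[str], str]     # takes an output, returns a revised output
Judge = Callable[[str, str], int]  # returns 0 or 1: index of the preferred candidate


def post_process(
    initial_output: str,
    platypus_revise: Reviser,
    mixtral_revise: Reviser,
    instruct_judge: Judge,
) -> str:
    """Run the two post-processing steps on one generated sample."""
    # Step 1: Platypus2 revises the initial Mixtral output,
    # then Mixtral revises the Platypus2 output.
    platypus_output = platypus_revise(initial_output)
    mixtral_output = mixtral_revise(platypus_output)

    # Step 2: the instruct model judges both candidates and the winner is kept.
    candidates = (mixtral_output, platypus_output)
    return candidates[instruct_judge(*candidates)]
```

Passing the models in as callables keeps the control flow testable with stubs and independent of any particular hosting service.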
Please note that only 50 data points were post-processed as part of this experiment.

**Note:** While this post-processing experiment aimed to improve the dataset's overall quality, manual inspection did not reveal significant improvements.

Provider:

shreyansh1347

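Summary of Original Information

GNHK Synthetic OCR Dataset

Overview

The GNHK Synthetic OCR Dataset is a synthetic dataset generated from the GNHK Dataset with open-source large language models such as Mixtral. It contains queries about images and their answers.

Dataset Contents

Dataset folder: Contains the images and, for each image, a JSON file holding that image's OCR information.
Parquet file: The processed dataset is saved as a Parquet file (dataset.parquet) containing the images, their OCR text, one probable question per image, and its likely answer.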

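Dataset Generation Method

ParseJSON.ipynb

This Python notebook works with the GNHK dataset stored on Google Drive. The dataset consists of images and their corresponding JSON files, which contain the OCR information. The purpose of the ParseJSON notebook is to extract information from these JSON files, convert it into text files, and store them in the parsed_dataset folder on Google Drive.

Parsed Output

ocr_data: Extracts the OCR text according to line_index and arranges it into the OCR text of the given image.
bbox_data: Generates a second text file in the format word: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]] (where x0, y0, etc. are the coordinates of the bounding box).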
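The two outputs can be illustrated with a small sketch. The record layout used here (a list of word dicts with the keys `text`, `line_index`, and `polygon` holding `x0`..`y3`) and the function name `parse_annotations` are assumptions for this example; the exact key names in the GNHK JSON files should be checked against the files themselves.

```python
from collections import defaultdict


def parse_annotations(words: list[dict]) -> tuple[str, str]:
    """Return (ocr_data, bbox_data) text for one image's word annotations.

    Assumes each word dict has "text", "line_index", and a "polygon" dict
    with keys x0, y0, ..., x3, y3 (hypothetical layout for this sketch).
    """
    lines = defaultdict(list)
    bbox_rows = []
    for word in words:
        # Group words by line, keeping the order in which they appear.
        lines[word["line_index"]].append(word["text"])
        polygon = word["polygon"]
        corners = [[polygon[f"x{i}"], polygon[f"y{i}"]] for i in range(4)]
        bbox_rows.append(f"{word['text']}: {corners}")

    # One text line per line_index, in ascending order.
    ocr_data = "\n".join(" ".join(lines[index]) for index in sorted(lines))
    bbox_data = "\n".join(bbox_rows)
    return ocr_data, bbox_data
```

The sketch keeps words in their input order within a line; if the annotations are not already ordered left to right, sorting each line by the `x0` coordinate would be a reasonable extra step.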

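DatasetGeneration.ipynb

This notebook is the central tool for creating the dataset. It uses OCR data and bounding boxes to prompt open-source large language models, generating query-output tuples.

Types of Query-Output Tuples

Conversation based: Simple question-answer pairs related to the given image, covering a broad range of straightforward inquiries.
Description: The model generates a detailed description of the text in the image; this is not a typical question-answer pair.
Complex reasoning: Deeper questions that require understanding the visual content and applying background knowledge or reasoning to give a detailed answer.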
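To make the prompting step concrete, the sketch below assembles a text prompt from the parsed OCR text and bounding boxes for one of the three tuple types. The instruction wording, the `QUERY_TYPES` table, and the function name `build_prompt` are entirely hypothetical; the prompts actually used in the notebook are not given in this description.

```python
# Hypothetical task instructions, one per tuple type described above.
QUERY_TYPES = {
    "conversation": "Write several simple question-answer pairs about the text in the image.",
    "description": "Write a detailed description of the text in the image.",
    "complex_reasoning": "Write one question that requires reasoning about the text, with a detailed answer.",
}


def build_prompt(ocr_text: str, bbox_data: str, query_type: str) -> str:
    """Combine OCR text, bounding boxes, and a task instruction into one prompt."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query type: {query_type!r}")
    return (
        f"OCR text:\n{ocr_text}\n\n"
        f"Bounding boxes:\n{bbox_data}\n\n"
        f"Task: {QUERY_TYPES[query_type]}"
    )
```

Feeding the model plain text rather than raw JSON mirrors the reasoning given for the parser: a simpler input leaves less room for the model to be confused by structure.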

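Output Parsing and Cleaning Functions

Several parsers are implemented to process the model-generated output. By relying on few-shot prompting and on common patterns identified in the LLM outputs, these parsers can handle a large number of cases. Their main function is to convert the raw output into a structured format for inclusion in the final dataset.

Dataset Format

[{
  "id": id,
  "Image": Image,
  "ocr_text": data,
  "bbox_data": string,
  "conversation": [ { "Question": question, "Answer": answer } ],
  "description": string,
  "complex_reasoning": { "Question": question, "Answer": answer }
}]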
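A record in this layout can be checked with a few lines of Python. The sketch below is not part of the original notebooks; the function names `validate_record` and `_is_qa` are made up for this example, and the check only covers the field names and basic types shown in the format above.

```python
def _is_qa(item) -> bool:
    """True if item is a dict with string "Question" and "Answer" fields."""
    return (
        isinstance(item, dict)
        and isinstance(item.get("Question"), str)
        and isinstance(item.get("Answer"), str)
    )


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the layout."""
    problems = []
    for key in ("id", "ocr_text", "bbox_data", "description"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} should be a string")
    if "Image" not in record:
        problems.append("Image is missing")
    conversation = record.get("conversation")
    if not isinstance(conversation, list) or not all(_is_qa(turn) for turn in conversation):
        problems.append("conversation should be a list of Question/Answer dicts")
    if not _is_qa(record.get("complex_reasoning")):
        problems.append("complex_reasoning should be a Question/Answer dict")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to audit many generated records in a single pass.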

Model Used

After multiple experiments, the best results were obtained with the Mixtral_8x7b model. It performed better than Llama-2 70b on this specific task.

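Post Processing

In this experiment, two language models, Platypus2 and Mixtral_8x7b, were used to process the generated output in order to improve dataset quality.

Processing Steps

Generation and evaluation: Mixtral_8x7b generates the initial dataset, Platypus2 evaluates and modifies it, and Mixtral_8x7b then evaluates those modifications again.
Judgment and selection: The outputs of Mixtral_8x7b (final output of step 1) and Platypus2 (intermediate output of step 1) are assessed by Mixtral_8x7b_Instruct; the best output is selected and the dataset is updated.

The resulting dataset is named post_processed_dataset.parquet. Note that only 50 data points were post-processed as part of this experiment.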
