
shreyansh1347/GNHK-Synthetic-OCR-Dataset

Source: Hugging Face. Updated 2024-02-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/shreyansh1347/GNHK-Synthetic-OCR-Dataset
Description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: Image
    dtype: Image
  - name: ocr_text
    dtype: string
  - name: bbox_data
    dtype: string
  - name: conversation
    list:
    - name: Question
      dtype: string
    - name: Answer
      dtype: string
  - name: description
    dtype: string
  - name: complex_reasoning
    struct:
    - name: Question
      dtype: string
    - name: Answer
      dtype: string
configs:
- config_name: default
  data_files:
  - split: test
    path: dataset.parquet
---

# GNHK Synthetic OCR Dataset

## Overview

Welcome to the GNHK Synthetic OCR Dataset repository. Here I have generated synthetic data using the [GNHK Dataset](https://github.com/GoodNotes/GNHK-dataset) and open-source LLMs such as Mixtral. The dataset contains queries on the images and their answers.

## What's Inside?

- **Dataset Folder:** Contains the images and, for each image, a JSON file that carries the OCR information of that image.
- **Parquet File:** For easy handling and analysis, the processed dataset is saved as a Parquet file (`dataset.parquet`). This file contains images, their OCR text, one probable question per image, and its likely answer.

# Methodology for Generation

## ParseJSON.ipynb

This Python notebook works with the dataset provided by GNHK, stored on Google Drive. The dataset consists of images, each accompanied by a JSON file containing OCR information for that image. The purpose of ParseJSON is to extract information from these JSON files, convert it into text files, and store these files in a folder named `parsed_dataset` on the same Google Drive.

### What does it parse to?

- **ocr_data**: The OCR text of each word is extracted and grouped by its 'line_index', so that the result represents the OCR text of the given image.
- **bbox_data**: A second text file is generated by the parser, with one entry per word in this format: `word: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]]` (where x0, y0, etc. are the coordinates of the bounding-box corners).

### Why do we need a parser?
The parser is necessary because the models require OCR data and bounding boxes as input. If this information is left in JSON format, building a prompt for the models becomes complex and may confuse them, resulting in undesirable outputs. The parser simplifies the process by converting the data into easily understandable text files.

## DatasetGeneration.ipynb

This notebook is the central tool for creating the dataset. In summary, it uses the OCR data and bounding boxes to prompt open-source LLMs, which generate query-output tuples. The methodology draws inspiration from the paper on [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485), which outlines the creation of three types of query-output tuples:

1. **Conversation Based:** Simple question-answer pairs related to the given image, covering a broad range of straightforward inquiries. Multiple conversation-based query-output tuples are generated for a single image to ensure comprehensiveness.
2. **Description:** This is not a typical question-answer pair. In this category the model generates a detailed description of the text depicted in the image.
3. **Complex Reasoning Based:** These questions delve deeper and require thoughtful consideration. Answering them involves understanding the visual content and then applying background knowledge or reasoning to provide a detailed response. Only one question-answer tuple of this kind is generated for each image.

## Output Parsing and Cleaning Functions

Various parsers are implemented to process the model-generated output. Because LLM outputs are unpredictable, these parsers are not flawless. However, by combining few-shot prompting with the common patterns identified in the LLM outputs, they can handle a significant number of cases. Their primary function is to convert the raw output into a structured format for inclusion in the final dataset.
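As an illustration of the kind of parser described above, here is a minimal sketch that pulls question-answer pairs out of free-form model output. It assumes the model labels its pairs with `Question` and `Answer` (optionally numbered); the function name `parse_conversation` and the regular expression are hypothetical and are not taken from the notebook.

```python
import re

# Assumed output pattern: "Question 1: ... Answer 1: ..." (numbers optional).
QA_PATTERN = re.compile(
    r"Question\s*\d*\s*:\s*(?P<q>.+?)\s*"
    r"Answer\s*\d*\s*:\s*(?P<a>.+?)"
    r"(?=\s*Question\s*\d*\s*:|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def parse_conversation(raw_output):
    """Extract question-answer pairs from raw model output (hypothetical helper)."""
    pairs = []
    for match in QA_PATTERN.finditer(raw_output):
        question = match.group("q").strip()
        answer = match.group("a").strip()
        # Keep only pairs where both parts are non-empty.
        if question and answer:
            pairs.append({"Question": question, "Answer": answer})
    return pairs


raw = (
    "Sure! Here are some pairs.\n"
    "Question 1: What is written at the top?\n"
    "Answer 1: The word 'Hello'.\n\n"
    "Question 2: How many lines are there?\n"
    "Answer 2: Two."
)
pairs = parse_conversation(raw)
```

Any preamble before the first `Question` label is ignored, and output that contains no labeled pair yields an empty list, which a caller could treat as a signal to retry or discard the sample.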
Finally, the generated dataset has the following format:

```
[{
  "id": id,
  "Image": Image,
  "ocr_text": data,
  "bbox_data": string,
  "conversation": [
    {
      "Question": question,
      "Answer": answer
    }
  ],
  "description": string,
  "complex_reasoning": {
    "Question": question,
    "Answer": answer
  }
}]
```

### Model Used

After multiple experiments, the most promising results were achieved with the [Mixtral_8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) model, which outperformed Llama-2 70b on this specific task. The open-source models were run in the cloud using the services offered by Together.ai.

## Post Processing

In this experiment, the generated output was processed with two LLMs to try to improve dataset quality. The LLMs used were [Platypus2](https://huggingface.co/garage-bAInd/Platypus2-70B-instruct) and [Mixtral_8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1). The process involved the following steps:

### Step 1:

1. **Generation and Evaluation:** Mixtral_8x7b generated the initial dataset, which was then evaluated and modified by Platypus2. The output from Platypus2 was in turn evaluated and modified by Mixtral_8x7b.

### Step 2:

2. **Judgment and Selection:** The outputs from both Mixtral_8x7b (final output of step 1) and Platypus2 (intermediate output of step 1) were assessed by [Mixtral_8x7b_Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). The best output was selected, and the dataset was updated accordingly.

The pipeline can be summarized as follows:

```
Step 1: Mixtral_8x7b generates dataset --> Platypus2 evaluates and makes changes --> Mixtral_8x7b evaluates its changes

Step 2: Mixtral_8x7b output (from Step 1's evaluation stage) --> Mixtral_8x7b_Instruct
        Platypus2 output (from Step 1)                       --> Mixtral_8x7b_Instruct
```

The resulting dataset, after this process, is named `post_processed_dataset.parquet`.
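The control flow of the two steps can be sketched as below. This is only an outline under stated assumptions: each model is represented by a plain callable (in practice these would wrap calls to a hosted model), and the names `post_process`, `platypus`, `mixtral`, and `judge` are hypothetical rather than taken from the notebook.

```python
from typing import Callable

# Hypothetical interface: a model takes text in and returns revised text.
Model = Callable[[str], str]


def post_process(
    sample: str,
    platypus: Model,
    mixtral: Model,
    judge: Callable[[str, str], int],
) -> str:
    """Refine one Mixtral-generated sample and return the judged winner."""
    # Step 1: Platypus2 revises the initial output, then Mixtral revises that.
    platypus_out = platypus(sample)
    mixtral_out = mixtral(platypus_out)
    # Step 2: the judge returns 0 to keep the Mixtral output, 1 for Platypus2.
    candidates = [mixtral_out, platypus_out]
    return candidates[judge(mixtral_out, platypus_out)]


# Stub models that only tag the text, so the flow can be followed end to end.
result = post_process(
    "draft",
    platypus=lambda text: text + " [P]",
    mixtral=lambda text: text + " [M]",
    judge=lambda first, second: 0,
)
```

Passing the models in as callables keeps the selection logic independent of any particular hosting service or client library.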
Please note that only 50 data points were post-processed as part of this experiment.

**Note:** While this post-processing experiment aimed to enhance the dataset's overall quality, manual observations did not reveal significant improvements.
Provider:
shreyansh1347
Summary of Original Information

GNHK Synthetic OCR Dataset

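Overview

The GNHK Synthetic OCR Dataset is a synthetic dataset generated from the GNHK Dataset using open-source large language models (LLMs) such as Mixtral. It contains queries about the images together with their answers.

Dataset Contents

  • Dataset folder: Contains the images and, for each image, a JSON file holding that image's OCR information.
  • Parquet file: The processed dataset is saved as a Parquet file (dataset.parquet) containing the images, their OCR text, one probable question per image, and its likely answer.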

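Dataset Generation Method

ParseJSON.ipynb

This Python notebook works with the GNHK dataset stored on Google Drive. The dataset consists of images and their corresponding JSON files, which contain the OCR information. The purpose of the ParseJSON notebook is to extract information from these JSON files, convert it into text files, and store them in the parsed_dataset folder on Google Drive.

Parsed Output

  • ocr_data: Extracts the OCR text according to line_index and assembles it into the OCR text of the given image.
  • bbox_data: Generates a second text file in the format word: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]] (where x0, y0, etc. are the coordinates of the bounding-box corners).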
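The two outputs listed above could be produced along these lines. This is a sketch, not the notebook's code: it assumes each annotation is a list of word records with a `text` field, a line-number field (named `line_index` here, following the description; the actual key in the GNHK files may differ), and a `polygon` mapping with keys `x0`, `y0` through `x3`, `y3`. The function name `parse_annotation` is hypothetical.

```python
from collections import defaultdict


def parse_annotation(words, line_key="line_index"):
    """Build (ocr_text, bbox_text) from a list of word records.

    The field names "text", "polygon" and line_key are assumptions
    about the annotation layout.
    """
    lines = defaultdict(list)
    bbox_rows = []
    for word in words:
        # Group words by the line they belong to, in order of appearance.
        lines[word[line_key]].append(word["text"])
        polygon = word["polygon"]
        corners = [[polygon[f"x{i}"], polygon[f"y{i}"]] for i in range(4)]
        bbox_rows.append(f"{word['text']}: {corners}")
    # One output line of text per line index, in ascending order.
    ocr_text = "\n".join(" ".join(lines[idx]) for idx in sorted(lines))
    return ocr_text, "\n".join(bbox_rows)


sample_words = [
    {"text": "Hello", "line_index": 0,
     "polygon": {"x0": 0, "y0": 0, "x1": 4, "y1": 0, "x2": 4, "y2": 2, "x3": 0, "y3": 2}},
    {"text": "world", "line_index": 0,
     "polygon": {"x0": 5, "y0": 0, "x1": 9, "y1": 0, "x2": 9, "y2": 2, "x3": 5, "y3": 2}},
    {"text": "Bye", "line_index": 1,
     "polygon": {"x0": 0, "y0": 3, "x1": 3, "y1": 3, "x2": 3, "y2": 5, "x3": 0, "y3": 5}},
]
ocr_text, bbox_text = parse_annotation(sample_words)
```

In a notebook, `words` would typically come from `json.load` on one annotation file, and the two returned strings would be written to separate text files.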

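DatasetGeneration.ipynb

This notebook is the core tool for creating the dataset. It uses the OCR data and bounding boxes to prompt open-source LLMs, which generate query-output tuples.

Types of Query-Output Tuples

  1. Conversation based: Simple question-answer pairs related to the given image, covering a broad range of straightforward queries.
  2. Description: The model generates a detailed description of the text in the image; this is not a typical question-answer pair.
  3. Complex reasoning: Deeper questions that require understanding the visual content and applying background knowledge or reasoning to give a detailed answer.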
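To make the prompting step concrete, the sketch below assembles one prompt per tuple type from the parsed OCR text and bounding boxes. The exact prompts used in the notebook are not given here, so the instruction wording, the `TASK_INSTRUCTIONS` table, and the `build_prompt` function are all hypothetical.

```python
# Hypothetical instruction text for each of the three tuple types.
TASK_INSTRUCTIONS = {
    "conversation": "Write several simple question-answer pairs about the text.",
    "description": "Write a detailed description of the text.",
    "complex_reasoning": "Write one question that needs reasoning, with its answer.",
}


def build_prompt(kind, ocr_text, bbox_data):
    """Combine a task instruction with the OCR text and bounding boxes."""
    if kind not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown tuple type: {kind!r}")
    return (
        f"{TASK_INSTRUCTIONS[kind]}\n\n"
        f"OCR text:\n{ocr_text}\n\n"
        f"Bounding boxes:\n{bbox_data}\n"
    )


prompt = build_prompt(
    "description",
    "Hello world",
    "Hello: [[0, 0], [4, 0], [4, 2], [0, 2]]",
)
```

Few-shot examples, where used, could be prepended to the returned string in the same way.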

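Output Parsing and Cleaning Functions

Several parsers are implemented to process the model-generated output. By combining few-shot prompting with the common patterns identified in LLM outputs, these parsers can handle a large number of cases. Their main function is to convert the raw output into a structured format for inclusion in the final dataset.

Dataset Format

    [{
      "id": id,
      "Image": Image,
      "ocr_text": data,
      "bbox_data": string,
      "conversation": [ { "Question": question, "Answer": answer } ],
      "description": string,
      "complex_reasoning": { "Question": question, "Answer": answer }
    }]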
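Because bbox_data is stored as a string, a consumer may want to turn it back into coordinates. The following sketch assumes one `word: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]]` entry per line, which is an assumption about how the string is laid out; `parse_bbox_data` is a hypothetical helper name.

```python
import ast


def parse_bbox_data(bbox_data):
    """Parse 'word: [[x0, y0], ...]' lines into (word, corners) pairs."""
    entries = []
    for line in bbox_data.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last ': [[' so that words containing ':' stay intact.
        word, separator, rest = line.rpartition(": [[")
        if not separator:
            continue  # the line does not follow the assumed pattern
        try:
            corners = ast.literal_eval("[[" + rest)
        except (ValueError, SyntaxError):
            continue  # skip entries whose coordinates cannot be parsed
        entries.append((word, corners))
    return entries


boxes = parse_bbox_data(
    "Hello: [[1, 2], [3, 4], [5, 6], [7, 8]]\n"
    "\n"
    "re: x: [[0, 0], [1, 0], [1, 1], [0, 1]]\n"
    "garbage"
)
```

Malformed lines are skipped silently in this sketch; a stricter consumer could raise an error instead.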

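Model Used

After multiple experiments, the best results were obtained with the Mixtral_8x7b model, which outperformed Llama-2 70b on this specific task.

Post-Processing

In this experiment, two language models, Platypus2 and Mixtral_8x7b, were used to process the generated output in order to improve dataset quality.

Processing Steps

  1. Generation and evaluation: Mixtral_8x7b generates the initial dataset, Platypus2 evaluates and modifies it, and Mixtral_8x7b then evaluates those modifications.
  2. Judgment and selection: The outputs of Mixtral_8x7b (final output of step 1) and Platypus2 (intermediate output of step 1) are assessed by Mixtral_8x7b_Instruct; the best output is selected and the dataset is updated.

The resulting dataset is named post_processed_dataset.parquet. Note that only 50 data points were post-processed as part of this experiment.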
