dedoc/orientation_columns_dataset

Name: dedoc/orientation_columns_dataset
Creator: dedoc
Published: 2024-08-02 11:22:19
License: not specified

Source: Hugging Face. Updated 2024-08-02; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/dedoc/orientation_columns_dataset

Description:

---
language:
- ru
- ar
- fr
- pt
- en
size_categories:
- 10K<n<100K
---

## About dataset

The main purpose of this dataset is to train and evaluate a model that determines the orientation of a document and the number of text columns in it. For the model, we chose EfficientNet B0.

We constructed this dataset to represent the variety of documents we usually deal with. It contains open data in the form of scientific papers, legal acts, reports, tables, etc. The languages represented in this dataset are: Russian, English, French, Spanish, Portuguese, Arabic, Armenian, Chinese, Georgian, Greek, Italian, Japanese, Korean, and Mongolian. More specifically, it contains 2426 one-column source documents and 1695 multiple-column source documents. These source files are then rotated to cover all four possible orientations (0, 90, 180, and 270 degrees).

Formally, document orientation is the angle by which a text document has been rotated relative to its upright (vertical) position, the one in which a person can read it. We consider four possible orientations: 0 (upright position), 90, 180, and 270 degrees.

A document is considered a one-column document if most of the text in it is arranged in one column. Similarly, a document is considered a multi-column document if most of the text is divided into two columns.

## Description

The initial repository structure is as follows:

```
└─orientation_columns_dataset
  ├─.gitattributes
  ├─README.md
  └─generate_dataset_orient_classifier.zip
```

The structure of the `generate_dataset_orient_classifier.zip` archive after unzipping is as follows:

```
└─generate_dataset_orient_classifier
  ├─src
  │ ├─one_column
  │ └─multiple_column
  ├─README.md
  └─scripts
    ├─gen_dataset.py
    └─get_imgs_from_pdf.py
```

The `one_column` and `multiple_column` folders above contain the source pictures for the dataset. The `one_column` folder contains documents with only one text column, and the `multiple_column` folder contains documents with two columns of text.

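
The source layout above suggests how labels can be derived: the parent folder gives the column class, and each source picture yields one sample per orientation. The following is a minimal sketch of that idea, not the code in `gen_dataset.py`. The function name `plan_samples`, the output naming scheme, and the encoding of the column class as 1 or 2 are all assumptions made for illustration.

```python
from pathlib import Path

ORIENTATIONS = (0, 90, 180, 270)  # the four orientations named in the dataset card
# Assumed encoding of the column class; the real labels may use other values.
COLUMNS_BY_FOLDER = {"one_column": 1, "multiple_column": 2}


def plan_samples(src_dir):
    """Return (source_path, output_name, orientation, columns) tuples.

    Hypothetical helper: it only plans the samples and does not rotate anything.
    """
    samples = []
    for folder, columns in COLUMNS_BY_FOLDER.items():
        for path in sorted(Path(src_dir, folder).glob("*")):
            if not path.is_file():
                continue
            for angle in ORIENTATIONS:
                # Assumed naming scheme: <stem>_<angle><suffix>
                output_name = f"{path.stem}_{angle}{path.suffix}"
                samples.append((path, output_name, angle, columns))
    return samples
```

Each planned tuple would then be handed to an image library to perform the actual rotation. If the unpacked folder names differ from those shown here, only the keys of `COLUMNS_BY_FOLDER` need to change.
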
After running the generation scripts `gen_dataset.py` and `get_imgs_from_pdf.py`, you will get the dataset in its final form, which can be used for training and evaluating the model. The structure of the output dataset folder should look as follows:

```
└─columns_orientation_dataset
  ├─test
  └─train
```

Both the `train` and `test` folders above contain rotated document pictures and a file named `labels.csv`. These are the dataset markup tables, with the columns `image_name`, `orientation`, and `columns`, which hold the labels for each document image. These markup files are generated automatically.

## About generation scripts:

* `scripts/gen_dataset.py` - generates an output dataset for model training and testing. It rotates document images and creates a `labels.csv` markup file in each dataset folder
  * `-i`, `--input_path_img`: absolute path to the source folder
  * `-o`, `--output_path_img`: absolute path to the output folder
  * `-l`, `--output_path_lbl`: absolute path for the label file; by default it is placed in the output folders
* `scripts/get_imgs_from_pdf.py` - a helper for adding images to the `src` folder from other PDFs
  * `-i`, `--input_path_img`: absolute path to the source folder
  * `-o`, `--output_path_img`: absolute path to the output folder
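
Once the dataset has been generated, a quick check of the label distribution can help confirm that all four orientations and both column classes are present. The sketch below assumes that `labels.csv` is comma-delimited with a header row containing the `image_name`, `orientation`, and `columns` columns described above. The helper name `count_labels` is hypothetical and is not part of the dataset's scripts.

```python
import csv
from collections import Counter


def count_labels(labels_path):
    """Count samples per (orientation, columns) pair in a labels.csv file."""
    counts = Counter()
    with open(labels_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Values are kept as strings, since their exact encoding
            # is not documented in the dataset card.
            counts[(row["orientation"], row["columns"])] += 1
    return counts
```

For example, `count_labels("columns_orientation_dataset/train/labels.csv")` would return a `Counter` keyed by `(orientation, columns)` pairs, where the path is the output location assumed from the folder structure above.
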
