rootsautomation/TABMEpp

Name: rootsautomation/TABMEpp
Creator: rootsautomation
Published: 2024-08-23 14:23:18
License: MIT (as stated in the dataset card)

Hugging Face · Updated 2024-08-23 · Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/rootsautomation/TABMEpp

Description:

---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-classification
- image-classification
pretty_name: TABME++
dataset_info:
  features:
  - name: doc_id
    dtype: string
  - name: pg_id
    dtype: int32
  - name: ocr
    dtype: string
  - name: img
    dtype: binary
  splits:
  - name: train
    num_bytes: 14584489453
    num_examples: 110132
  - name: val
    num_bytes: 796081484
    num_examples: 6089
  - name: test
    num_bytes: 812403045
    num_examples: 6237
  download_size: 11207258731
  dataset_size: 16192973982
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
---

# Dataset Card for TABME++

The TABME dataset is a synthetic collection of business document folders generated from the Truth Tobacco Industry Documents archive, with preprocessing and OCR results included, designed to simulate real-world digitization tasks.
TABME++ extends TABME by enriching it with commercial-quality OCR (Microsoft OCR).

## Dataset Details

### Dataset Description

The TABME dataset is a synthetic collection created to simulate the digitization of business documents, derived from a portion of the Truth Tobacco Industry Documents (TTID) archive.
The dataset was constructed by sampling 44,769 PDF documents, excluding corrupted files and those longer than 20 pages, and then preprocessing them by cropping margins, converting them to grayscale, and resizing to 1,000 pixels.
To mimic real-world scenarios, folders of documents were generated using a Poisson distribution with λ = 11, leading to a mean folder length of around 30 pages.
The dataset was split into training, validation, and test sets, with OCR preprocessing performed using the Tesseract engine.
The dataset includes 100,000 folders for training, 5,000 for validation, and 5,000 for testing, and the results include recognized words, their coordinates, and confidence levels.
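The folder-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' generation code: it assumes the Poisson draw with λ = 11 is the number of documents per folder (the card does not say exactly which quantity is sampled), it forces at least one document per folder, and the names `sample_poisson`, `build_folder`, and `toy_docs` are hypothetical. Under that assumption, a mean folder length of around 30 pages would correspond to roughly 30 / 11 ≈ 2.7 pages per document on average.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    # Keep multiplying uniforms until the running product drops below e^-lam.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def build_folder(doc_page_counts, lam, rng):
    """Pick a Poisson-distributed number of documents and return their ids.

    Assumption: the Poisson draw is the number of documents per folder,
    and empty folders are avoided by drawing at least one document.
    """
    n_docs = max(1, sample_poisson(lam, rng))
    doc_ids = list(doc_page_counts)
    return [rng.choice(doc_ids) for _ in range(n_docs)]


# Toy corpus: hypothetical document ids mapped to page counts.
toy_docs = {"doc_a": 1, "doc_b": 2, "doc_c": 5}
rng = random.Random(0)
folder = build_folder(toy_docs, lam=11, rng=rng)
n_pages = sum(toy_docs[d] for d in folder)
print(len(folder), "documents,", n_pages, "pages")
```

The sampler is written with the standard library only so the sketch has no dependencies; `numpy.random.Generator.poisson` would be a natural replacement in practice.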
TABME++ replaces the previous OCR with commercial-quality OCR obtained through Microsoft's OCR services.

- **Curated by:** UCSF, UCL, University of Cambridge, Vector.ai, Roots Automation
- **Language(s) (NLP):** English
- **License:** MIT

### Dataset Sources

- **Repository:** [Original TABME release](https://github.com/aldolipani/TABME)
- **Paper:** [Tab this Folder of Documents: Page Stream Segmentation of Business Documents](https://dl.acm.org/doi/10.1145/3558100.3563852)

## Uses

### Direct Use

This dataset is intended to be used for page stream segmentation: the segmentation of a stream of ordered pages into coherent atomic documents.

## Dataset Structure

Each row of the dataset corresponds to one page of one document.
Each page has the following features:

- `doc_id`, str: The unique document id this page belongs to
- `pg_id`, int: The page id within its document
- `ocr`, str: A string containing the OCR annotations from Microsoft OCR. These can be loaded as a Python dictionary with `json.loads` (or equivalent).
- `img`, binary: The raw bytes of the page image. These can be converted back to a PIL.Image with `Image.open(io.BytesIO(bytes))` (or equivalent).

Each document appears in the dataset exactly once.
To build the full aggregated synthetic streams, collate the unique documents according to the streams described in the [streams sub-folder](https://huggingface.co/datasets/rootsautomation/TABMEpp/tree/main/streams).

## Dataset Creation

### Curation Rationale

The original data, the Truth Tobacco Industry Documents archive (formerly known as the Legacy Tobacco Documents Library), was curated by researchers at UCSF to promote the study of information retrieval and the analysis of business documents.

TABME was curated to promote research on page stream segmentation, a core task in automated document processing.

TABME++ improves upon TABME by adding higher-quality OCR annotations, but is still curated for the same purposes.
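Returning to the per-page features listed under Dataset Structure, the sketch below shows one way to decode a row and to turn an ordered page stream into segmentation targets. It is an illustration under stated assumptions, not part of the dataset tooling: the helper names are hypothetical, the toy `ocr` JSON does not reflect the real Microsoft OCR schema, and the boundary rule (a page starts a new document whenever its `doc_id` differs from the previous page's) is one common formulation of page stream segmentation that would miss the same document appearing twice in a row.

```python
import io
import json


def decode_ocr(row):
    """Parse the JSON string stored in the `ocr` field."""
    return json.loads(row["ocr"])


def decode_image(row):
    """Open the raw bytes in the `img` field as a PIL image (requires Pillow)."""
    from PIL import Image  # imported lazily so the other helpers work without Pillow

    return Image.open(io.BytesIO(row["img"]))


def boundary_labels(pages):
    """Label each page of an ordered stream: 1 if it starts a new document, else 0.

    Assumption: a boundary is any change in `doc_id` between consecutive pages.
    """
    labels = []
    previous_doc = None
    for page in pages:
        labels.append(1 if page["doc_id"] != previous_doc else 0)
        previous_doc = page["doc_id"]
    return labels


# Toy stream with placeholder values; real rows carry actual OCR JSON and image bytes.
stream = [
    {"doc_id": "doc_a", "pg_id": 0, "ocr": '{"toy": true}', "img": b""},
    {"doc_id": "doc_a", "pg_id": 1, "ocr": '{"toy": true}', "img": b""},
    {"doc_id": "doc_b", "pg_id": 0, "ocr": '{"toy": true}', "img": b""},
]
print(decode_ocr(stream[0]))
print(boundary_labels(stream))
```

Keeping the image decoding separate from the OCR parsing lets text-only experiments skip the Pillow dependency and avoid decoding image bytes they never use.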
### Source Data

From the [UCSF Library](https://www.industrydocuments.ucsf.edu/tobacco/):

> Truth Tobacco Industry Documents (formerly known as Legacy Tobacco Documents Library) was created in 2002 by the UCSF Library. It was built to house and provide permanent access to tobacco industry internal corporate documents produced during litigation between US States and the seven major tobacco industry organizations and other sources. These internal documents give a view into the workings of one of the largest and most influential industries in the United States.

## Citation

**BibTeX:**

```
@misc{heidenreich2024largelanguagemodelspage,
  title={Large Language Models for Page Stream Segmentation},
  author={Hunter Heidenreich and Ratish Dalvi and Rohith Mukku and Nikhil Verma and Neven Pičuljan},
  year={2024},
  eprint={2408.11981},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11981},
}
```

## Dataset Card Authors

Hunter Heidenreich