nn-auto-bench-ds
Source: ModelScope community. Updated 2026-01-06; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/nanonets/nn-auto-bench-ds
Resource description:
# nn-auto-bench-ds
`nn-auto-bench-ds` is a dataset designed for key information extraction (KIE) and serves as a benchmark dataset for [nn-auto-bench](<BLOGLINK>).
## Dataset Overview
The dataset comprises **1,000 documents**, categorized into the following types:
1. **Invoice**
2. **Receipt**
3. **Passport**
4. **Bank Statement**
The documents are primarily in English, with some in German and Arabic. Each document is annotated for key information extraction and specific tasks. The dataset can be used to measure the one-shot performance of LLMs on KIE tasks.
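As a rough illustration of how one-shot KIE performance might be scored, the sketch below computes field-level accuracy between a model's extracted fields and the ground truth. It assumes both have already been parsed into plain dictionaries mapping field names to values; the function name `field_accuracy`, the normalization rule, and the sample values are hypothetical and are not the scoring method used by nn-auto-bench.

```python
def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches after normalization."""
    if not expected:
        return 0.0

    def normalize(value) -> str:
        # Collapse whitespace and ignore case so trivial formatting differences do not count as errors
        return " ".join(str(value).split()).casefold()

    hits = sum(
        1
        for key, value in expected.items()
        if key in predicted and normalize(predicted[key]) == normalize(value)
    )
    return hits / len(expected)


# Hypothetical values for illustration only (not taken from the dataset)
expected = {"invoice_number": "INV-001", "total": "120.00"}
predicted = {"invoice_number": "inv-001", "total": "125.00"}

score = field_accuracy(predicted, expected)
print(score)
```

Exact matching after light normalization is a deliberately strict choice; a fuzzy string metric could be substituted if near-misses should receive partial credit.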
## Dataset Schema
The dataset includes the following columns:
- **`image_path`**: File path to the document image.
- **`content`**: OCR-extracted text from the image.
- **`accepted`**: Ground truth answer.
- **`Queried_labels`**: Labels, fields, or keys targeted for extraction.
- **`Queried_col_headers`**: Column headers targeted for extraction.
- **`ctx_1`**: OCR text from an example document.
- **`ctx_1_image_path`**: File path to the example document’s image.
- **`ctx_1_accepted`**: Ground truth answer for the example document.
In total, there are 54 unique fields, keys, or labels to extract from the documents.
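The columns above can be combined into a one-shot prompt: the `ctx_1` columns supply the worked example, and `content` supplies the target document. The following is a minimal sketch under stated assumptions. The prompt wording, the helper name `build_one_shot_prompt`, and the sample row are hypothetical, and the code assumes `Queried_labels` is either a list or a single string, which the dataset card does not specify.

```python
def build_one_shot_prompt(row: dict) -> str:
    """Assemble a one-shot KIE prompt from a single dataset row."""
    labels = row["Queried_labels"]
    # Assumption: labels may be stored as a list or as one string
    if isinstance(labels, (list, tuple)):
        labels_text = ", ".join(str(label) for label in labels)
    else:
        labels_text = str(labels)

    return (
        "Extract the requested fields from the document.\n\n"
        f"Fields: {labels_text}\n\n"
        "Example document:\n"
        f"{row['ctx_1']}\n"
        "Example answer:\n"
        f"{row['ctx_1_accepted']}\n\n"
        "Target document:\n"
        f"{row['content']}\n"
        "Answer:"
    )


# Hypothetical row for illustration only (not taken from the dataset)
example_row = {
    "content": "Invoice No: INV-002 Total: 80.00",
    "Queried_labels": ["invoice_number", "total"],
    "ctx_1": "Invoice No: INV-001 Total: 120.00",
    "ctx_1_accepted": '{"invoice_number": "INV-001", "total": "120.00"}',
}

prompt = build_one_shot_prompt(example_row)
print(prompt)
```

The model's reply to such a prompt would then be compared against the row's `accepted` value.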
## Loading the Dataset
To load the dataset in Python using the `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("nanonets/nn-auto-bench-ds")
```
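After loading, it can be useful to confirm that the columns listed in the schema are actually present. The helper below is a small sketch of such a check. The name `missing_columns` is hypothetical, and the sample list stands in for the column names of a loaded split so that the snippet runs without downloading anything.

```python
# Column names as listed in the Dataset Schema section
EXPECTED_COLUMNS = [
    "image_path",
    "content",
    "accepted",
    "Queried_labels",
    "Queried_col_headers",
    "ctx_1",
    "ctx_1_image_path",
    "ctx_1_accepted",
]


def missing_columns(column_names) -> list:
    """Return the schema columns that are absent from `column_names`, in schema order."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Stand-in for the column names of a loaded split
sample_columns = ["image_path", "content", "accepted"]
print(missing_columns(sample_columns))
```

When called without a `split` argument, `load_dataset` returns a `DatasetDict`, so the check can be applied to each split via `dataset[split_name].column_names`. The split names themselves are not stated in this card.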
## Data Sources
This dataset aggregates information from multiple open-source datasets, including:
1. [German Invoices Dataset](https://huggingface.co/datasets/Aoschu/German_invoices_dataset)
2. [Personal Financial Dataset for India](https://www.kaggle.com/datasets/mehaksingal/personal-financial-dataset-for-india)
3. [RVL-CDIP Invoice Dataset](https://huggingface.co/datasets/chainyo/rvl-cdip-invoice)
4. [FATURA Dataset](https://zenodo.org/records/8261508)
5. [Find It Again](http://l3i-share.univ-lr.fr/2023Finditagain/index.html)
6. [Generated USA Passports Dataset](https://huggingface.co/datasets/TrainingDataPro/generated-usa-passeports-dataset/tree/main/data)
7. [Synthetic Passports Dataset](https://huggingface.co/datasets/UniDataPro/synthetic-passports)
This dataset is valuable for benchmarking key information extraction models and advancing research in document understanding and natural language processing (NLP).
Provider: maas
Created: 2025-06-13



