HuggingFaceM4/the_cauldron
收藏Hugging Face2024-05-06 更新2024-04-19 收录
下载链接:
https://hf-mirror.com/datasets/HuggingFaceM4/the_cauldron
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
- config_name: ai2d
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 435362437.84770346
num_examples: 2434
download_size: 438136609
dataset_size: 435362437.84770346
- config_name: aokvqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 871997710.0
num_examples: 16539
download_size: 893265070
dataset_size: 871997710.0
- config_name: chart2text
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1060566797.2728182
num_examples: 26961
download_size: 1103141721
dataset_size: 1060566797.2728182
- config_name: chartqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 784719364.9441738
num_examples: 18265
download_size: 803192402
dataset_size: 784719364.9441738
- config_name: clevr
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 11522617868.0
num_examples: 70000
download_size: 13267429872
dataset_size: 11522617868.0
- config_name: clevr_math
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 13308311206.0
num_examples: 70000
download_size: 16315284
dataset_size: 13308311206.0
- config_name: cocoqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 2213960474.0
num_examples: 46287
download_size: 2393991009
dataset_size: 2213960474.0
- config_name: datikz
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 481233278.0
num_examples: 47974
download_size: 613100257
dataset_size: 481233278.0
- config_name: diagram_image_to_text
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 18877197.0
num_examples: 300
download_size: 18706661
dataset_size: 18877197.0
- config_name: docvqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 6885686042.0
num_examples: 10189
download_size: 6887803845
dataset_size: 6885686042.0
- config_name: dvqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 3689940101.0
num_examples: 200000
download_size: 4295254110
dataset_size: 3689940101.0
- config_name: figureqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1901887152.0
num_examples: 100000
download_size: 2220036667
dataset_size: 1901887152.0
- config_name: finqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 135268568.0
num_examples: 5276
download_size: 123698250
dataset_size: 135268568.0
- config_name: geomverse
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 951640204.0
num_examples: 9303
download_size: 323746516
dataset_size: 951640204.0
- config_name: hateful_memes
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 3035059823.0
num_examples: 8500
download_size: 3054208907
dataset_size: 3035059823.0
- config_name: hitab
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 161130580.0
num_examples: 2500
download_size: 158295807
dataset_size: 161130580.0
- config_name: iam
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1129180352.0
num_examples: 5663
download_size: 1128935602
dataset_size: 1129180352.0
- config_name: iconqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 264513634.7170419
num_examples: 27307
download_size: 326674337
dataset_size: 264513634.7170419
- config_name: infographic_vqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 291677986.0
num_examples: 2118
download_size: 292351760
dataset_size: 291677986.0
- config_name: intergps
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 24982328.291771192
num_examples: 1280
download_size: 24870320
dataset_size: 24982328.291771192
- config_name: localized_narratives
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 21380844262.41927
num_examples: 199998
download_size: 22164342699
dataset_size: 21380844262.41927
- config_name: mapqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 3238062926.0
num_examples: 37417
download_size: 3307676486
dataset_size: 3238062926.0
- config_name: mimic_cgd
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 12592929433.0
num_examples: 70939
download_size: 13147641100
dataset_size: 12592929433.0
- config_name: multihiertt
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1356766489.046
num_examples: 7619
download_size: 1360814135
dataset_size: 1356766489.046
- config_name: nlvr2
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 8375492591.0
num_examples: 50426
download_size: 10838882020
dataset_size: 8375492591.0
- config_name: ocrvqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 5467134439.0
num_examples: 165746
download_size: 6078073015
dataset_size: 5467134439.0
- config_name: okvqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 281454288182.492
num_examples: 9009
download_size: 3009062
dataset_size: 281454288182.492
- config_name: plotqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 7837605221.0
num_examples: 157070
download_size: 5320249066
dataset_size: 7837605221.0
- config_name: raven
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1506550467.0
num_examples: 42000
download_size: 1720691636
dataset_size: 1506550467.0
- config_name: rendered_text
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 11086896502.0
num_examples: 10000
download_size: 11086960376
dataset_size: 11086896502.0
- config_name: robut_sqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 679135952.0
num_examples: 8514
download_size: 678722272
dataset_size: 679135952.0
- config_name: robut_wikisql
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 5950915477.0
num_examples: 74989
download_size: 6160300141
dataset_size: 5950915477.0
- config_name: robut_wtq
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 4023729236.0
num_examples: 38246
download_size: 4061523247
dataset_size: 4023729236.0
- config_name: scienceqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 284601898.76188564
num_examples: 4976
download_size: 283265438
dataset_size: 284601898.76188564
- config_name: screen2words
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1670723783.0
num_examples: 15730
download_size: 1346254268
dataset_size: 1670723783.0
- config_name: spot_the_diff
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 1643123792.0
num_examples: 8566
download_size: 1526740548
dataset_size: 1643123792.0
- config_name: st_vqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 696265340.0
num_examples: 17247
download_size: 720462890
dataset_size: 696265340.0
- config_name: tabmwp
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 265337140.19648907
num_examples: 22722
download_size: 306643610
dataset_size: 265337140.19648907
- config_name: tallyqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 4267143189.0
num_examples: 98680
download_size: 4662245152
dataset_size: 4267143189.0
- config_name: tat_qa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 73213942.0
num_examples: 2199
download_size: 70862028
dataset_size: 73213942.0
- config_name: textcaps
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 5938676115.0
num_examples: 21953
download_size: 6175419911
dataset_size: 5938676115.0
- config_name: textvqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 5939437331.0
num_examples: 21953
download_size: 6175442839
dataset_size: 5939437331.0
- config_name: tqa
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 380346870.806369
num_examples: 1493
download_size: 378238311
dataset_size: 380346870.806369
- config_name: vistext
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 541250281.0
num_examples: 9969
download_size: 386023352
dataset_size: 541250281.0
- config_name: visual7w
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 4432168161.0
num_examples: 14366
download_size: 4443083495
dataset_size: 4432168161.0
- config_name: visualmrc
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 2941051627.2639995
num_examples: 3027
download_size: 2912911810
dataset_size: 2941051627.2639995
- config_name: vqarad
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 16561537.0
num_examples: 313
download_size: 16226241
dataset_size: 16561537.0
- config_name: vqav2
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 10630091683.0
num_examples: 82772
download_size: 13479302437
dataset_size: 10630091683.0
- config_name: vsr
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 107489763.0
num_examples: 2157
download_size: 107576214
dataset_size: 107489763.0
- config_name: websight
features:
- name: images
sequence: image
- name: texts
list:
- name: user
dtype: string
- name: assistant
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 2011365901.0
num_examples: 10000
download_size: 1601222161
dataset_size: 2011365901.0
configs:
- config_name: ai2d
data_files:
- split: train
path: ai2d/train-*
- config_name: aokvqa
data_files:
- split: train
path: aokvqa/train-*
- config_name: chart2text
data_files:
- split: train
path: chart2text/train-*
- config_name: chartqa
data_files:
- split: train
path: chartqa/train-*
- config_name: clevr
data_files:
- split: train
path: clevr/train-*
- config_name: clevr_math
data_files:
- split: train
path: clevr_math/train-*
- config_name: cocoqa
data_files:
- split: train
path: cocoqa/train-*
- config_name: datikz
data_files:
- split: train
path: datikz/train-*
- config_name: diagram_image_to_text
data_files:
- split: train
path: diagram_image_to_text/train-*
- config_name: docvqa
data_files:
- split: train
path: docvqa/train-*
- config_name: dvqa
data_files:
- split: train
path: dvqa/train-*
- config_name: figureqa
data_files:
- split: train
path: figureqa/train-*
- config_name: finqa
data_files:
- split: train
path: finqa/train-*
- config_name: geomverse
data_files:
- split: train
path: geomverse/train-*
- config_name: hateful_memes
data_files:
- split: train
path: hateful_memes/train-*
- config_name: hitab
data_files:
- split: train
path: hitab/train-*
- config_name: iam
data_files:
- split: train
path: iam/train-*
- config_name: iconqa
data_files:
- split: train
path: iconqa/train-*
- config_name: infographic_vqa
data_files:
- split: train
path: infographic_vqa/train-*
- config_name: intergps
data_files:
- split: train
path: intergps/train-*
- config_name: localized_narratives
data_files:
- split: train
path: localized_narratives/train-*
- config_name: mapqa
data_files:
- split: train
path: mapqa/train-*
- config_name: mimic_cgd
data_files:
- split: train
path: mimic_cgd/train-*
- config_name: multihiertt
data_files:
- split: train
path: multihiertt/train-*
- config_name: nlvr2
data_files:
- split: train
path: nlvr2/train-*
- config_name: ocrvqa
data_files:
- split: train
path: ocrvqa/train-*
- config_name: okvqa
data_files:
- split: train
path: okvqa/train-*
- config_name: plotqa
data_files:
- split: train
path: plotqa/train-*
- config_name: raven
data_files:
- split: train
path: raven/train-*
- config_name: rendered_text
data_files:
- split: train
path: rendered_text/train-*
- config_name: robut_sqa
data_files:
- split: train
path: robut_sqa/train-*
- config_name: robut_wikisql
data_files:
- split: train
path: robut_wikisql/train-*
- config_name: robut_wtq
data_files:
- split: train
path: robut_wtq/train-*
- config_name: scienceqa
data_files:
- split: train
path: scienceqa/train-*
- config_name: screen2words
data_files:
- split: train
path: screen2words/train-*
- config_name: spot_the_diff
data_files:
- split: train
path: spot_the_diff/train-*
- config_name: st_vqa
data_files:
- split: train
path: st_vqa/train-*
- config_name: tabmwp
data_files:
- split: train
path: tabmwp/train-*
- config_name: tallyqa
data_files:
- split: train
path: tallyqa/train-*
- config_name: tat_qa
data_files:
- split: train
path: tat_qa/train-*
- config_name: textcaps
data_files:
- split: train
path: textcaps/train-*
- config_name: textvqa
data_files:
- split: train
path: textvqa/train-*
- config_name: tqa
data_files:
- split: train
path: tqa/train-*
- config_name: vistext
data_files:
- split: train
path: vistext/train-*
- config_name: visual7w
data_files:
- split: train
path: visual7w/train-*
- config_name: visualmrc
data_files:
- split: train
path: visualmrc/train-*
- config_name: vqarad
data_files:
- split: train
path: vqarad/train-*
- config_name: vqav2
data_files:
- split: train
path: vqav2/train-*
- config_name: vsr
data_files:
- split: train
path: vsr/train-*
- config_name: websight
data_files:
- split: train
path: websight/train-*
---
# Dataset Card for The Cauldron

## Dataset description
The Cauldron is part of the Idefics2 release.
It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.
## Load the dataset
To load the dataset, install the library `datasets` with `pip install datasets`. Then,
```
from datasets import load_dataset
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
```
to download and load the config `ai2d` for example.
## Data fields
An example of a sample looks as follows:
```
{
"images" = [PIL.Image]
"texts" = [
{
"user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
"assistant": "Answer: D",
"source": "TQA"
}
]
}
```
In `images`, there is a list of images, to be placed before the text.
In `texts`, there is a conversation between a user and an assistant about the images that is represented by a list of turns.
## Stats about the datasets in The Cauldron
| Dataset | # images | # Q/A pairs | # tokens |
|----------------------|----------|-------------|------------|
| *General visual question answering* |
| VQAv2 | 82,772 | 443,757 | 1,595,929 |
| COCO-QA | 46,287 | 78,736 | 286,982 |
| Visual7W | 14,366 | 69,817 | 279,268 |
| A-OKVQA | 16,539 | 17,056 | 236,492 |
| TallyQA | 98,680 | 183,986 | 738,254 |
| OK-VQA | 8,998 | 9,009 | 38,853 |
| HatefulMemes | 8,500 | 8,500 | 25,500 |
| VQA-RAD | 313 | 1,793 | 8,418 |
| Captioning |
| LNarratives | 507,444 | 507,444 | 21,328,731 |
| Screen2Words | 15,730 | 15,743 | 143,103 |
| VSR | 2,157 | 3,354 | 10,062 |
| *OCR, document understanding, text transcription* |
| RenderedText | 999,000 | 999,000 | 27,207,774 |
| DocVQA | 10,189 | 39,463 | 337,829 |
| TextCaps | 21,953 | 21,953 | 389,658 |
| TextVQA | 21,953 | 34,602 | 181,918 |
| ST-VQA | 17,247 | 23,121 | 127,846 |
| OCR-VQA | 165,746 | 801,579 | 6,073,824 |
| VisualMRC | 3,027 | 11,988 | 168,828 |
| IAM | 5,663 | 5,663 | 144,216 |
| InfoVQA | 2,118 | 10,074 | 61,048 |
| Diagram image-to-text| 300 | 300 | 22,196 |
| *Chart/figure understanding* |
| Chart2Text | 26,985 | 30,242 | 2,852,827 |
| DVQA | 200,000 | 2,325,316 | 8,346,234 |
| VisText | 7,057 | 9,969 | 1,245,485 |
| ChartQA | 18,271 | 28,299 | 185,835 |
| PlotQA | 157,070 | 20,249,479 | 8478299.278|
| FigureQA | 100,000 | 1,327,368 | 3,982,104 |
| MapQA | 37,417 | 483,416 | 6,470,485 |
| *Table understanding* |
| TabMWP | 22,729 | 23,059 | 1,948,166 |
| TAT-QA | 2,199 | 13,215 | 283,776 |
| HiTab | 2,500 | 7,782 | 351,299 |
| MultiHiertt | 7,619 | 7,830 | 267,615 |
| FinQA | 5,276 | 6,251 | 242,561 |
| WikiSQL | 74,989 | 86,202 | 9,680,673 |
| SQA | 8,514 | 34,141 | 1,894,824 |
| WTQ | 38,246 | 44,096 | 6,677,013 |
| *Reasoning, logic, maths* |
| GeomVerse | 9,303 | 9,339 | 2,489,459 |
| CLEVR-Math | 70,000 | 788,650 | 3,184,656 |
| CLEVR | 70,000 | 699,989 | 2,396,781 |
| IconQA | 27,315 | 29,859 | 112,969 |
| RAVEN | 42,000 | 42,000 | 105,081 |
| Inter-GPs | 1,451 | 2,101 | 8,404 |
| *Textbook/academic questions* |
| AI2D | 3,099 | 9,708 | 38,832 |
| TQA | 1,496 | 6,501 | 26,004 |
| ScienceQA | 4,985 | 6,218 | 24,872 |
| *Differences between 2 images* |
| NLVR2 | 50,426 | 86,373 | 259,119 |
| GSD | 70,939 | 141,869 | 4,637,229 |
| Spot the diff | 8,566 | 9,524 | 221,477 |
| *Screenshot to code* |
| WebSight | 500,000 | 500,000 | 276,743,299|
| DaTikz | 47,974 | 48,296 | 59,556,252 |
## Decontamination
The Cauldron contains only the train split of each sub-datasets.
On top of that, we removed the few examples containing an image also present in the test splits of MMMU, MathVista or MMBench.
## References to the original datasets
<details>
<summary>References to the original datasets</summary>
@misc{AI2D,
title={A Diagram Is Worth A Dozen Images},
author={Aniruddha Kembhavi and Mike Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
year={2016},
eprint={1603.07396},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{A-OKVQA,
title={A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge},
author={Dustin Schwenk and Apoorv Khandelwal and Christopher Clark and Kenneth Marino and Roozbeh Mottaghi},
year={2022},
eprint={2206.01718},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{Chart2Text,
title = "Chart-to-Text: Generating Natural Language Descriptions for Charts by Adapting the Transformer Model",
author = "Obeid, Jason and
Hoque, Enamul",
editor = "Davis, Brian and
Graham, Yvette and
Kelleher, John and
Sripada, Yaji",
booktitle = "Proceedings of the 13th International Conference on Natural Language Generation",
month = dec,
year = "2020",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.inlg-1.20",
doi = "10.18653/v1/2020.inlg-1.20",
pages = "138--147",
}
@inproceedings{ChartQA,
title = "{C}hart{QA}: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning",
author = "Masry, Ahmed and
Long, Do and
Tan, Jia Qing and
Joty, Shafiq and
Hoque, Enamul",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-acl.177",
doi = "10.18653/v1/2022.findings-acl.177",
pages = "2263--2279",
}
@misc{CLEVR-Math,
doi = {10.48550/ARXIV.2208.05358},
url = {https://arxiv.org/abs/2208.05358},
author = {Lindström, Adam Dahlgren},
keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7; I.2.10; I.2.6; I.4.8; I.1.4},
title = {CLEVR-Math: A Dataset for Compositional Language, Visual, and Mathematical Reasoning},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution Share Alike 4.0 International}
}
@misc{CLEVR,
title={CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
author={Justin Johnson and Bharath Hariharan and Laurens van der Maaten and Li Fei-Fei and C. Lawrence Zitnick and Ross Girshick},
year={2016},
eprint={1612.06890},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{CocoQA,
author = {Ren, Mengye and Kiros, Ryan and Zemel, Richard},
booktitle = {Advances in Neural Information Processing Systems},
editor = {C. Cortes and N. Lawrence and D. Lee and M. Sugiyama and R. Garnett},
pages = {},
publisher = {Curran Associates, Inc.},
title = {Exploring Models and Data for Image Question Answering},
url = {https://proceedings.neurips.cc/paper_files/paper/2015/file/831c2f88a604a07ca94314b56a4921b8-Paper.pdf},
volume = {28},
year = {2015}
}
@misc{DaTikz,
title={AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ},
author={Jonas Belouadi and Anne Lauscher and Steffen Eger},
year={2024},
eprint={2310.00367},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Diagram image to text: https://huggingface.co/datasets/Kamizuru00/diagram_image_to_text by @Kamizuru00
@INPROCEEDINGS{DocVQA,
author={Mathew, Minesh and Karatzas, Dimosthenis and Jawahar, C. V.},
booktitle={2021 IEEE Winter Conference on Applications of Computer Vision (WACV)},
title={DocVQA: A Dataset for VQA on Document Images},
year={2021},
volume={},
number={},
pages={2199-2208},
keywords={Visualization;Computer vision;Text analysis;Image recognition;Image analysis;Conferences;Layout},
doi={10.1109/WACV48630.2021.00225}}
@inproceedings{DVQA,
title={DVQA: Understanding Data Visualizations via Question Answering},
author={Kafle, Kushal and Cohen, Scott and Price, Brian and Kanan, Christopher},
booktitle={CVPR},
year={2018}
}
@misc{FigureQA,
title={FigureQA: An Annotated Figure Dataset for Visual Reasoning},
author={Samira Ebrahimi Kahou and Vincent Michalski and Adam Atkinson and Akos Kadar and Adam Trischler and Yoshua Bengio},
year={2018},
eprint={1710.07300},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{FinQA,
title = "{F}in{QA}: A Dataset of Numerical Reasoning over Financial Data",
author = "Chen, Zhiyu and
Chen, Wenhu and
Smiley, Charese and
Shah, Sameena and
Borova, Iana and
Langdon, Dylan and
Moussa, Reema and
Beane, Matt and
Huang, Ting-Hao and
Routledge, Bryan and
Wang, William Yang",
editor = "Moens, Marie-Francine and
Huang, Xuanjing and
Specia, Lucia and
Yih, Scott Wen-tau",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.300",
doi = "10.18653/v1/2021.emnlp-main.300",
pages = "3697--3711",
}
@misc{GeomVerse,
title={GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning},
author={Mehran Kazemi and Hamidreza Alvari and Ankit Anand and Jialin Wu and Xi Chen and Radu Soricut},
year={2023},
eprint={2312.12241},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{hatefulmeme,
author = {Kiela, Douwe and Firooz, Hamed and Mohan, Aravind and Goswami, Vedanuj and Singh, Amanpreet and Ringshia, Pratik and Testuggine, Davide},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. Lin},
pages = {2611--2624},
publisher = {Curran Associates, Inc.},
title = {The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes},
url = {https://proceedings.neurips.cc/paper_files/paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf},
volume = {33},
year = {2020}
}
@inproceedings{Hitab,
title = "{H}i{T}ab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation",
author = "Cheng, Zhoujun and
Dong, Haoyu and
Wang, Zhiruo and
Jia, Ran and
Guo, Jiaqi and
Gao, Yan and
Han, Shi and
Lou, Jian-Guang and
Zhang, Dongmei",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.78",
doi = "10.18653/v1/2022.acl-long.78",
pages = "1094--1110",
}
@article{IAM,
author = {Marti, Urs-Viktor and Bunke, H.},
year = {2002},
month = {11},
pages = {39-46},
title = {The IAM-database: An English sentence database for offline handwriting recognition},
volume = {5},
journal = {International Journal on Document Analysis and Recognition},
doi = {10.1007/s100320200071}
}
@inproceedings{IconQA,
title = {IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning},
author = {Lu, Pan and Qiu, Liang and Chen, Jiaqi and Xia, Tony and Zhao, Yizhou and Zhang, Wei and Yu, Zhou and Liang, Xiaodan and Zhu, Song-Chun},
booktitle = {The 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks},
year = {2021}
}
@INPROCEEDINGS{InfographicVQA,
author={Mathew, Minesh and Bagal, Viraj and Tito, Rubèn and Karatzas, Dimosthenis and Valveny, Ernest and Jawahar, C. V.},
booktitle={2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
title={InfographicVQA},
year={2022},
volume={},
number={},
pages={2582-2591},
keywords={Visualization;Computer vision;Computational modeling;Layout;Data visualization;Benchmark testing;Brain modeling;Document Analysis Datasets;Evaluation and Comparison of Vision Algorithms;Vision and Languages},
doi={10.1109/WACV51458.2022.00264}
}
@inproceedings{Inter-GPS,
title = {Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning},
author = {Lu, Pan and Gong, Ran and Jiang, Shibiao and Qiu, Liang and Huang, Siyuan and Liang, Xiaodan and Zhu, Song-Chun},
booktitle = {The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)},
year = {2021}
}
@misc{LocalizedNarratives,
title={Connecting Vision and Language with Localized Narratives},
author={Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
year={2020},
eprint={1912.03098},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{MapQA,
title={MapQA: A Dataset for Question Answering on Choropleth Maps},
author={Shuaichen Chang and David Palzer and Jialin Li and Eric Fosler-Lussier and Ningchuan Xiao},
year={2022},
eprint={2211.08545},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{MIMIC-IT-General-Scene-Difference,
title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
year={2023},
eprint={2306.05425},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{Multihiertt,
title = "{M}ulti{H}iertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data",
author = "Zhao, Yilun and
Li, Yunxiang and
Li, Chenying and
Zhang, Rui",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.454",
pages = "6588--6600",
}
@inproceedings{NLVR2,
title = "A Corpus for Reasoning about Natural Language Grounded in Photographs",
author = "Suhr, Alane and
Zhou, Stephanie and
Zhang, Ally and
Zhang, Iris and
Bai, Huajun and
Artzi, Yoav",
editor = "Korhonen, Anna and
Traum, David and
M{\`a}rquez, Llu{\'\i}s",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1644",
doi = "10.18653/v1/P19-1644",
pages = "6418--6428",
}
@INPROCEEDINGS{OCR-VQA,
author={Mishra, Anand and Shekhar, Shashank and Singh, Ajeet Kumar and Chakraborty, Anirban},
booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
title={OCR-VQA: Visual Question Answering by Reading Text in Images},
year={2019},
volume={},
number={},
pages={947-952},
keywords={Optical character recognition software;Visualization;Task analysis;Knowledge discovery;Text analysis;Text recognition;Character recognition;Optical Character Recognition (OCR), Visual Question Answering (VQA), Document image analysis, textVQA},
doi={10.1109/ICDAR.2019.00156}
}
@InProceedings{okvqa,
author = {Kenneth Marino and Mohammad Rastegari and Ali Farhadi and Roozbeh Mottaghi},
title = {OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2019},
}
@InProceedings{PlotQA,
author = {Methani, Nitesh and Ganguly, Pritha and Khapra, Mitesh M. and Kumar, Pratyush},
title = {PlotQA: Reasoning over Scientific Plots},
booktitle = {The IEEE Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2020}
}
@inproceedings{RAVEN,
title={RAVEN: A Dataset for Relational and Analogical Visual rEasoNing},
author={Zhang, Chi and Gao, Feng and Jia, Baoxiong and Zhu, Yixin and Zhu, Song-Chun},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2019}
}
RenderedText: https://huggingface.co/datasets/wendlerc/RenderedText by @wendlerc
@inproceedings{Robut,
title = "{R}obu{T}: A Systematic Study of Table {QA} Robustness Against Human-Annotated Adversarial Perturbations",
author = "Zhao, Yilun and
Zhao, Chen and
Nan, Linyong and
Qi, Zhenting and
Zhang, Wenlin and
Tang, Xiangru and
Mi, Boyu and
Radev, Dragomir",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.334",
doi = "10.18653/v1/2023.acl-long.334",
pages = "6064--6081",
}
@inproceedings{SQA,
title = "Search-based Neural Structured Learning for Sequential Question Answering",
author = "Iyyer, Mohit and
Yih, Wen-tau and
Chang, Ming-Wei",
editor = "Barzilay, Regina and
Kan, Min-Yen",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P17-1167",
doi = "10.18653/v1/P17-1167",
pages = "1821--1831",
}
@misc{WikiSQL,
title={Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
author={Victor Zhong and Caiming Xiong and Richard Socher},
year={2017},
eprint={1709.00103},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{WTQ,
title = "Compositional Semantic Parsing on Semi-Structured Tables",
author = "Pasupat, Panupong and
Liang, Percy",
editor = "Zong, Chengqing and
Strube, Michael",
booktitle = "Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = jul,
year = "2015",
address = "Beijing, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P15-1142",
doi = "10.3115/v1/P15-1142",
pages = "1470--1480",
}
@inproceedings{ScienceQA,
author = {Lu, Pan and Mishra, Swaroop and Xia, Tanglin and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
booktitle = {Advances in Neural Information Processing Systems},
editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
pages = {2507--2521},
publisher = {Curran Associates, Inc.},
title = {Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf},
volume = {35},
year = {2022}
}
@inproceedings{screen2words,
author = {Wang, Bryan and Li, Gang and Zhou, Xin and Chen, Zhourong and Grossman, Tovi and Li, Yang},
title = {Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning},
year = {2021},
isbn = {9781450386357},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3472749.3474765},
doi = {10.1145/3472749.3474765},
booktitle = {The 34th Annual ACM Symposium on User Interface Software and Technology},
pages = {498–510},
numpages = {13},
keywords = {Mobile UI summarization, dataset., deep learning, language-based UI, screen understanding},
location = {Virtual Event, USA},
series = {UIST '21}
}
@inproceedings{SpotTheDiff,
title = "Learning to Describe Differences Between Pairs of Similar Images",
author = "Jhamtani, Harsh and
others",
editor = "Riloff, Ellen and
Chiang, David and
Hockenmaier, Julia and
Tsujii, Jun{'}ichi",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
month = oct # "-" # nov,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D18-1436",
doi = "10.18653/v1/D18-1436",
pages = "4024--4034",
}
@INPROCEEDINGS{STVQA,
author={Biten, Ali Furkan and Tito, Rubèn and Mafla, Andrés and Gomez, Lluis and Rusiñol, Marçal and Jawahar, C.V. and Valveny, Ernest and Karatzas, Dimosthenis},
booktitle={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
title={Scene Text Visual Question Answering},
year={2019},
volume={},
number={},
pages={4290-4300},
keywords={Visualization;Task analysis;Knowledge discovery;Text recognition;Cognition;Computer vision;Semantics},
doi={10.1109/ICCV.2019.00439}
}
@inproceedings{TabMWP,
title={Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning},
author={Lu, Pan and Qiu, Liang and Chang, Kai-Wei and Wu, Ying Nian and Zhu, Song-Chun and Rajpurohit, Tanmay and Clark, Peter and Kalyan, Ashwin},
booktitle={International Conference on Learning Representations (ICLR)},
year={2023}
}
@inproceedings{TallyQA,
title={TallyQA: Answering Complex Counting Questions},
author={Acharya, Manoj and Kafle, Kushal and Kanan, Christopher},
booktitle={AAAI},
year={2019}
}
@inproceedings{TAT-QA,
title = "{TAT}-{QA}: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance",
author = "Zhu, Fengbin and
Lei, Wenqiang and
Huang, Youcheng and
Wang, Chao and
Zhang, Shuo and
Lv, Jiancheng and
Feng, Fuli and
Chua, Tat-Seng",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.254",
doi = "10.18653/v1/2021.acl-long.254",
pages = "3277--3287"
}
@misc{textcaps,
title={TextCaps: a Dataset for Image Captioning with Reading Comprehension},
author={Oleksii Sidorov and Ronghang Hu and Marcus Rohrbach and Amanpreet Singh},
year={2020},
eprint={2003.12462},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{textvqa,
title={Towards VQA Models That Can Read},
author={Singh, Amanpreet and Natarjan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Parikh, Devi and Rohrbach, Marcus},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={8317-8326},
year={2019}
}
@INPROCEEDINGS{TQA,
author={Kembhavi, Aniruddha and Seo, Minjoon and Schwenk, Dustin and Choi, Jonghyun and Farhadi, Ali and Hajishirzi, Hannaneh},
booktitle={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension},
year={2017},
volume={},
number={},
pages={5376-5384},
keywords={Knowledge discovery;Visualization;Cognition;Training;Natural languages;Computer vision},
doi={10.1109/CVPR.2017.571}
}
@inproceedings{VisText,
title = {{VisText: A Benchmark for Semantically Rich Chart Captioning}},
author = {Benny J. Tang AND Angie Boggust AND Arvind Satyanarayan},
booktitle = {The Annual Meeting of the Association for Computational Linguistics (ACL)},
year = {2023},
url = {http://vis.csail.mit.edu/pubs/vistext}
}
@InProceedings{Visual7w,
title = {{Visual7W: Grounded Question Answering in Images}},
author = {Yuke Zhu and Oliver Groth and Michael Bernstein and Li Fei-Fei},
booktitle = {{IEEE Conference on Computer Vision and Pattern Recognition}},
year = 2016,
}
@inproceedings{VisualMRC,
author = {Ryota Tanaka and
Kyosuke Nishida and
Sen Yoshida},
title = {VisualMRC: Machine Reading Comprehension on Document Images},
booktitle = {AAAI},
year = {2021}
}
@article{VQA-RAD,
author = {Lau, Jason and Gayen, Soumya and Ben Abacha, Asma and Demner-Fushman, Dina},
year = {2018},
month = {11},
pages = {180251},
title = {A dataset of clinically generated visual questions and answers about radiology images},
volume = {5},
journal = {Scientific Data},
doi = {10.1038/sdata.2018.251}
}
@misc{VQAv2,
title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
author={Yash Goyal and Tejas Khot and Douglas Summers-Stay and Dhruv Batra and Devi Parikh},
year={2017},
eprint={1612.00837},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{VSR,
title={Visual Spatial Reasoning},
author={Fangyu Liu and Guy Emerson and Nigel Collier},
year={2023},
eprint={2205.00363},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{WebSight,
title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset},
author={Hugo Laurençon and Léo Tronchon and Victor Sanh},
year={2024},
eprint={2403.09029},
archivePrefix={arXiv},
primaryClass={cs.HC}
}
</details>
## Licensing Information
Each of the publicly available sub-datasets present in the Cauldron are governed by specific licensing conditions. Therefore, when making use of them you must take into consideration each of the licenses governing each dataset.
To the extent we have any rights in the prompts, these are licensed under CC-BY-4.0.
## Citation Information
If you are using this dataset, please cite
```
@misc{laurençon2024matters,
title={What matters when building vision-language models?},
author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
year={2024},
eprint={2405.02246},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
### 数据集信息
dataset_info:
- 配置名称: ai2d
特征:
- 名称: images
序列: 图像
- 名称: texts
列表:
- 名称: user
数据类型: 字符串
- 名称: assistant
数据类型: 字符串
- 名称: source
数据类型: 字符串
拆分:
- 名称: train
字节数: 435362437.84770346
样本数: 2434
下载大小: 438136609
数据集大小: 435362437.84770346
- 配置名称: aokvqa
特征:
- 名称: images
序列: 图像
- 名称: texts
列表:
- 名称: user
数据类型: 字符串
- 名称: assistant
数据类型: 字符串
- 名称: source
数据类型: 字符串
拆分:
- 名称: train
字节数: 871997710.0
样本数: 16539
下载大小: 893265070
数据集大小: 871997710.0
...其余配置项格式一致,此处省略重复内容
# 数据集卡片:The Cauldron

## 数据集概述
The Cauldron 隶属于 Idefics2 发布项目,是一个包含50个视觉语言数据集(仅训练拆分)的超大规模集合,用于对视觉语言模型 Idefics2 进行微调。
## 数据集加载
如需加载本数据集,请先通过 `pip install datasets` 安装 `datasets` 库,随后可使用如下代码加载数据集:
python
from datasets import load_dataset
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
上述代码将下载并加载名为`ai2d`的配置子集。
## 数据字段
一个典型的样本示例如下:
python
{
"images": [PIL.Image],
"texts": [
{
"user": "Question: How many actions are depicted in the diagram?
Choices:
A. 6.
B. 4.
C. 8.
D. 7.
Answer with the letter.",
"assistant": "Answer: D",
"source": "TQA"
}
]
}
其中,`images` 为图像列表,需置于对应文本之前;`texts` 为用户与助手围绕图像展开的多轮对话列表,每个元素代表一轮交互。
## The Cauldron 数据集统计详情
| 数据集类别 | 数据集名称 | 图像数量 | 问答对数量 | Token 数量 |
|--------------------------|----------------------|----------|-------------|------------------|
| *通用视觉问答* | | | | |
| | VQAv2 | 82,772 | 443,757 | 1,595,929 |
| | COCO-QA | 46,287 | 78,736 | 286,982 |
| | Visual7W | 14,366 | 69,817 | 279,268 |
| | A-OKVQA | 16,539 | 17,056 | 236,492 |
| | TallyQA | 98,680 | 183,986 | 738,254 |
| | OK-VQA | 8,998 | 9,009 | 38,853 |
| | HatefulMemes | 8,500 | 8,500 | 25,500 |
| | VQA-RAD | 313 | 1,793 | 8,418 |
| *图像描述* | | | | |
| | Localized Narratives | 507,444 | 507,444 | 21,328,731 |
| | Screen2Words | 15,730 | 15,743 | 143,103 |
| | VSR | 2,157 | 3,354 | 10,062 |
| *OCR、文档理解与文本转录* | | | | |
| | RenderedText | 999,000 | 999,000 | 27,207,774 |
| | DocVQA | 10,189 | 39,463 | 337,829 |
| | TextCaps | 21,953 | 21,953 | 389,658 |
| | TextVQA | 21,953 | 34,602 | 181,918 |
| | ST-VQA | 17,247 | 23,121 | 127,846 |
| | OCR-VQA | 165,746 | 801,579 | 6,073,824 |
| | VisualMRC | 3,027 | 11,988 | 168,828 |
| | IAM | 5,663 | 5,663 | 144,216 |
| | InfographicVQA | 2,118 | 10,074 | 61,048 |
| | Diagram image-to-text| 300 | 300 | 22,196 |
| *图表/图形理解* | | | | |
| | Chart2Text | 26,985 | 30,242 | 2,852,827 |
| | DVQA | 200,000 | 2,325,316 | 8,346,234 |
| | VisText | 7,057 | 9,969 | 1,245,485 |
| | ChartQA | 18,271 | 28,299 | 185,835 |
| | PlotQA | 157,070 | 20,249,479 | 8,478,299.28 |
| | FigureQA | 100,000 | 1,327,368 | 3,982,104 |
| | MapQA | 37,417 | 483,416 | 6,470,485 |
| *表格理解* | | | | |
| | TabMWP | 22,729 | 23,059 | 1,948,166 |
| | TAT-QA | 2,199 | 13,215 | 283,776 |
| | HiTab | 2,500 | 7,782 | 351,299 |
| | MultiHiertt | 7,619 | 7,830 | 267,615 |
| | FinQA | 5,276 | 6,251 | 242,561 |
| | WikiSQL | 74,989 | 86,202 | 9,680,673 |
| | SQA | 8,514 | 34,141 | 1,894,824 |
| | WTQ | 38,246 | 44,096 | 6,677,013 |
| *推理、逻辑与数学* | | | | |
| | GeomVerse | 9,303 | 9,339 | 2,489,459 |
| | CLEVR-Math | 70,000 | 788,650 | 3,184,656 |
| | CLEVR | 70,000 | 699,989 | 2,396,781 |
| | IconQA | 27,315 | 29,859 | 112,969 |
| | RAVEN | 42,000 | 42,000 | 105,081 |
| | Inter-GPs | 1,451 | 2,101 | 8,404 |
| *教科书/学术问答* | | | | |
| | AI2D | 3,099 | 9,708 | 38,832 |
| | TQA | 1,496 | 6,501 | 26,004 |
| | ScienceQA | 4,985 | 6,218 | 24,872 |
| *双图像差异识别* | | | | |
| | NLVR2 | 50,426 | 86,373 | 259,119 |
| | GSD | 70,939 | 141,869 | 4,637,229 |
| | Spot the diff | 8,566 | 9,524 | 221,477 |
| *截图转代码* | | | | |
| | WebSight | 500,000 | 500,000 | 276,743,299 |
| | DaTikz | 47,974 | 48,296 | 59,556,252 |
## 数据去重与清洗
The Cauldron 仅包含各子数据集的训练拆分。此外,我们已移除少量包含与 MMMU、MathVista 或 MMBench 测试集中重复图像的样本。
## 原始数据集参考文献
<details>
<summary>展开查看原始数据集参考文献</summary>
本数据集所包含的各原始数据集的引用信息如下,完整内容可参考原文:
bibtex
@misc{AI2D,
title={A Diagram Is Worth A Dozen Images},
author={Aniruddha Kembhavi and Mike Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
year={2016},
eprint={1603.07396},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{A-OKVQA,
title={A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge},
author={Dustin Schwenk and Apoorv Khandelwal and Christopher Clark and Kenneth Marino and Roozbeh Mottaghi},
year={2022},
eprint={2206.01718},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
</details>
## 许可信息
本数据集所包含的所有公开可用子数据集均受各自的许可条款约束,因此在使用时需遵守各数据集对应的许可协议。就本数据集的提示部分而言,我们将其以 CC-BY-4.0 协议进行授权。
## 引用规范
若您在学术工作中使用本数据集,请引用如下文献:
bibtex
@misc{laurençon2024matters,
title={What matters when building vision-language models?},
author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
year={2024},
eprint={2405.02246},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
提供机构:
HuggingFaceM4
原始信息汇总
数据集概述
数据集列表
| 配置名称 | 特征 | 训练集详情 |
|---|---|---|
| ai2d | images, texts | 2434 examples, 435362437.84770346 bytes |
| aokvqa | images, texts | 16539 examples, 871997710.0 bytes |
| chart2text | images, texts | 26961 examples, 1060566797.2728182 bytes |
| chartqa | images, texts | 18265 examples, 784719364.9441738 bytes |
| clevr | images, texts | 70000 examples, 11522617868.0 bytes |
| clevr_math | images, texts | 70000 examples, 13308311206.0 bytes |
| cocoqa | images, texts | 46287 examples, 2213960474.0 bytes |
| datikz | images, texts | 47974 examples, 481233278.0 bytes |
| diagram_image_to_text | images, texts | 300 examples, 18877197.0 bytes |
| docvqa | images, texts | 10189 examples, 6885686042.0 bytes |
| dvqa | images, texts | 200000 examples, 3689940101.0 bytes |
| figureqa | images, texts | 100000 examples, 1901887152.0 bytes |
| finqa | images, texts | 5276 examples, 135268568.0 bytes |
| geomverse | images, texts | 9303 examples, 951640204.0 bytes |
| hateful_memes | images, texts | 8500 examples, 3035059823.0 bytes |
| hitab | images, texts | 2500 examples, 161130580.0 bytes |
| iam | images, texts | 5663 examples, 1129180352.0 bytes |
| iconqa | images, texts | 27307 examples, 264513634.7170419 bytes |
| infographic_vqa | images, texts | 2118 examples, 291677986.0 bytes |
| intergps | images, texts | 1280 examples, 24982328.291771192 bytes |
| localized_narratives | images, texts | 199998 examples, 21380844262.41927 bytes |
| mapqa | images, texts | 37417 examples, 3238062926.0 bytes |
| mimic_cgd | images, texts | 70939 examples, 12592929433.0 bytes |
| multihiertt | images, texts | 7619 examples, 1356766489.046 bytes |
| nlvr2 | images, texts | 50426 examples, 8375492591.0 bytes |
| ocrvqa | images, texts | 165746 examples, 5467134439.0 bytes |
| okvqa | images, texts | 9009 examples, 281454288182.492 bytes |
| plotqa | images, texts | 157070 examples, 7837605221.0 bytes |
| raven | images, texts | 42000 examples, 1506550467.0 bytes |
| rendered_text | images, texts | 10000 examples, 11086896502.0 bytes |
| robut_sqa | images, texts | 8514 examples, 679135952.0 bytes |
| robut_wikisql | images, texts | 74989 examples, 5950915477.0 bytes |
| robut_wtq | images, texts | 38246 examples, 4023729236.0 bytes |
| scienceqa | images, texts | 4976 examples, 284601898.76188564 bytes |
| screen2words | images, texts | 15730 examples, 1670723783.0 bytes |
| spot_the_diff | images, texts | 8566 examples, 1643123792.0 bytes |
| st_vqa | images, texts | 17247 examples, 696265340.0 bytes |
| tabmwp | images, texts | 22722 examples, 265337140.19648907 bytes |
| tallyqa | images, texts | 98680 examples, 4267143189.0 bytes |
| tat_qa | images, texts | 2199 examples, 73213942.0 bytes |
| textcaps | images, texts | 21953 examples, 5938676115.0 bytes |
| textvqa | images, texts | 21953 examples, 5939437331.0 bytes |
| tqa | images, texts | 1493 examples, 380346870.806369 bytes |
| vistext | images, texts | 9969 examples, 541250281.0 bytes |
| visual7w | images, texts | 14366 examples, 4432168161.0 bytes |
| visualmrc | images, texts | 3027 examples, 2941051627.2639995 bytes |
| vqarad | images, texts | 313 examples, 16561537.0 bytes |
| vqav2 | images, texts | 82772 examples, 10630091683.0 bytes |
| vsr | images, texts | 2157 examples, 107489763.0 bytes |
| websight | images, texts | 10000 examples, 2011365901.0 bytes |
特征描述
- images: 图像数据,类型为序列(sequence)。
- texts: 文本数据,包含以下子特征:
- user: 数据类型为字符串(string)。
- assistant: 数据类型为字符串(string)。
- source: 数据类型为字符串(string)。
搜集汇总
数据集介绍

构建方式
HuggingFaceM4/the_cauldron数据集构建方式涉及多个子数据集,每个子数据集均包含图像和文本信息。图像以序列形式存储,而文本信息则包括用户、助手的提问和回答以及来源。这些子数据集在训练集上的数据量从几百到几十万不等,数据集大小从几百兆到几十吉不等。每个子数据集都有其独特的下载大小和训练集大小,这为研究提供了丰富的选择。
使用方法
使用HuggingFaceM4/the_cauldron数据集的方法首先需要下载对应的数据集。数据集下载后,可以通过HuggingFace提供的API进行访问。用户可以根据需要选择不同的子数据集和训练集,然后使用HuggingFace的API进行数据加载和预处理。在数据加载后,用户可以进行模型训练、评估等操作。
背景与挑战
背景概述
HuggingFaceM4/the_cauldron数据集是一个包含多种视觉问答任务的大型数据集,旨在促进视觉问答领域的研究。该数据集包含了图像、文本以及用户和助手之间的对话,涵盖了从简单到复杂的各种问答任务。该数据集的创建时间、主要研究人员或机构、核心研究问题以及对相关领域的影响力等信息并未在提供的内容中明确提及。然而,从数据集的结构和内容来看,它无疑对视觉问答领域的研究起到了积极的推动作用。
当前挑战
HuggingFaceM4/the_cauldron数据集在推动视觉问答领域研究的同时,也面临着一些挑战。首先,数据集的规模和复杂性使得训练和评估模型变得更加困难。其次,数据集的多样性可能带来数据分布的不均衡,需要更多的研究来解决模型在处理不同类型问题时的性能差异。此外,视觉问答任务的复杂性也要求模型能够理解和处理视觉和文本信息之间的复杂关系,这需要更深入的研究和更先进的模型设计。
常用场景
经典使用场景
HuggingFaceM4/the_cauldron 数据集是一个多模态数据集,包含了图像和文本数据。其最经典的使用场景包括视觉问答(Visual Question Answering, VQA)任务,如图像描述生成、图像问答等。这些任务通常需要模型理解图像内容并回答相关问题,例如描述图像内容、回答关于图像的问题等。
解决学术问题
HuggingFaceM4/the_cauldron 数据集解决了多模态数据理解与分析的学术研究问题。该数据集包含了大量图像和文本数据,为研究人员提供了丰富的实验材料。通过使用该数据集,研究人员可以研究如何将图像和文本信息有效结合,提高模型的准确性和鲁棒性。此外,该数据集还促进了多模态数据预处理、特征提取、模型训练等方面的研究。
实际应用
HuggingFaceM4/the_cauldron 数据集在实际应用场景中具有广泛的应用价值。例如,在图像描述生成领域,该数据集可以帮助计算机视觉系统更好地理解和描述图像内容,从而为用户提供更加准确的图像描述信息。在图像问答领域,该数据集可以帮助计算机视觉系统更好地理解和回答关于图像的问题,从而为用户提供更加丰富的图像相关信息。此外,该数据集还可以应用于图像检索、图像标注、图像分类等领域。
数据集最近研究
最新研究方向
HuggingFaceM4/the_cauldron数据集的最新研究方向主要聚焦于图像与文本的交互理解,尤其是在视觉问答、图像描述生成和视觉推理等任务上。研究者们正致力于提升模型对于复杂场景的解析能力,以及对于用户查询的准确响应,以推动人工智能在视觉认知领域的应用发展。
以上内容由遇见数据集搜集并总结生成



