Reubencf/magazines-multilingual-vqa
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Reubencf/magazines-multilingual-vqa
---
language:
- ar
- de
- en
- es
- fr
- hi
- it
- ja
- pt
- zh
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-to-text
pretty_name: Magazines Multilingual VQA
size_categories:
- 10K<n<100K
tags:
- ocr
- vqa
- multilingual
- document-ai
- magazines
---
# Magazines Multilingual VQA
A multilingual Visual Question Answering dataset built from 29,039 public-domain magazine and newspaper pages sourced from [archive.org](https://archive.org). Each page has:
- **Verbatim OCR** in the page's native language
- **English description, page type, scan-quality metadata**
- **One grounded VQA pair** in one of 10 target languages (round-robin assigned)
- **Full provenance and license** carried from the source archive.org record
Annotations were generated with Google Gemma 4 31B, served via vLLM.
## Dataset fields
| Field | Description |
|---|---|
| `image` | Page image (JPEG) |
| `ocr_text` | Verbatim text extraction in native language |
| `page_language` | ISO-639-1 code of the page's language |
| `description` | 2-3 sentence English description |
| `page_type` | article / cover / advertisement / index / ... |
| `nsfw_flag` | Safety flag (almost always false) |
| `scan_quality` | clean / readable / degraded / illegible |
| `ocr_confidence` | Model's self-assessed OCR accuracy |
| `qa_language` | Language of the QA pair (round-robin) |
| `question` | Question about the page in `qa_language` |
| `answer` | Grounded answer in `qa_language` |
| `question_type` | ocr / visual / comprehension / entity / layout / counting / inference |
| `difficulty` | easy / medium / hard |
| `source_title` | Magazine/newspaper title |
| `publication_date` | Issue date |
| `creator` | Author / creator (if available) |
| `publisher` | Publisher (if available) |
| `archive_identifier` | archive.org identifier |
| `archive_url` | `https://archive.org/details/<identifier>` |
| `license` | Short license label (e.g. public-domain, cc-by, cc-by-sa) |
| `license_url` | Full license URL from archive.org |
| `subject` | Archive subject tags |
| `category` | Scraping category (language + media type) |
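
The fields above could be used to select records as in the following minimal sketch, which assumes each record is available as a plain Python dict keyed by the field names in the table. The helper names `select_pages` and `archive_url_for`, the sample records, and the best-to-worst ordering of `scan_quality` labels are assumptions made for illustration. Loading through the `datasets` library with a `train` split is also an assumption and appears only as a comment.

```python
from typing import Iterable, Iterator

# Scan-quality labels as listed in the table; treating this as a
# best-to-worst ordering is an assumption.
QUALITY_ORDER = ["clean", "readable", "degraded", "illegible"]


def archive_url_for(identifier: str) -> str:
    """Rebuild the archive.org details URL from `archive_identifier`."""
    return f"https://archive.org/details/{identifier}"


def select_pages(records: Iterable[dict], qa_language: str,
                 worst_quality: str = "readable") -> Iterator[dict]:
    """Yield records whose QA pair is in `qa_language` and whose
    scan quality is at least as good as `worst_quality`."""
    limit = QUALITY_ORDER.index(worst_quality)
    for rec in records:
        if rec["qa_language"] != qa_language:
            continue
        if QUALITY_ORDER.index(rec["scan_quality"]) <= limit:
            yield rec


# Hypothetical records for illustration; real rows carry all listed fields.
sample = [
    {"qa_language": "fr", "scan_quality": "clean", "archive_identifier": "example-item-1"},
    {"qa_language": "fr", "scan_quality": "degraded", "archive_identifier": "example-item-2"},
    {"qa_language": "ja", "scan_quality": "clean", "archive_identifier": "example-item-3"},
]
selected = list(select_pages(sample, "fr"))
print([archive_url_for(r["archive_identifier"]) for r in selected])

# With the Hugging Face `datasets` library installed, the same helper could
# be applied to the hosted data (the split name "train" is an assumption):
#   from datasets import load_dataset
#   ds = load_dataset("Reubencf/magazines-multilingual-vqa", split="train")
#   french = list(select_pages(ds, "fr"))
```

A `scan_quality` value outside the four listed labels would raise `ValueError` here, which may be preferable to silently keeping unknown rows.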
## Pages by source language
| Language | Pages |
|---|---|
| ger | 4,412 |
| French | 3,279 |
| rus | 2,762 |
| por | 2,047 |
| vie | 1,637 |
| ben | 1,598 |
| fre | 1,279 |
| mai | 1,026 |
| English; Hindi | 1,004 |
| German | 839 |
| afr | 826 |
| jpn | 580 |
| hin | 521 |
| arabic | 512 |
| urd | 494 |
| tam | 482 |
| ita | 464 |
| guj | 445 |
| per | 419 |
| nep | 391 |
| tel | 384 |
| tur | 379 |
| chi | 347 |
| ind | 347 |
| pan | 334 |
| dut | 286 |
| mal | 209 |
| English; Kannada | 186 |
| russian | 173 |
| ara | 171 |
| tha | 154 |
| amh | 123 |
| yid | 108 |
| Italian | 106 |
| sat | 85 |
| Albanian; Azerbijani; Brahui; Chichewa; Chinese; English; French; German; Greek; Indonesian; Kashmiri; Kazakh; Korean; Macedonian; Malayalam; Myanmar; PersianFarsi; Philipines; Portugese; Russian; Somali; Spanish; Swedish; Thai; Tamil; Turkish; Urdu; Vietnamese; Yoroba | 71 |
| tel; English | 64 |
| ori | 61 |
| snd | 57 |
| fas | 46 |
| eng; ara; per | 44 |
| Tamil; English | 43 |
| kan | 40 |
| mar | 31 |
| Chinese | 30 |
| spa | 29 |
| ind; Indonesian | 27 |
| som | 25 |
| Telugu | 22 |
| Russian | 21 |
| Arabic | 16 |
| English; Malayalam | 2 |
| Dutch | 1 |
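
The labels in this table appear to be raw source metadata, so one language can show up under several spellings (for example `ger` and `German`, or `French` and `fre`), and multi-language items are joined with `;`. The following is a minimal sketch of merging such labels, assuming a small hand-written alias map. `ALIASES` is hypothetical and covers only a few rows; unknown labels are passed through unchanged.

```python
from collections import Counter

# Hypothetical, partial alias map from raw labels to ISO 639-1 codes.
ALIASES = {
    "ger": "de", "german": "de",
    "fre": "fr", "french": "fr",
    "rus": "ru", "russian": "ru",
    "ita": "it", "italian": "it",
    "ara": "ar", "arabic": "ar",
    "eng": "en", "english": "en",
    "hin": "hi", "hindi": "hi",
    "tel": "te", "telugu": "te",
}


def normalize_label(label: str) -> list[str]:
    """Split a raw label on ';' and map each part through ALIASES."""
    parts = [p.strip() for p in label.split(";") if p.strip()]
    return [ALIASES.get(p.lower(), p) for p in parts]


def merge_counts(raw_counts: dict[str, int]) -> Counter:
    """Sum page counts per normalized code (single-language labels only)."""
    merged = Counter()
    for label, pages in raw_counts.items():
        codes = normalize_label(label)
        if len(codes) == 1:
            merged[codes[0]] += pages
    return merged


# A few rows copied from the table.
rows = {"ger": 4412, "French": 3279, "fre": 1279, "German": 839,
        "English; Hindi": 1004}
merged = merge_counts(rows)
print(merged["de"], merged["fr"])  # 5251 4558
print(normalize_label("English; Hindi"))  # ['en', 'hi']
```

Multi-language labels are left out of the merged totals in this sketch because the card does not say how their pages divide between languages.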
## VQA pair languages
| Language | Pages |
|---|---|
| fr | 2,926 |
| ar | 2,923 |
| es | 2,922 |
| zh | 2,917 |
| pt | 2,901 |
| de | 2,899 |
| en | 2,896 |
| hi | 2,894 |
| ja | 2,892 |
| it | 2,869 |
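
The near-uniform counts in this table are consistent with round-robin assignment. The following is a minimal sketch of such a scheme, assuming pages are indexed from 0 and languages are cycled in the order listed in the front matter. The function name and the ordering are assumptions, not the pipeline that produced the dataset. A strict round-robin over 29,039 pages would give 2,904 or 2,903 pages per language; the table's counts deviate slightly from that, for reasons the card does not state.

```python
from collections import Counter

# Target languages in the order listed in the card's front matter.
QA_LANGUAGES = ["ar", "de", "en", "es", "fr", "hi", "it", "ja", "pt", "zh"]


def assign_qa_language(page_index: int) -> str:
    """Pick a QA language by cycling through QA_LANGUAGES."""
    return QA_LANGUAGES[page_index % len(QA_LANGUAGES)]


counts = Counter(assign_qa_language(i) for i in range(29039))
print(counts["ar"], counts["zh"])  # 2904 2903
```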
## License distribution
| License | Pages |
|---|---|
| public-domain | 27,733 |
| other | 1,306 |
## Licensing
All source magazines are in the public domain or under Creative Commons licenses, as published on archive.org. The `license` and `license_url` columns preserve the original per-item license. Please respect it when using individual records.
The dataset compilation and annotations are released under **CC-BY-4.0**.
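
One way to honor per-item licenses is to partition records by their `license` label before use. The following is a minimal sketch, assuming records are dicts carrying the `license` field. The helper names, the default allow-list, and the sample rows are hypothetical, and which labels are acceptable depends on the use case.

```python
from collections import defaultdict


def group_by_license(records):
    """Map each license label to the list of records carrying it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["license"]].append(rec)
    return dict(groups)


def keep_allowed(records, allowed=("public-domain",)):
    """Keep only records whose license label is in the allow-list."""
    return [rec for rec in records if rec["license"] in allowed]


# Hypothetical rows for illustration.
sample = [
    {"archive_identifier": "example-a", "license": "public-domain"},
    {"archive_identifier": "example-b", "license": "cc-by-sa"},
]
print(sorted(group_by_license(sample)))
print([r["archive_identifier"] for r in keep_allowed(sample)])
```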
## Citation
If you use this dataset, please credit both archive.org as the source of the magazine images and this dataset compilation.
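
The card does not prescribe a citation format. As one possible sketch, a per-record credit line could be assembled from the provenance fields. The wording of the line, the `credit_line` helper, and the sample record are hypothetical.

```python
def credit_line(rec: dict) -> str:
    """Build a short attribution string from a record's provenance fields."""
    title = rec.get("source_title") or "Untitled"
    date = rec.get("publication_date") or "n.d."
    return (f"{title} ({date}), via archive.org: {rec['archive_url']} "
            f"[{rec['license']}]; from the Magazines Multilingual VQA dataset "
            f"(Reubencf/magazines-multilingual-vqa).")


# Hypothetical record for illustration.
rec = {
    "source_title": "Example Magazine",
    "publication_date": "1923-05",
    "archive_url": "https://archive.org/details/example-item",
    "license": "public-domain",
}
print(credit_line(rec))
```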
## Intended use
Training and evaluation of multilingual multimodal models for:
- Document OCR
- Visual question answering
- Layout understanding
- Multilingual document comprehension
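
For the evaluation use case, a simple baseline metric is normalized exact match between a model's answer and the `answer` field. The following is a minimal sketch, assuming predictions are plain strings aligned with the reference answers. This is one common choice rather than a metric defined by the dataset card, and free-form answers in ten languages will often need a softer measure.

```python
import unicodedata


def normalize_answer(text: str) -> str:
    """NFKC-normalize, casefold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(normalize_answer(p) == normalize_answer(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


score = exact_match(["  Le Figaro ", "1923"], ["le figaro", "1932"])
print(score)  # 0.5
```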
Provider: Reubencf



