
Reubencf/magazines-multilingual-vqa

Source: Hugging Face · Updated: 2026-04-17 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Reubencf/magazines-multilingual-vqa
Resource description:
---
language:
- ar
- de
- en
- es
- fr
- hi
- it
- ja
- pt
- zh
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-to-text
pretty_name: Magazines Multilingual VQA
size_categories:
- 10K<n<100K
tags:
- ocr
- vqa
- multilingual
- document-ai
- magazines
---

# Magazines Multilingual VQA

A multilingual Visual Question Answering dataset built from 29,039 public-domain magazine and newspaper pages sourced from [archive.org](https://archive.org). Each page has:

- **Verbatim OCR** in the page's native language
- **English description, page type, scan-quality metadata**
- **One grounded VQA pair** in one of 10 target languages (round-robin assigned)
- **Full provenance and license** carried from the source archive.org record

Annotations generated with Google Gemma 4 31B via vLLM.

## Dataset fields

| Field | Description |
|---|---|
| `image` | Page image (JPEG) |
| `ocr_text` | Verbatim text extraction in native language |
| `page_language` | ISO-639-1 code of the page's language |
| `description` | 2-3 sentence English description |
| `page_type` | article / cover / advertisement / index / ... |
| `nsfw_flag` | Safety flag (almost always false) |
| `scan_quality` | clean / readable / degraded / illegible |
| `ocr_confidence` | Model's self-assessed OCR accuracy |
| `qa_language` | Language of the QA pair (round-robin) |
| `question` | Question about the page in `qa_language` |
| `answer` | Grounded answer in `qa_language` |
| `question_type` | ocr / visual / comprehension / entity / layout / counting / inference |
| `difficulty` | easy / medium / hard |
| `source_title` | Magazine/newspaper title |
| `publication_date` | Issue date |
| `creator` | Author / creator (if available) |
| `publisher` | Publisher (if available) |
| `archive_identifier` | archive.org identifier |
| `archive_url` | `https://archive.org/details/<identifier>` |
| `license` | Short license label (e.g. public-domain, cc-by, cc-by-sa) |
| `license_url` | Full license URL from archive.org |
| `subject` | Archive subject tags |
| `category` | Scraping category (language + media type) |

## Pages by source language

| Language | Pages |
|---|---|
| ger | 4,412 |
| French | 3,279 |
| rus | 2,762 |
| por | 2,047 |
| vie | 1,637 |
| ben | 1,598 |
| fre | 1,279 |
| mai | 1,026 |
| English; Hindi | 1,004 |
| German | 839 |
| afr | 826 |
| jpn | 580 |
| hin | 521 |
| arabic | 512 |
| urd | 494 |
| tam | 482 |
| ita | 464 |
| guj | 445 |
| per | 419 |
| nep | 391 |
| tel | 384 |
| tur | 379 |
| chi | 347 |
| ind | 347 |
| pan | 334 |
| dut | 286 |
| mal | 209 |
| English; Kannada | 186 |
| russian | 173 |
| ara | 171 |
| tha | 154 |
| amh | 123 |
| yid | 108 |
| Italian | 106 |
| sat | 85 |
| Albanian; Azerbijani; Brahui; Chichewa; Chinese; English; French; German; Greek; Indonesian; Kashmiri; Kazakh; Korean; Macedonian; Malayalam; Myanmar; PersianFarsi; Philipines; Portugese; Russian; Somali; Spanish; Swedish; Thai; Tamil; Turkish; Urdu; Vietnamese; Yoroba | 71 |
| tel; English | 64 |
| ori | 61 |
| snd | 57 |
| fas | 46 |
| eng; ara; per | 44 |
| Tamil; English | 43 |
| kan | 40 |
| mar | 31 |
| Chinese | 30 |
| spa | 29 |
| ind; Indonesian | 27 |
| som | 25 |
| Telugu | 22 |
| Russian | 21 |
| Arabic | 16 |
| English; Malayalam | 2 |
| Dutch | 1 |

## VQA pair languages

| Language | Pages |
|---|---|
| fr | 2,926 |
| ar | 2,923 |
| es | 2,922 |
| zh | 2,917 |
| pt | 2,901 |
| de | 2,899 |
| en | 2,896 |
| hi | 2,894 |
| ja | 2,892 |
| it | 2,869 |

## License distribution

| License | Pages |
|---|---|
| public-domain | 27,733 |
| other | 1,306 |

## Licensing

All source magazines are in the public domain or under Creative Commons licenses, as published on archive.org. The `license` and `license_url` columns preserve the original per-item license; please respect them when using individual records. The dataset compilation and annotations are released under **CC-BY-4.0**.
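
One way to respect the per-item licenses is to tally and filter records by the `license` column before use. The sketch below is illustrative only: it assumes each record is a plain mapping keyed by the field names in the card, and that `public-domain` appears in the column exactly as written in the field table. The helper names (`license_counts`, `filter_by_license`, `attribution_line`) are hypothetical and not part of the dataset.

```python
from collections import Counter
from typing import Iterable, Iterator, Mapping

# Assumption: the label is spelled exactly as in the card's field table.
DEFAULT_ALLOWED = frozenset({"public-domain"})


def license_counts(records: Iterable[Mapping[str, object]]) -> Counter:
    """Tally records per license label; missing or empty labels count as 'unknown'."""
    return Counter(str(r.get("license") or "unknown") for r in records)


def filter_by_license(
    records: Iterable[Mapping[str, object]],
    allowed: frozenset = DEFAULT_ALLOWED,
) -> Iterator[Mapping[str, object]]:
    """Yield only the records whose license label is in `allowed`."""
    for record in records:
        if record.get("license") in allowed:
            yield record


def attribution_line(record: Mapping[str, object]) -> str:
    """Build a short credit string from the provenance fields of one record."""
    title = record.get("source_title") or "Untitled"
    parts = [
        f"{title}, {record.get('archive_url')}",
        f"license: {record.get('license') or 'unknown'}",
    ]
    # Append the full license URL only when the record carries one.
    if record.get("license_url"):
        parts.append(str(record["license_url"]))
    return "; ".join(parts)
```

Records under a Creative Commons label could be kept by passing a wider `allowed` set, with `attribution_line` supplying the credit text for each one.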

## Citation

If you use this dataset, please credit both archive.org as the source of the magazine images and this dataset compilation.

## Intended use

Training and evaluation of multilingual multimodal models for:

- Document OCR
- Visual question answering
- Layout understanding
- Multilingual document comprehension
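
For the evaluation use case, a per-language breakdown over `qa_language` can be sketched as follows. This is a minimal sketch, not a method prescribed by the dataset: it assumes records are mappings with `qa_language`, `question`, and `answer` keys, it uses normalized exact match only as a simple stand-in metric, and `predict` is a hypothetical callable wrapping whatever model is being tested.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.casefold().split())


def exact_match_by_language(
    records: Iterable[Mapping[str, str]],
    predict: Callable[[Mapping[str, str]], str],
) -> Dict[str, float]:
    """Return the exact-match accuracy of `predict` for each `qa_language`."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        lang = record["qa_language"]
        totals[lang] += 1
        # A prediction counts only if it equals the reference after normalization.
        if normalize(predict(record)) == normalize(record["answer"]):
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

With the Hugging Face `datasets` library, the records could plausibly be obtained with `load_dataset("Reubencf/magazines-multilingual-vqa", split="train")`; the split name is an assumption, as the card does not list splits. Exact match is strict for free-form answers, so a softer metric may suit the `comprehension` and `inference` question types better.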
Provider:
Reubencf