cilabuniba/wikifragments-visual-arts-embeds
Source: Hugging Face. Updated 2026-02-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cilabuniba/wikifragments-visual-arts-embeds
Description:
---
license: cc-by-sa-4.0
language:
- en
task_categories:
- text-generation
- visual-document-retrieval
- visual-question-answering
- sentence-similarity
tags:
- retrieval-augmented-generation
- RAG
- multimodal
pretty_name: WikiFragments - Visual Arts Pages with Fragments
size_categories:
- 1M<n<10M
---
# WikiFragments - Visual Arts Pages with Fragments (WikiFragmentsVA)
**WikiFragmentsVA** is a domain-specific multimodal dataset focused on the visual arts, derived from [Wikipedia (en)](https://en.wikipedia.org/). It consists of textual paragraphs paired with related images (infoboxes and thumbnails), rendered as unified visual fragments. This dataset extends the base WikiFragments project by providing pre-rendered fragment images and multi-vector embeddings obtained via [ColQwen2 v1.0](https://huggingface.co/vidore/colqwen2-v1.0), including optimized pooled representations for efficient retrieval.

*Example of a rendered fragment with multiple images and captions.*
## Dataset Details
### Dataset Description
WikiFragmentsVA is a specialized subset of the [WikiFragments](https://huggingface.co/datasets/cilabuniba/WikiFragments) dataset, curated to cover the Visual Arts domain. To construct this dataset, we recursively navigated Wikipedia categories starting from "Category:Visual arts" and descending to a depth of up to 5 levels.
A **multimodal fragment** is defined as an atomic knowledge unit consisting of a paragraph from a Wikipedia page and all images that, in the page’s source code, appear above that paragraph. For this dataset, each fragment is rendered into a single image resembling a document layout (images/captions in a grid at the top, paragraph at the bottom) and encoded into multi-vector representations using ColQwen2.
- **Curated by:** Nicola Fanelli (PhD Student @ University of Bari Aldo Moro, Italy)
- **Language(s) (NLP):** English
### License
- **Code**: MIT License.
- **Text Data**: The Wikipedia text is licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). When using this dataset, you must provide proper attribution to Wikipedia and its contributors and share any derivatives under the same license.
- **Images**: Images are sourced from Wikipedia and Wikimedia Commons. Each image is subject to its own license, which is typically indicated on its original page. Users of this dataset are responsible for ensuring they comply with the licensing terms of individual images. For ease of use, we provide the license and attribution information for each image in the dataset, along with the corresponding URLs to download them at the resolution available on Wikipedia.
### Dataset Sources
- **Repository:** [ArtSeek Official GitHub](https://github.com/cilabuniba/artseek)
- **Paper:** [ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval](https://arxiv.org/abs/2507.21917)
- **Full Dataset:** [WikiFragments (Full)](https://huggingface.co/datasets/cilabuniba/wikifragments)
## Uses
### Direct Use
This dataset is designed for **multimodal retrieval-augmented generation (RAG)** in the visual arts domain. It supports:
- **Two-stage retrieval:** Using optimized pooled embeddings for fast initial filtering and full multi-vector embeddings for late-interaction re-ranking.
- **Multimodal grounding:** Providing rendered visual context to MLLMs for answering complex questions about art history, styles, and artists.
- **Visual Document Retrieval:** Evaluating models on their ability to retrieve documents based on visual and textual alignment.
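The two-stage retrieval use case can be sketched roughly as follows. This is a minimal illustration, not the ArtSeek implementation: the function names, the `n_candidates` and `top_k` parameters, and the random stand-in data are all assumptions. The first stage scores every document with its fixed-size pooled embeddings, and the second stage re-ranks the shortlisted candidates with their full multi-vector embeddings.

```python
import numpy as np

def maxsim(query, doc):
    """Late-interaction score between one query and one document."""
    return float((query @ doc.T).max(axis=1).sum())

def two_stage_search(query, pooled, full, n_candidates=100, top_k=10):
    """Hypothetical two-stage search.

    query:  (n_query_vectors, dim)
    pooled: (n_docs, n_pooled_vectors, dim), fixed size per document
    full:   list of (n_doc_vectors_i, dim) arrays, variable size per document
    """
    # Stage 1: coarse MaxSim against pooled vectors, vectorized over all documents.
    sim = np.einsum("qd,npd->nqp", query, pooled)
    coarse = sim.max(axis=2).sum(axis=1)
    candidates = np.argsort(-coarse)[:n_candidates]
    # Stage 2: exact MaxSim against full embeddings, candidates only.
    reranked = sorted(
        ((int(i), maxsim(query, full[i])) for i in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_k]

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
full = [rng.normal(size=(int(rng.integers(10, 20)), 32)) for _ in range(50)]
pooled = np.stack([doc[:9] for doc in full])  # stand-in for real pooled embeddings
query = full[7][:4]                           # query built from document 7's vectors
results = two_stage_search(query, pooled, full, n_candidates=20, top_k=5)
```

Because the pooled representation has the same number of vectors for every document, the first stage can be stacked into a single dense array, which is what makes the coarse pass cheap.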
### Out-of-Scope Use
- Real-time systems (the dataset is a static snapshot).
- Commercial use without verifying individual image licenses via Wikimedia Commons.
- High-stakes factual applications where real-time verification is required.
## Dataset Structure
Each data point represents a **multimodal fragment** with the following fields:
- `id`: Unique identifier.
- `title`: Wikipedia page title.
- `text`: Cleaned paragraph text.
- `url`: Wikipedia page URL.
- `images`: Struct containing image PIL objects, captions, licenses, and metadata.
- `fragment`: The fragment rendered as a stand-alone image (grid of images + paragraph).
- `full_embeddings`: Multi-vector embeddings from ColQwen2 v1.0.
- `pooled_embeddings`: Compressed 9-vector representations (special token centroid + 8 content centroids).
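A hedged sketch of how these fields might be read with the Hugging Face `datasets` library is shown below. The split name `"train"` and the assumption that the embedding columns decode to nested lists of floats are guesses, not confirmed by this card; `stream_fragments` and `to_matrix` are hypothetical helper names.

```python
import numpy as np

REPO_ID = "cilabuniba/wikifragments-visual-arts-embeds"

def stream_fragments(split="train"):
    """Stream records lazily instead of downloading the whole dataset.

    The split name is an assumption; check the dataset viewer for the real one.
    """
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`
    return load_dataset(REPO_ID, split=split, streaming=True)

def to_matrix(vectors):
    """Convert a nested list of floats into a (n_vectors, dim) float32 array."""
    return np.asarray(vectors, dtype=np.float32)

# Example usage (needs network access):
# record = next(iter(stream_fragments()))
# pooled = to_matrix(record["pooled_embeddings"])
# full = to_matrix(record["full_embeddings"])
```

Streaming is used here because the dataset falls in the 1M to 10M size category, so a full download may be impractical for a quick inspection.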
### Embedding Methodology
We use **ColQwen2**, which follows a late interaction architecture. A query \\(q\\) and a fragment (document) \\(d\\) are encoded into multi-vector representations \\(E_q\\) and \\(E_d\\), respectively. The relevance score \\(S_{q,d}\\) is computed as:
$$S_{q,d} = \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}$$
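As a minimal NumPy sketch of this score (the function name and the toy vectors are illustrative assumptions), each query vector is matched with its best document vector by dot product, and the maxima are summed:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Sum over query vectors of the maximum dot product with any document vector."""
    sim = query_emb @ doc_emb.T          # shape: (n_query_vectors, n_doc_vectors)
    return float(sim.max(axis=1).sum())

# Toy 2-dimensional embeddings, for illustration only.
E_q = np.array([[1.0, 0.0], [0.0, 1.0]])
E_d = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]])
score = maxsim_score(E_q, E_d)  # 1.0 (first query vector) + 0.5 (second) = 1.5
```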
To handle the memory footprint of storing millions of multi-vectors, we implement a **token pooling strategy**. The document embedding sequence is partitioned into:
- \\(E_d^{pref}\\): Prefix special tokens.
- \\(E_d^{content}\\): Visual and textual content embeddings.
- \\(E_d^{suff}\\): Suffix special tokens.
The pooled representation \\(E_d^{pool}\\) is then computed as:
$$E_d^{pool} = c(E_d^{pref} \cup E_d^{suff}) \oplus C_d$$
where \\(c(\cdot)\\) represents the centroid of the special tokens and \\(C_d\\) represents \\(K\\) centroids (with \\(K=8\\)) obtained by performing hierarchical clustering on the content embeddings.
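The pooling step could be sketched as below. The card does not state the linkage criterion, the distance metric, or how many prefix and suffix tokens there are, so this sketch assumes centroid linkage with Euclidean distance and takes the prefix and suffix lengths as hypothetical parameters. It is a naive implementation meant to show the idea, not an efficient one.

```python
import numpy as np

def content_centroids(content: np.ndarray, k: int) -> np.ndarray:
    """Agglomerative clustering (assumed: centroid linkage, Euclidean distance).

    Repeatedly merges the two clusters with the closest centroids until k remain.
    Returns fewer than k centroids if there are fewer than k content vectors.
    """
    clusters = [[i] for i in range(len(content))]
    while len(clusters) > k:
        cents = np.stack([content[c].mean(axis=0) for c in clusters])
        dist = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)  # ignore self-distances
        a, b = np.unravel_index(np.argmin(dist), dist.shape)
        a, b = sorted((int(a), int(b)))
        clusters[a].extend(clusters[b])
        del clusters[b]
    return np.stack([content[c].mean(axis=0) for c in clusters])

def pool_document(doc_emb: np.ndarray, n_prefix: int, n_suffix: int, k: int = 8) -> np.ndarray:
    """One special-token centroid followed by k content centroids.

    Assumes at least one special token (n_prefix + n_suffix >= 1).
    """
    n = len(doc_emb)
    prefix, suffix = doc_emb[:n_prefix], doc_emb[n - n_suffix:]
    content = doc_emb[n_prefix:n - n_suffix]
    special = np.concatenate([prefix, suffix]).mean(axis=0, keepdims=True)
    return np.concatenate([special, content_centroids(content, k)])

# Toy document: 40 random vectors with 3 prefix and 2 suffix tokens (made-up counts).
rng = np.random.default_rng(0)
doc = rng.normal(size=(40, 16))
pooled = pool_document(doc, n_prefix=3, n_suffix=2)  # shape (9, 16)
```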
## Dataset Creation
### Curation Rationale
The dataset was created to facilitate "ArtSeek," a framework for deep artwork understanding. By focusing on the Visual Arts domain, we provide a high-quality benchmark for evaluating retrieval and reasoning capabilities in a knowledge-rich field where visual and textual context are inseparable.
### Source Data
The data is sourced from the English Wikipedia (August 2024 snapshot) and the Kiwix full Wikipedia ZIM dump (January 2024).
#### Data Collection and Processing
1. **Filtering:** Pages were selected by recursively descending from "Category:Visual arts" to a depth of 5.
2. **Extraction:** Paragraphs and images were extracted using a modified `wikiextractor`.
3. **Rendering:** Fragments were rendered into images using the `FragmentCreator` tool, placing images in a grid above the text.
4. **Embedding:** We extracted embeddings with ColQwen2 and applied the clustering-based token pooling described under Embedding Methodology to build efficient retrieval indices.
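The filtering step can be illustrated with a depth-limited breadth-first traversal. This is only a sketch: the in-memory `subcategories` and `pages` mappings and the toy category graph are hypothetical, and treating the root category as depth 0 is an assumption about how depth was counted.

```python
from collections import deque

def collect_pages(root, subcategories, pages, max_depth=5):
    """Collect pages from `root` and its subcategories down to `max_depth` levels.

    subcategories: dict mapping a category to its child categories
    pages:         dict mapping a category to the page titles it contains
    """
    seen = {root}                 # guards against cycles in the category graph
    queue = deque([(root, 0)])
    selected = set()
    while queue:
        category, depth = queue.popleft()
        selected.update(pages.get(category, ()))
        if depth == max_depth:
            continue              # do not descend further
        for child in subcategories.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return selected

# Toy category graph with a cycle (illustrative names only).
toy_subcategories = {
    "Category:Visual arts": ["Category:Painting"],
    "Category:Painting": ["Category:Visual arts", "Category:Portraits"],
}
toy_pages = {
    "Category:Visual arts": ["Art"],
    "Category:Painting": ["Mona Lisa"],
    "Category:Portraits": ["Portrait"],
}
shallow = collect_pages("Category:Visual arts", toy_subcategories, toy_pages, max_depth=1)
deep = collect_pages("Category:Visual arts", toy_subcategories, toy_pages, max_depth=5)
```

The `seen` set matters because Wikipedia's category graph is not a strict tree, so a naive recursion could revisit categories or loop.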
#### Who are the source data producers?
Text was authored by Wikipedia contributors. Images were contributed to Wikimedia Commons by various users and are subject to individual licenses.
### Annotations
There are no manual annotations beyond the original captions associated with images from Wikipedia pages.
#### Annotation process
N/A.
#### Who are the annotators?
N/A.
#### Personal and Sensitive Information
The dataset is derived from public Wikipedia data and is not expected to contain sensitive personal information.
## Bias, Risks, and Limitations
- **Coverage Bias:** Inherits biases present in English Wikipedia regarding art history (e.g., potential Western-centric focus).
- **Temporal Limitation:** Reflects a snapshot in time.
- **Image Quality:** Uses the lower-resolution, web-optimized images from the Kiwix dump.
### Recommendations
Users should be aware of the biases inherited from Wikipedia contributors and editorial processes. Before redistributing any images, verify their licenses via the provided URLs.
## Citation
**BibTeX:**
```bibtex
@article{fanelli2025artseek,
title={ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval},
author={Fanelli, Nicola and Vessio, Gennaro and Castellano, Giovanna},
journal={arXiv preprint arXiv:2507.21917},
year={2025}
}
```
**APA:**
Fanelli, N., Vessio, G., & Castellano, G. (2025). ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval. arXiv preprint arXiv:2507.21917.
## Glossary
* **Late Interaction:** A retrieval mechanism that scores a query-document pair by summing, over the query token embeddings, the maximum dot product with any document token embedding.
* **Token Pooling:** A technique to reduce the number of vectors stored per document by clustering embeddings into a fixed set of centroids.
## Dataset Card Authors
Nicola Fanelli
## Dataset Card Contact
For questions, please contact: **nicola.fanelli@uniba.it**
Provider:
cilabuniba



