Paintings Gemma-Enriched Dataset. Fotothek - Bibliotheca Hertziana
收藏DataCite Commons2025-12-11 更新2026-05-04 收录
下载链接:
https://edmond.mpg.de/citation?persistentId=doi:10.17617/3.Z8W2JR
下载链接
链接失效反馈官方服务:
资源简介:
<h1
id="gemälde-dataset---ai-enhanced-art-historical-descriptions">Gemälde
Dataset - AI-Enhanced Art Historical Descriptions</h1>
<p>This dataset contains 224x224 images and relative metadata extracted
from the MIDAS XML of the Catalogue of the Photographic Collection of
the Bibliotheca Hertziana enriched with AI-generated prose texts. The
dataset is limited to photographs of objects classified as painting
(Gemälde), and has been processed using <a
href="https://huggingface.co/google/gemma-2-9b-it">Google Gemma 2 9B
Instruct</a> large language model on the <a
href="https://docs.hpc.gwdg.de/">KISSKI HPC cluster</a> of the GWDG.
Scripts to process the data on KISSKI have been elaborated with Claude
Code in Virtual Studio Code.</p>
<hr />
<h2 id="dataset-overview">Dataset Overview</h2>
<p><strong>Source Data:</strong></p>
<ul>
<li>Original dataset: <code>gemalde.tsv</code> (19,051 rows)</li>
<li>Extracted from: MIDAS XML format (<code>combined.xml</code>)</li>
<li>Institution: <a href="https://www.biblhertz.it/">Photographic
Collection. Bibliotheca Hertziana - Max Planck Institute for Art
History</a></li>
<li>Photographic Collection Catalogue: <a
href="https://foto.biblhertz.it/">Fotothek der Bibliotheca
Hertziana</a></li>
</ul>
<p><strong>Output:</strong></p>
<ul>
<li><strong>Enriched metadata:</strong> TSV files with AI-generated
German and English descriptions</li>
<li><strong>224x224 images downloaded from IIIF Image Api of the
Photographic Collection</strong></li>
</ul>
<hr />
<h2 id="processing-pipeline">Processing Pipeline</h2>
<h3 id="data-extraction">1. Data Extraction</h3>
<p>Source data was extracted with gemalde.xql from <a
href="https://edmond.mpg.de/dataset.xhtml?persistentId=doi:10.17617/3.8GPSDJ">MIDAS
XML format combined.xml</a> containing structured art historical
metadata including:</p>
<ul>
<li>Object titles and descriptions (<code>textobj</code>,
<code>textfoto</code>)</li>
<li>Artist information (<code>aob30</code>)</li>
<li>Location data (<code>aob26</code>, <code>aob28</code>)</li>
<li>Dating and provenance</li>
<li>Image references (<code>a8540</code>)</li>
</ul>
<h4 id="images-download">Images Download</h4>
<p>224x224 images downloaded in advance from the IIIF Service based on
<code>gemalde.tsv</code>. The script processing for AI Text Enrichment
from the metadata checks that the image has been downloaded, so the
output data has a 100% certainty of having a matching image. 17,657
images downloaded from 19,051 rows. This is due to known missing digital
images. The dataset corresponds to published data and each row contains
the licence and accessibility of the single image, date of creation and
last update of the catalogue object.</p>
<h3 id="ai-text-generation">2. AI Text Generation</h3>
<p><strong>Model Used:</strong></p>
<ul>
<li><strong>Name:</strong> <a
href="https://huggingface.co/google/gemma-2-9b-it">Google Gemma 2 9B
Instruct</a></li>
<li><strong>Parameters:</strong> 9 billion</li>
<li><strong>Quantization:</strong> FP16 (no quantization)</li>
<li><strong>Context window:</strong> 8,192 tokens</li>
<li><strong>License:</strong> Gemma Terms of Use</li>
</ul>
<p><strong>Processing Workflow:</strong></p>
<ol type="1">
<li><strong>Input cleaning:</strong> Removal of numeric codes,
normalization of Unicode characters</li>
<li><strong>Paragraph generation:</strong> German text from structured
metadata</li>
<li><strong>Translation:</strong> German → English</li>
<li><strong>Categories processed:</strong>
<ul>
<li><code>paragraph foto DE/EN</code> - Photograph description</li>
<li><code>paragraph obj DE/EN</code> - Object/artwork description</li>
<li><code>paragraph verwalter DE/EN</code> - Collection/custodian
information</li>
<li><code>paragraph standort DE/EN</code> - Location information</li>
</ul></li>
</ol>
<hr />
<h2 id="ai-prompts-used">AI Prompts Used</h2>
<h3 id="paragraph-generation-prompt">Paragraph Generation Prompt</h3>
<pre><code>Convert the following structured information into a coherent text in German.
The text contains field data that should be transformed into flowing prose while preserving all information.
IMPORTANT:
- Write a MAXIMUM of 2 paragraphs
- Do NOT include any URLs or web links
- Do NOT include reference codes or numerical codes
- Do NOT add any comments or explanations
- Only output the paragraph text itself
Field: {field_name}
Text: {cleaned_text}
German text (maximum 2 paragraphs):</code></pre>
<p><strong>Example Input:</strong></p>
<pre><code>Field: textobj
Text: Bildnis Filippo Neri Hl. Filippo Neri geboren 1515 Florenz gestorben 1595
Rom Priester Ordensgründer Gründer Oratorium Kongregation des Oratoriums</code></pre>
<p><strong>Example Output:</strong></p>
<pre><code>Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester
und bedeutender Ordensgründer. Er gründete das Oratorium und die Kongregation des
Oratoriums, die bis heute eine wichtige Rolle in der katholischen Kirche spielen.</code></pre>
<h3 id="translation-prompt">Translation Prompt</h3>
<pre><code>Translate the following German text to English.
Preserve the meaning and style as much as possible.
IMPORTANT:
- Do NOT include any URLs or web links in the translation
- Do NOT include reference codes starting with &quot;bh&quot; followed by numbers
- Do NOT include numerical codes like 08012353
- Do NOT add any comments or explanations
- Only output the translated text itself
German text: {text}
English translation:</code></pre>
<p><strong>Example Translation:</strong></p>
<pre><code>Input (DE): Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom,
war ein Priester und bedeutender Ordensgründer.
Output (EN): Filippo Neri, born 1515 in Florence and died 1595 in Rome, was
a priest and important founder of a religious order.</code></pre>
<hr />
<h2 id="kisski-cluster-resources">KISSKI Cluster Resources</h2>
<h3 id="hardware-configuration">Hardware Configuration</h3>
<p><strong>GPU:</strong> NVIDIA A100 (80GB VRAM)</p>
<ul>
<li>Architecture: Ampere</li>
<li>Tensor Cores: 432</li>
<li>FP16 Performance: ~312 TFLOPS</li>
<li>Memory Bandwidth: 2 TB/s</li>
</ul>
<p><strong>Allocation per job:</strong></p>
<ul>
<li>GPUs: 1× A100</li>
<li>CPUs: 4 cores</li>
<li>RAM: 64 GB</li>
<li>Time limit: 6 hours per job</li>
</ul>
<h3 id="job-array-configuration">Job Array Configuration</h3>
<p><strong>Array setup:</strong></p>
<ul>
<li><strong>Total jobs:</strong> 38 (indices 0-37)</li>
<li><strong>Chunk size:</strong> 500 rows per job</li>
<li><strong>Parallel jobs:</strong> 10 simultaneous</li>
<li><strong>Total rows processed:</strong> 19,000 (rows 0-18,999)</li>
</ul>
<h3 id="performance-metrics">Performance Metrics</h3>
<p><strong>AI operations per row:</strong></p>
<ul>
<li>4 paragraph generations (foto, obj, verwalter, standort)</li>
<li>4 translations (DE → EN)</li>
<li><strong>Total:</strong> 8 LLM inference calls per row</li>
</ul>
<p><strong>Resource consumption:</strong></p>
<ul>
<li><strong>GPU hours:</strong> ~125 GPU hours total (38 jobs × 3.3
hours)</li>
<li><strong>Model size in memory:</strong> ~18 GB (FP16)</li>
<li><strong>Peak VRAM usage:</strong> ~25 GB per job</li>
</ul>
<hr />
<h2 id="output-structure">Output Structure</h2>
<pre><code>data_gemalde/
├── enriched_data/
│ ├── data_0-499.tsv # Rows 0-499
│ ├── data_500-999.tsv # Rows 500-999
│ ├── data_1000-1499.tsv # Rows 1000-1499
│ └── ...
├── images/
│ ├── {image_id_1}.jpg # IIIF thumbnail (224×224)
│ ├── {image_id_2}.jpg
│ └── ...
└── README.md # This file</code></pre>
<h3 id="output-fields">Output Fields</h3>
<p>Each TSV file contains the original metadata plus AI-generated
fields:</p>
<p><strong>Original fields:</strong> All fields from
<code>gemalde.tsv</code> including:</p>
<ul>
<li><code>a8540</code> - Image ID (BILDDATEI-NR.)</li>
<li><code>textobj</code> - Original object text</li>
<li><code>textfoto</code> - Original photo text</li>
<li><code>aob26</code>, <code>aob28</code>, <code>aob30</code> -
Relations</li>
<li>etc.</li>
</ul>
<p><strong>Generated fields:</strong></p>
<ul>
<li><code>paragraph foto DE</code> - German description of
photograph</li>
<li><code>paragraph foto EN</code> - English translation</li>
<li><code>paragraph obj DE</code> - German description of
object/artwork</li>
<li><code>paragraph obj EN</code> - English translation</li>
<li><code>paragraph verwalter DE</code> - German description of
collection</li>
<li><code>paragraph verwalter EN</code> - English translation</li>
<li><code>paragraph standort DE</code> - German description of
location</li>
<li><code>paragraph standort EN</code> - English translation</li>
</ul>
<hr />
<h2 id="technical-details">Technical Details</h2>
<h3 id="model-configuration">Model Configuration</h3>
<div class="sourceCode" id="cb7"><pre
class="sourceCode python"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>model <span class="op">=</span> AutoModelForCausalLM.from_pretrained(</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="st">&quot;google/gemma-2-9b-it&quot;</span>,</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> device_map<span class="op">=</span><span class="st">&quot;cuda&quot;</span>,</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a> torch_dtype<span class="op">=</span>torch.float16,</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a> local_files_only<span class="op">=</span><span class="va">True</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<h3 id="generation-parameters">Generation Parameters</h3>
<p><strong>Paragraph generation:</strong></p>
<ul>
<li><code>max_new_tokens</code>: 500</li>
<li><code>temperature</code>: 0.7</li>
<li><code>top_p</code>: 0.9</li>
<li><code>do_sample</code>: True</li>
</ul>
<p><strong>Translation:</strong></p>
<ul>
<li><code>max_new_tokens</code>: 500</li>
<li><code>temperature</code>: 0.3 (lower for more deterministic
translation)</li>
<li><code>top_p</code>: 0.9</li>
</ul>
<hr />
<h2 id="kisski-documentation">KISSKI Documentation</h2>
<ul>
<li><strong>Main documentation:</strong> https://docs.hpc.gwdg.de/</li>
<li><strong>GPU partitions:</strong>
https://docs.hpc.gwdg.de/how_to_use/compute_partitions/gpu_partitions/</li>
<li><strong>Account types:</strong>
https://docs.hpc.gwdg.de/start_here/account_types/</li>
</ul>
<hr />
<h2 id="data-usage-citation">Data Usage &amp; Citation</h2>
<p><strong>Source Institution:</strong> Bibliotheca Hertziana - Max
Planck Institute for Art History</p>
<ul>
<li>Website: https://www.biblhertz.it/</li>
<li>Fotothek: https://fotothek.biblhertz.it/</li>
</ul>
<p><strong>AI Processing:</strong></p>
<ul>
<li>Model: Google Gemma 2 9B Instruct</li>
<li>Infrastructure: KISSKI (GWDG Göttingen)</li>
<li>Processing date: November 2024</li>
</ul>
<p><strong>License:</strong> Please refer to the Bibliotheca Hertziana
for source data licensing terms.</p>
<hr />
<h2 id="quality-notes">Quality Notes</h2>
<ul>
<li>AI-generated texts are meant to enhance discoverability and
accessibility</li>
<li>Generated descriptions may contain inaccuracies or
interpretations</li>
<li>Always refer to original structured metadata (<code>textobj</code>,
<code>textfoto</code>) for authoritative information</li>
<li>Translations preserve meaning but may not capture all nuances of art
historical terminology</li>
</ul>
<hr />
<p><strong>Generated:</strong> November 2024 <strong>Processing
location:</strong> KISSKI HPC Cluster, GWDG Göttingen
<strong>Contact:</strong> pietro.liuzzo@biblhertz.it</p>
提供机构:
Edmond
创建时间:
2025-11-24



