Multimodal3DIdent
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7678230
Resource description:
This upload contains the Multimodal3DIdent dataset introduced in the paper "Identifiability Results for Multimodal Contrastive Learning", presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between the image and text modalities. The training, validation, and test sets contain 125000, 10000, and 10000 image/text pairs, respectively, together with their ground truth factors. The code for the data generation is publicly available: https://github.com/imantdaunhawer/Multimodal3DIdent.
Description
------------------
The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:
.
├── images
│ ├── 000000.png
│ ├── 000001.png
│ └── etc.
├── text
│ └── text_raw.txt
├── latents_image.csv
└── latents_text.csv
The directories images and text contain the generated image and text data, whereas the CSV files latents_image.csv and latents_text.csv contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file text_raw.txt is the sentence that corresponds to the first image in the images directory.
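The index-wise correspondence described above can be sketched as a small loader. This is a minimal illustration, not the authors' loading code: the function name `load_split` is hypothetical, and it assumes that both CSV files start with a header row and that sorting the zero-padded image file names reproduces the index order.

```python
import csv
from pathlib import Path


def load_split(split_dir):
    """Return a list of (image_path, sentence, image_latents, text_latents) tuples."""
    split_dir = Path(split_dir)
    # Assumption: sorting zero-padded names (000000.png, 000001.png, ...) gives index order.
    image_paths = sorted((split_dir / "images").glob("*.png"))
    with open(split_dir / "text" / "text_raw.txt", encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    # Assumption: each CSV file has a header row naming the latent factors.
    with open(split_dir / "latents_image.csv", newline="", encoding="utf-8") as f:
        image_latents = list(csv.DictReader(f))
    with open(split_dir / "latents_text.csv", newline="", encoding="utf-8") as f:
        text_latents = list(csv.DictReader(f))
    # The index-wise pairing is only meaningful if all four sources have equal length.
    n = len(image_paths)
    if not (len(sentences) == len(image_latents) == len(text_latents) == n):
        raise ValueError("images, sentences, and latent files differ in length")
    return list(zip(image_paths, sentences, image_latents, text_latents))
```

Calling `load_split("train")` on an unpacked split would then yield one tuple per sample, with the latent values returned as strings keyed by whatever column names the CSV headers contain.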
Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor.
| Modality | Latent Factor | Values | Details |
|----------|---------------|--------|---------|
| Image | Object shape | {0, 1, ..., 6} | Mapped to Blender shapes like "Teapot", "Hare", etc. |
| Image | Object x-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object y-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object z-position | {0} | Constant |
| Image | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object beta-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Spotlight position | [0, 1]-interval | Transformed to a unique position on a semicircle |
| Image | Spotlight color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Background color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Text | Object shape | {0, 1, ..., 6} | Mapped to strings like "teapot", "hare", etc. |
| Text | Object x-position | {0, 1, 2} | Mapped to strings "left", "center", "right" |
| Text | Object y-position | {0, 1, 2} | Mapped to strings "top", "mid", "bottom" |
| Text | Object color | string values | Color names from 3 different color palettes |
| Text | Text phrasing | {0, 1, ..., 4} | Mapped to 5 different English sentences |
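The image-side mappings listed for the latent factors can be sketched in a few lines. This is an illustrative reading of the listed mappings, not the authors' generation code: the function names are hypothetical, the rotation map is assumed to be the increasing linear map, and the saturation and brightness defaults of 1.0 are assumptions, since only the hue is specified.

```python
import colorsys
import math

# Discrete position indices {0, 1, 2} map to Blender coordinates {-3, 0, 3}.
POSITIONS = (-3.0, 0.0, 3.0)


def position_to_blender(index):
    """Map a position index in {0, 1, 2} to its Blender coordinate."""
    return POSITIONS[index]


def rotation_to_blender(value):
    """Linearly map a value in [0, 1] to an angle in [-pi/2, pi/2].

    Assumption: the map is increasing, so 0 -> -pi/2 and 1 -> pi/2.
    """
    return (value - 0.5) * math.pi


def hue_to_rgb(hue, saturation=1.0, brightness=1.0):
    """Convert an HSV hue in [0, 1] to an RGB triple.

    Assumption: saturation and brightness are fixed at 1.0.
    """
    return colorsys.hsv_to_rgb(hue, saturation, brightness)
```

For example, `rotation_to_blender(0.5)` gives an angle of zero, and `hue_to_rgb(0.0)` gives pure red.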
Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.
Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.
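The text generation step can be illustrated with a toy example. Everything concrete here is a placeholder: the four-color `PALETTE`, the two `PHRASES` templates, and the function names are hypothetical and stand in for the actual palettes and the five pre-configured phrases, which are not reproduced in this description. Squared Euclidean distance is assumed as the measure of closeness in RGB space.

```python
# Hypothetical toy palette: color name -> RGB triple in [0, 1].
PALETTE = {
    "red": (1.0, 0.0, 0.0),
    "lime": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "yellow": (1.0, 1.0, 0.0),
}

# Hypothetical templates; the dataset uses five pre-configured phrases.
PHRASES = (
    "A {color} {shape} at the {position} of the image.",
    "The {position} of the image shows a {shape} in {color}.",
)

X_NAMES = ("left", "center", "right")
Y_NAMES = ("top", "mid", "bottom")


def closest_color_name(rgb, palette=PALETTE):
    """Return the palette name whose RGB value is nearest to `rgb`."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))
    return min(palette, key=squared_distance)


def describe(shape, x_index, y_index, rgb, phrase_index, palette=PALETTE):
    """Fill one phrase template with the ground truth factors."""
    position = f"{Y_NAMES[y_index]}-{X_NAMES[x_index]}"
    return PHRASES[phrase_index].format(
        color=closest_color_name(rgb, palette), shape=shape, position=position
    )
```

With these placeholders, `describe("teapot", 0, 2, (0.9, 0.1, 0.0), 0)` produces "A red teapot at the bottom-left of the image."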
Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is “left”, we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.
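The dependence of object color on x-position can be sketched as a two-step sampler. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the assignment of the second and third hue intervals to "center" and "right" is an assumption extrapolated from the stated "left" case.

```python
import random


def hue_interval(x_index):
    """Return the hue interval for x-position index 0 (left), 1 (center), or 2 (right).

    Assumption: index k is associated with [k/3, (k+1)/3]; only the
    "left" -> [0, 1/3] case is stated explicitly.
    """
    return x_index / 3.0, (x_index + 1) / 3.0


def sample_x_and_hue(rng):
    """Sample the x-position uniformly, then the hue uniformly within its interval."""
    x_index = rng.randrange(3)
    low, high = hue_interval(x_index)
    return x_index, rng.uniform(low, high)


rng = random.Random(0)
samples = [sample_x_and_hue(rng) for _ in range(5)]
```

Because each hue interval belongs to exactly one x-position, a model that observes only the object color can recover the x-position up to the coarseness of the color representation, which is the dependence the paragraph describes.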
Acknowledgements
-------------------------------
The Multimodal3DIdent dataset builds on the following resources:
- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite
Created:
2023-03-29



