
Multimodal3DIdent

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/7678230
Resource overview:
This upload contains the Multimodal3DIdent dataset introduced in the paper Identifiability Results for Multimodal Contrastive Learning presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between image and text modalities. The training, validation, and test sets contain 125000, 10000, and 10000 image/text pairs and ground truth factors, respectively. The code for the data generation is publicly available: https://github.com/imantdaunhawer/Multimodal3DIdent.

Description
------------------

The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:

    .
    ├── images
    │   ├── 000000.png
    │   ├── 000001.png
    │   └── etc.
    ├── text
    │   └── text_raw.txt
    ├── latents_image.csv
    └── latents_text.csv

The directories images and text contain the generated image and text data, whereas the CSV files latents_image.csv and latents_text.csv contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file text_raw.txt is the sentence that corresponds to the first image in the images directory.

Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor.

| Modality | Latent Factor | Values | Details |
|----------|---------------|--------|---------|
| Image | Object shape | {0, 1, ..., 6} | Mapped to Blender shapes like "Teapot", "Hare", etc. |
| Image | Object x-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object y-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object z-position | {0} | Constant |
| Image | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object beta-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Spotlight position | [0, 1]-interval | Transformed to a unique position on a semicircle |
| Image | Spotlight color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Background color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Text | Object shape | {0, 1, ..., 6} | Mapped to strings like "teapot", "hare", etc. |
| Text | Object x-position | {0, 1, 2} | Mapped to strings "left", "center", "right" |
| Text | Object y-position | {0, 1, 2} | Mapped to strings "top", "mid", "bottom" |
| Text | Object color | string values | Color names from 3 different color palettes |
| Text | Text phrasing | {0, 1, ..., 4} | Mapped to 5 different English sentences |

Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.

Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.

Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is “left”, we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.

Acknowledgements
-------------------------------

The Multimodal3DIdent dataset builds on the following resources:

- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite
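To make the latent-factor mappings described above concrete, here is a minimal Python sketch of how the raw factor values could be turned into scene parameters. It is an illustration under stated assumptions and not the code from the official repository. All function names are hypothetical. Full saturation and value in the HSV-to-RGB conversion are assumed. The assignment of hue intervals to x-positions other than "left" (index 0, interval [0, 1/3]) is also an assumption.

```python
import colorsys
import math
import random
from typing import Tuple

# Hypothetical helpers: they follow the dataset description, not
# necessarily the implementation in the official repository.

BLENDER_POSITIONS = (-3.0, 0.0, 3.0)  # index {0, 1, 2} -> Blender coordinate


def position_to_blender(index: int) -> float:
    """Map a discrete position index in {0, 1, 2} to {-3, 0, 3}."""
    return BLENDER_POSITIONS[index]


def rotation_to_blender(value: float) -> float:
    """Linearly map a rotation factor in [0, 1] to [-pi/2, pi/2]."""
    return (value - 0.5) * math.pi


def hue_to_rgb(hue: float) -> Tuple[float, float, float]:
    """Convert a hue in [0, 1] to RGB (assumes saturation = value = 1)."""
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)


def sample_object_hue(x_index: int, rng: random.Random) -> float:
    """Sample the object hue from the third of [0, 1] tied to x_index.

    Index 0 ("left") uses [0, 1/3] as stated in the description; using
    [1/3, 2/3] and [2/3, 1] for indices 1 and 2 is an assumption.
    """
    low = x_index / 3.0
    return rng.uniform(low, low + 1.0 / 3.0)


rng = random.Random(0)
x_index = rng.randrange(3)             # x-position, uniform over {0, 1, 2}
hue = sample_object_hue(x_index, rng)  # object color depends on x-position
print(x_index, position_to_blender(x_index), hue_to_rgb(hue))
```

Because the hue interval is determined by the x-position, a model that recovers the object's position can narrow the object hue down to one third of the color wheel, which is the dependence noted under "Relation between modalities".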
Created:
2023-03-29