Scaled and Translated Image Recognition (STIR)
Indexed by NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/6578037
Resource description:
Paper: "Just a Matter of Scale? Reevaluating Scale Equivariance in Convolutional Neural Networks" (arXiv:2211.10288)
Code: taltstidl/scale-equivariant-cnn on GitHub, the official code for the paper above
While convolutions are known to be equivariant to (discrete) translations, scaling remains a challenge, and most image recognition networks are not invariant to changes in scale. To explore these effects, we have created the Scaled and Translated Image Recognition (STIR) dataset. This dataset contains objects of size \(s \in [17,64]\) pixels, each randomly placed in a \(64 \times 64\) pixel image.
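To make this construction concrete, the following is a minimal sketch of how an object patch could be pasted at a random position in a \(64 \times 64\) image. It is not the authors' generation code: the function name `place_object`, the square patch, the black background, and the uniform sampling of offsets are all assumptions made for illustration.

```python
import numpy as np


def place_object(patch, canvas_size=64, rng=None):
    """Paste a square patch at a uniformly random position on a black canvas.

    Hypothetical helper for illustration; not part of the STIR code base.
    Returns the canvas and the (left, top) offset of the patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    size = patch.shape[0]
    if patch.shape[:2] != (size, size) or size > canvas_size:
        raise ValueError("patch must be square and fit on the canvas")

    # Largest offset that still keeps the patch fully inside the canvas.
    max_offset = canvas_size - size
    left = int(rng.integers(0, max_offset + 1))
    top = int(rng.integers(0, max_offset + 1))

    # Trailing dimensions (e.g. color channels) are carried over from the patch.
    canvas = np.zeros((canvas_size, canvas_size) + patch.shape[2:], dtype=patch.dtype)
    canvas[top:top + size, left:left + size] = patch
    return canvas, (left, top)


# A white 32 x 32 square stands in for a rescaled object.
patch = np.full((32, 32), 255, dtype=np.uint8)
image, (left, top) = place_object(patch, rng=np.random.default_rng(0))
print(image.shape, left, top)
```

With this convention an object of size \(s\) can start at any offset from 0 to \(64 - s\), so an object of size 64 always fills the whole image.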
Using the dataset
Depending on which data you plan to use, download one or more of the following files. Data is stored in compressed .npz format and can be loaded with NumPy (see the documentation for numpy.load).
| File | Description |
| --- | --- |
| emoji.npz | Emoji vector icons rendered as white icons on a black background |
| mnist.npz | Classic MNIST handwritten digits rescaled to varying sizes |
| trafficsign.npz | Traffic signs from street imagery downscaled to varying sizes |
| aerial.npz | Objects in aerial imagery downscaled to varying sizes |
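As a minimal sketch of loading one of these files without the project's own dataset class, the snippet below reads every array of an .npz archive into a plain dictionary using `numpy.load`. The helper name `load_stir_arrays` and the path `data/emoji.npz` are assumptions. Whether the string arrays in the files require `allow_pickle=True` is not stated in the description, so the flag is exposed as a parameter and left at NumPy's default.

```python
import numpy as np


def load_stir_arrays(path, allow_pickle=False):
    """Read all arrays of an .npz archive into a dict keyed by array name.

    Hypothetical helper for illustration. Set allow_pickle=True only if
    loading fails on object arrays and you trust the file's origin.
    """
    with np.load(path, allow_pickle=allow_pickle) as archive:
        # Indexing the archive decompresses the array into memory.
        return {key: archive[key] for key in archive.files}


if __name__ == "__main__":
    # Assumed location of a downloaded file.
    arrays = load_stir_arrays("data/emoji.npz")
    for key, value in arrays.items():
        print(key, value.shape, value.dtype)
```

Reading everything eagerly keeps the example short; for the larger files it may be preferable to keep the archive open and index only the arrays that are needed.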
Each file contains multiple arrays that can be accessed in a dictionary-like fashion. The keys are documented below, where n is the number of classes for a given file and m is the number of instances for each class. Both emoji.npz (36 classes, 1 instance) and mnist.npz (10 classes, 50 instances) are in black & white while trafficsign.npz (16 classes, 25 instances) and aerial.npz (9 classes, 25 instances) are in color.
| Key | Shape | Description |
| --- | --- | --- |
| imgs | (3, 48, n, m, 64, 64) black & white, (3, 48, n, m, 64, 64, 3) color | Images grouped into 3 sets (training, validation, testing) and 48 different scales. Values are in the range 0 to 255. |
| lbls | (3, 48, n, m) | Indices referencing ground truth labels. See lbldata for descriptive names. Values are in the range 0 to n - 1. |
| scls | (3, 48, n, m) | Known scales as given by bounding box size. Values are in the range 17 to 64. |
| psts | (3, 48, n, m, 2) | Known position of the bounding box. The first value is the distance to the left edge, the second value the distance to the top edge. |
| metadata | (6, 2) | Metadata on title, description, author, license, version and date. |
| lbldata | (n,) | Descriptive names for each ground truth label. |
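The array layout above can be navigated with plain NumPy indexing. The sketch below selects all images of one split whose recorded scale matches a requested value, using `scls` as a mask rather than assuming which position on the scale axis holds which scale. The helper `select_by_scale` is hypothetical, the split order (0 = training, 1 = validation, 2 = testing) is assumed from the order in which the sets are listed, and the arrays are small synthetic stand-ins rather than real STIR data.

```python
import numpy as np


def select_by_scale(imgs, scls, split_index, scale):
    """Return the images of one split whose recorded scale equals `scale`.

    Hypothetical helper for illustration; not part of the STIR code base.
    """
    mask = scls[split_index] == scale   # shape (48, n, m)
    return imgs[split_index][mask]      # shape (k, 64, 64) or (k, 64, 64, 3)


# Synthetic stand-in with the documented black & white layout.
rng = np.random.default_rng(0)
n, m = 2, 3
scale_values = np.arange(17, 65)  # 48 integer scales from 17 to 64
scls = np.broadcast_to(scale_values[None, :, None, None], (3, 48, n, m))
imgs = rng.integers(0, 256, size=(3, 48, n, m, 64, 64), dtype=np.uint8)

train_32 = select_by_scale(imgs, scls, split_index=0, scale=32)
print(train_32.shape)  # (6, 64, 64): n * m images at scale 32
```

The same mask can be applied to `lbls` and `psts` for that split to obtain the matching labels and positions in the same order.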
For use in Python a dataset class is provided that implements the basic functionality for loading a certain split and scale selection, as illustrated in the code below. It ensures shuffling is done in a consistent manner such that ground truth scales and positions can be retrieved. Metadata and label descriptions can be retrieved via metadata and labeldata, respectively.
```python
from data.dataset import STIRDataset

dataset = STIRDataset('data/emoji.npz')

# Obtain images and labels for training
images, labels = dataset.to_torch(split='train', scales=[32, 64], shuffle=True)

# Obtain known scales and positions for above
scales, positions = dataset.get_latents(split='train', scales=[32, 64], shuffle=True)

# Get metadata and label descriptions
metadata = dataset.metadata
label_descriptions = dataset.labeldata
```
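The consistency requirement can be illustrated independently of `STIRDataset`. One common way to keep images, labels, scales and positions aligned is to draw a single seeded permutation and apply it to every array. The sketch below shows that idea with a hypothetical helper `shuffled_together`; it is an assumption about how such behavior can be achieved, not a description of the class's actual implementation.

```python
import numpy as np


def shuffled_together(arrays, seed=0):
    """Shuffle several arrays with one shared, seeded permutation.

    Hypothetical helper for illustration; all arrays must have the same length.
    """
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must share the same first dimension")
    # The same seed always yields the same order, so separate calls stay aligned.
    order = np.random.default_rng(seed).permutation(length)
    return [a[order] for a in arrays]


# Stand-in data: sample ids and a latent value tied to each id.
sample_ids = np.arange(10)
latent_scales = sample_ids + 17

ids_shuffled, scales_shuffled = shuffled_together([sample_ids, latent_scales], seed=1)
# Each shuffled id still sits next to its own latent value.
print(np.array_equal(scales_shuffled - ids_shuffled, np.full(10, 17)))  # True
```

Because the permutation depends only on the seed and the array length, latents fetched in a later call with the same seed line up with images fetched earlier.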
License and Attribution
When using this dataset for your own research, please respect the individual licenses of the original data. These are distributed within the data files' metadata. For attribution in papers, we recommend the following citations.
D. Gandy, J. Otero, E. Emanuel, F. Botsford, J. Lundien, K. Jackson, M. Wilkerson, R. Madole, J. Raphael, T. Chase, G. Taglialatela, B. Talbot, and T. Chase. Font Awesome. https://fontawesome.com/v5/download, Nov. 2022.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
C. Ertler, J. Mislej, T. Ollmann, L. Porzi, G. Neuhold, and Y. Kuang. The Mapillary Traffic Sign Dataset for Detection and Classification on a Global Scale. In 2020 16th Eur. Conf. Comput. Vision (ECCV), Glasgow, UK, Aug. 2020.
G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In 2018 IEEE/CVF Conf. Comput. Vision and Pattern Recognition (CVPR), pages 3974–3983, Salt Lake City, UT, USA, June 2018.
Created: 2022-11-23



