
The Met Dataset | Artwork Recognition Dataset | Computer Vision Dataset

arXiv · Updated 2022-02-04 · Indexed 2024-06-21
Artwork recognition
Computer vision
Download link:
http://cmp.felk.cvut.cz/met/
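Overview:
The Met Dataset is designed for large-scale instance-level artwork recognition. It was created jointly by researchers from several institutions, including the Faculty of Electrical Engineering at Czech Technical University in Prague. It contains 418,605 images covering 224,408 distinct artwork classes; the works come from around the world and range in date from the Paleolithic to the present. The images were selected and annotated from the open-access collection of The Metropolitan Museum of Art to keep data quality and label accuracy high. Beyond serving computer vision research on art, the dataset provides a benchmark for developing and evaluating new recognition methods, in particular for instance-level recognition of artworks.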
Provided by:
Faculty of Electrical Engineering, Czech Technical University in Prague
Created:
2022-02-04
AI-Compiled Summary
Dataset Introduction
Construction
The Met Dataset is built from the open-access collection of The Metropolitan Museum of Art (The Met) in New York. The training set comprises approximately 400,000 images in over 224,000 classes, each corresponding to a distinct museum exhibit. These images are captured under controlled studio conditions. The test set includes images taken by museum visitors, which introduces a distribution shift between training and testing. A set of distractor images unrelated to The Met exhibits is also included, so that the task resembles out-of-distribution detection. The dataset follows the evaluation protocol of the Google Landmarks Dataset (GLD), which encourages research on domain-independent instance-level recognition approaches.
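As a rough illustration of what a GLD-style evaluation involves, the sketch below computes a global (micro) average precision over top-1 predictions. It assumes one prediction per query and marks distractor queries with a true class of `None`. The tuple layout and the function name are hypothetical, chosen for this example and not taken from the dataset's official evaluation code.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (predicted_class, confidence, true_class).
# true_class is None for a distractor query (no Met exhibit is shown).
Prediction = Tuple[int, float, Optional[int]]


def global_average_precision(predictions: List[Prediction]) -> float:
    """Micro average precision over all queries, ranked by confidence."""
    # Only queries that show a Met exhibit count toward the denominator.
    num_relevant = sum(1 for _, _, truth in predictions if truth is not None)
    if num_relevant == 0:
        return 0.0

    # Rank every query's top-1 prediction by descending confidence.
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)

    correct_so_far = 0
    total = 0.0
    for rank, (pred, _, truth) in enumerate(ranked, start=1):
        if truth is not None and pred == truth:
            correct_so_far += 1
            total += correct_so_far / rank  # precision at this rank
    return total / num_relevant


# Toy example: two correct Met queries, one distractor, one wrong Met query.
toy = [(1, 0.9, 1), (2, 0.8, None), (3, 0.7, 3), (4, 0.6, 5)]
gap = global_average_precision(toy)
```

A confident prediction on a distractor pushes correct predictions further down the ranking, so a metric of this kind rewards models that assign low confidence to out-of-distribution queries.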
Features
The Met Dataset poses large-scale instance-level recognition challenges: high inter-class similarity, a long-tail class distribution, and a very large number of classes. Training images are taken under studio conditions, while test images are captured by museum visitors, which introduces a significant distribution shift. This setup tests both a model's ability to recognize artworks under varying conditions and its ability to handle out-of-distribution queries. Labels were annotated and verified to keep noise low, which makes the dataset a reliable benchmark for instance-level recognition research. Its public availability and adherence to the GLD evaluation protocol also encourage comparative studies of domain-agnostic recognition techniques.
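To make the long-tail property concrete, the following minimal sketch summarizes how many images each class has, given a flat list of training labels. The label list here is toy data and the function names are hypothetical; the dataset's real annotation format is not assumed.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def class_size_histogram(labels: Sequence[Hashable]) -> Dict[int, int]:
    """Map class size (images per class) to the number of classes of that size."""
    per_class = Counter(labels)               # images per class
    return dict(Counter(per_class.values()))  # classes per class size


def singleton_fraction(labels: Sequence[Hashable]) -> float:
    """Fraction of classes that have exactly one training image."""
    per_class = Counter(labels)
    if not per_class:
        return 0.0
    singletons = sum(1 for n in per_class.values() if n == 1)
    return singletons / len(per_class)


# Toy labels: class 1 has 3 images, class 4 has 2, classes 2 and 3 have 1 each.
toy_labels = [1, 1, 1, 2, 3, 4, 4]
hist = class_size_histogram(toy_labels)
frac = singleton_fraction(toy_labels)
```

In a long-tailed dataset most of the mass of such a histogram sits at very small class sizes, which is one reason methods that need many examples per class can struggle.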
Usage
The Met Dataset is designed for training and evaluating models on instance-level recognition of artworks. Researchers can use it to develop and test models that classify and retrieve artworks from images taken under diverse conditions. Because the training set consists of studio-captured images and the test set of visitor-taken images, the dataset allows model robustness and generalization to be assessed. The distractor images additionally make it possible to evaluate out-of-distribution detection. Various machine learning techniques, including deep learning models, can be used to extract features and classify images. The dataset is publicly available and documented, which supports reproducibility and comparative analysis.
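One simple way to classify with extracted features is a nearest-neighbor vote over the training set. The sketch below is a minimal, dependency-free version that assumes feature vectors have already been computed by some backbone. The toy vectors, the class ids, and the similarity-weighted voting rule are illustrative assumptions, not the method of the dataset's authors.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_classify(
    query: Sequence[float],
    train_feats: Sequence[Sequence[float]],
    train_labels: Sequence[int],
    k: int = 3,
) -> Tuple[int, float]:
    """Return (class, score) from a cosine-similarity-weighted k-NN vote.

    Expects a non-empty training set.
    """
    q = l2_normalize(query)
    scored = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, f)), label))
    # Keep the k most similar training images.
    scored.sort(key=lambda s: s[0], reverse=True)

    votes: Dict[int, float] = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim  # each neighbor votes with its similarity
    best = max(votes, key=votes.get)
    return best, votes[best]


# Toy features: two studio images of exhibit 10, one of exhibit 20.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = [10, 10, 20]
label, score = knn_classify([1.0, 0.05], train_feats, train_labels, k=2)
```

The returned score can double as a confidence value for ranking queries. With hundreds of thousands of training images, an approximate nearest-neighbor index would normally replace the linear scan shown here.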
Background and Challenges
Background Overview
The Met Dataset was introduced in 2021 by researchers from Czech Technical University in Prague, Osaka University, Columbia University, and the University of Amsterdam as a large-scale benchmark for instance-level recognition of artworks. It uses the open-access collection of The Metropolitan Museum of Art (The Met) to form a training set of approximately 400,000 images from over 224,000 unique exhibits. Each exhibit defines its own class, which suits the dataset to instance-level classification. The dataset addresses the need for large-scale, accurately labeled data for instance-level recognition, particularly for artworks, a domain that has historically attracted less attention than category-level recognition. Besides supporting research on artwork recognition, it serves as a benchmark for domain-independent approaches to instance-level recognition.
Current Challenges
The Met Dataset presents several challenges. First, instance-level recognition of artworks is inherently difficult because of the large inter-class similarity, the long-tail distribution of classes, and the sheer number of classes. Second, there is a distribution shift between training images, taken under studio conditions, and test images, captured by museum visitors. Third, the test set includes out-of-distribution (OOD) images, so the task also resembles an out-of-distribution detection problem. Creating the dataset required careful annotation and verification of labels, a tedious process given the scale and diversity of the collection. These properties make the Met Dataset a demanding benchmark for instance-level recognition research and a basis for future comparative studies.
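A common baseline for the out-of-distribution part of such a task is to reject any query whose top confidence falls below a threshold. The sketch below measures how one threshold trades distractor rejection against acceptance of genuine Met queries. The confidence values, flags, and function name are made up for illustration, and this is not presented as the benchmark's official procedure.

```python
from typing import Sequence, Tuple


def ood_rates(
    confidences: Sequence[float],
    is_distractor: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (distractor rejection rate, in-distribution acceptance rate).

    A query is accepted when its confidence is >= threshold.
    """
    rejected_distractors = 0
    accepted_met = 0
    num_distractors = 0
    num_met = 0
    for conf, distractor in zip(confidences, is_distractor):
        accepted = conf >= threshold
        if distractor:
            num_distractors += 1
            rejected_distractors += 0 if accepted else 1
        else:
            num_met += 1
            accepted_met += 1 if accepted else 0
    reject_rate = rejected_distractors / num_distractors if num_distractors else 0.0
    accept_rate = accepted_met / num_met if num_met else 0.0
    return reject_rate, accept_rate


# Toy scores: two Met queries (0.9, 0.6) and two distractors (0.2, 0.4).
confs = [0.9, 0.2, 0.6, 0.4]
flags = [False, True, False, True]
strict = ood_rates(confs, flags, threshold=0.5)
loose = ood_rates(confs, flags, threshold=0.3)
```

Sweeping the threshold over a validation split gives the full trade-off curve, from which an operating point can be chosen.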
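Common Use Cases
Classic Use Case
The classic use of The Met Dataset is training and testing models for instance-level recognition over a large artwork database. Built from the open-access collection of The Metropolitan Museum of Art, its training set contains about 224,000 classes, each corresponding to one museum exhibit. The test set consists mainly of photos taken by museum visitors, which introduces a distribution shift between training and testing and makes the task more challenging.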
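Derived and Related Work
The release of The Met Dataset has prompted a series of related studies, particularly on instance-level artwork recognition and cross-domain instance-level recognition. Researchers have used the dataset to develop a variety of deep learning models, for example models that combine self-supervised learning with supervised contrastive learning, which markedly improved recognition performance. The dataset has also stimulated further work on long-tail data and out-of-distribution detection, advancing computer vision research in these directions.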
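Recent Research on the Dataset
Latest Research Directions
In the artwork domain, The Met Dataset opens new opportunities for research on large-scale instance-level recognition. Built from the open-access collection of The Metropolitan Museum of Art, its training set contains about 224k classes, each corresponding to one museum exhibit photographed under studio conditions. The test set consists mainly of photos taken by museum visitors, which introduces a distribution shift between training and testing. The dataset also includes a set of images unrelated to Met exhibits, so that the task resembles an out-of-distribution detection problem. It follows the paradigm of instance-level recognition datasets in other domains and thereby encourages research on domain-independent methods. Recent research directions include training backbone networks with a combination of self-supervised and supervised contrastive learning, and exploring non-parametric classification methods; both show promise for large-scale instance-level recognition.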
Related Research Papers
  • The Met Dataset: Instance-level Recognition for Artworks. Faculty of Electrical Engineering, Czech Technical University in Prague · 2022
The content above was collected and summarized by AI.