COCO Image Recognition Dataset

Name: COCO Image Recognition Dataset
Creator: 帕依提提
License: No description available

Indexed by 帕依提提 on 2024-03-04

Download link:

https://www.payititi.com/opendatasets/show-1248.html

Resource description:


COCO is a large-scale image dataset used in computer vision for object detection and segmentation, person keypoint detection, stuff segmentation, and caption generation. It targets scene understanding, and objects in its images are localized through precise segmentation. The dataset has three defining features: object segmentation, recognition in context, and superpixel segmentation. It contains 330,000 images, 1.5 million object instances, 80 object categories, 91 stuff categories, and 250,000 people annotated with keypoints.

COCO has 91 categories. That is fewer than ImageNet or SUN, but each category has more images, which helps models learn how objects of each category appear in specific scenes. Compared with PASCAL VOC, COCO has both more categories and more images.

The dataset was released in two parts, the first in 2014 and the second in 2015. The 2014 release contains 82,783 training, 40,504 validation, and 40,775 test images, with 270k segmented people and 886k segmented objects. The 2015 release contains 165,482 training, 81,208 validation, and 81,434 test images.

The main characteristics of the dataset are:

1. Object segmentation
2. Recognition in context
3. Multiple objects per image
4. More than 300,000 images
5. More than 2 million instances
6. 80 object categories
7. 5 captions per image
8. Keypoints on 100,000 people

To introduce the dataset, Microsoft published the paper *Microsoft COCO: Common Objects in Context* at the ECCV Workshops. According to the paper, the dataset targets scene understanding: its images are drawn mainly from complex everyday scenes, and objects are localized through precise segmentation. The paper reports 91 object categories, 328,000 images, and 2,500,000 labels.
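As a quick sanity check on the release figures quoted above, the sketch below sums the split sizes and computes each split's share. The numbers are copied from the text; the helper name `summarize` and the dictionary layout are illustrative choices, not part of any COCO tooling.

```python
# Image counts per split, copied from the release figures quoted in the text.
RELEASES = {
    "2014": {"train": 82_783, "val": 40_504, "test": 40_775},
    "2015": {"train": 165_482, "val": 81_208, "test": 81_434},
}


def summarize(splits):
    """Return the total image count and each split's share of that total."""
    total = sum(splits.values())
    shares = {name: count / total for name, count in splits.items()}
    return total, shares


for year, splits in RELEASES.items():
    total, shares = summarize(splits)
    pretty = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{year}: {total:,} images ({pretty})")

# Average labels per image, using the paper's rounded figures from the text.
instances_per_image = 2_500_000 / 328_000
print(f"about {instances_per_image:.1f} labels per image")
```

The 2014 splits sum to 164,062 images and the 2015 splits to 328,124. The latter lines up with the roughly 328,000 images attributed to the paper, which suggests the 2015 figures are cumulative totals rather than additions to 2014. That reading is an inference from the arithmetic, not something this listing states.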
The COCO dataset primarily addresses three problems: object detection, the contextual relationships between objects, and precise 2D localization of objects.

### Comparison with Other Vision Datasets

#### Image Classification

Image classification requires binary labels indicating whether an object is present in an image. Early datasets mostly contained single objects on blank backgrounds, such as the MNIST handwritten digit database and COIL household objects. Well-known datasets in the machine learning community include CIFAR-10 and CIFAR-100, which provide 10 and 100 categories respectively on 32×32 images. The best-known recent classification dataset is ImageNet, with 22,000 categories and 500–1000 images per category.

#### Object Detection

Classically, object detection localizes objects with bounding boxes. It was first applied mainly to face detection and pedestrian detection; the Caltech Pedestrian Dataset, for example, contains 350,000 bounding box labels. PASCAL VOC includes 20 object categories in over 11,000 images, with more than 27,000 object bounding boxes. More recently, a detection dataset derived from ImageNet covers 200 categories, 400,000 images, and 350,000 bounding boxes. Because some objects are strongly related to one another rather than existing independently, detecting an object within its specific scene is meaningful, and precise position information matters more than a bounding box.

#### Semantic Scene Labeling

This task requires pixel-level labels, and some individual objects, such as streets and grass, are hard to delineate. Datasets exist for both indoor and outdoor scenes, and some include depth information. The SUN dataset covers 908 scene categories and 3,819 object categories, spanning regular object categories (person, chair, car) and semantic scene categories (wall, sky, floor). The number of instances per category varies widely. COCO improves on this by ensuring that every category has enough instances.

#### Other Vision Datasets

Other datasets include the Middlebury datasets, which contain stereo pairs, multi-view stereo pairs, and optical flow, and the Berkeley Segmentation Data Set (BSDS500), which can be used to evaluate segmentation and edge detection algorithms.

[Figure: performance comparison and sample images]
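The Object Detection comparison above argues that precise position information matters more than a bounding box. The sketch below illustrates why on one hypothetical outline: it computes the polygon's area with the shoelace formula and compares it with the area of the tightest axis-aligned box. The flat `[x1, y1, x2, y2, ...]` polygon layout and the `[x, y, width, height]` box layout are assumptions chosen for illustration, since this listing does not describe the annotation format.

```python
def polygon_area(flat_xy):
    """Area of a simple polygon given as [x1, y1, x2, y2, ...] (shoelace formula)."""
    points = list(zip(flat_xy[0::2], flat_xy[1::2]))
    twice_area = 0.0
    # Pair every vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box(flat_xy):
    """Tightest axis-aligned box around the polygon as [x, y, width, height]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


# Hypothetical L-shaped object outline, in pixels (not real COCO data).
outline = [0, 0, 40, 0, 40, 10, 10, 10, 10, 30, 0, 30]

x, y, w, h = bounding_box(outline)
mask_area = polygon_area(outline)
box_area = w * h
print(f"bbox = {[x, y, w, h]}, box area = {box_area}")
print(f"polygon area = {mask_area}, fill ratio = {mask_area / box_area:.2f}")
```

For this outline the box covers 1,200 square pixels while the object itself covers 600, so half of the box is background. A segmentation records that difference; a bounding box alone cannot.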

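Provider:

帕依提提

Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

The COCO dataset is a large-scale image dataset for machine vision, containing 330,000 images, 1.5 million object instances, and 80 object categories. It is mainly used for object detection, segmentation, and scene understanding. Each category has many images and the annotations are precise, which makes the dataset suitable for complex scene analysis tasks.

The content above was collected and summarized by 遇见数据集.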
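The overview stresses that each category has many images. A minimal sketch of how one might check that claim on an annotation file follows. The key names (`images`, `annotations`, `categories`, `image_id`, `category_id`) are assumed here to follow the layout commonly used for COCO-style annotation JSON; this listing does not document the format, and the records below are hand-written stand-ins rather than real data.

```python
from collections import Counter

# Tiny hand-written stand-in for a COCO-style annotation file (hypothetical data).
dataset = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "chair"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 2},
        {"id": 12, "image_id": 2, "category_id": 1},
        {"id": 13, "image_id": 3, "category_id": 1},
        {"id": 14, "image_id": 3, "category_id": 1},
    ],
}


def per_category_stats(data):
    """Map each category name to (instance count, number of distinct images)."""
    names = {cat["id"]: cat["name"] for cat in data["categories"]}
    instances = Counter()
    images = {name: set() for name in names.values()}
    for ann in data["annotations"]:
        name = names[ann["category_id"]]
        instances[name] += 1
        images[name].add(ann["image_id"])
    return {name: (instances[name], len(images[name])) for name in names.values()}


stats = per_category_stats(dataset)
for name, (n_instances, n_images) in stats.items():
    print(f"{name}: {n_instances} instances in {n_images} images")
```

Counting distinct images separately from instances matters because a single image can contain several objects of the same category, as image 3 does here.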