
10 Big Cats of the Wild - Image Classification

Source: www.kaggle.com. Updated 2023-02-28. Indexed 2025-03-25.
Download link:
https://www.kaggle.com/gpiosenka/cats-in-the-wild-image-classification
Description:
Images were gathered from Google searches and downloaded using the app 'download all images'. I highly recommend this app: it is very fast and returns a zip file of the images, which you can then unzip to a specific directory.

I have developed a custom set of tools to create datasets. The first tool creates a dataset framework in a specified directory that I call Datasets. It takes the name of the new dataset, creates a directory with that name, and within it creates four subdirectories: train, test, valid and storage. The storage directory is where the unzipped downloaded images are placed.

Downloaded images can be a crazy mix of ungodly file names and image formats. I wrote a Python program called order_by_size that operates on the downloaded images in the storage directory. It removes files with extensions other than jpg, png or bmp and deletes files below a user-specified image size. It then renames the files sequentially using zero-padded numbers, converts them to jpg format, and orders them so that the first file is the largest image, the second file is the next largest, and so on. You want to start with large images because they will later be cropped to a region of interest, and the cropped images need to be large enough, with a sufficient pixel count, for your classification model to extract features.

Now that the files are sequentially ordered and have jpg extensions, I use another program called duplicate delete. It uses file hashing to detect duplicate images and deletes any duplicates. This prevents the train, test and validation sets from having images in common when the files are partitioned.

A Google search returns a lot of what you want and also a lot of junk. I wrote another Python program called review_images that shows each image in the storage directory in sequence and lets you keep or delete it, depending on whether it is the type of image you want. This eliminates unwanted images from the storage directory.

Then comes the hard part. If you want to build a high quality dataset, you should crop your images so that the resulting image has a high ratio of pixels in the region of interest to the total number of pixels. For that I use Paint Shop Pro version 9. If you examine the dataset images, you will see that in most cases the cat takes up at least 50% of the pixels in the image.

After all that is done, I run the order_by_size program again with different parameters to convert all the images to a specified size. For this dataset I used 224 x 224 x 3 as the image size. We now have a uniform, ordered and properly pruned set of images for a specific class, tigers for example.

I wrote another Python program called make_class. It takes the new class name (tiger, for example) and creates a new class subdirectory in the train, test and valid directories. It then partitions the images in the storage directory into train, test and validation images and stores them in the class directory of the train, test and valid directories. Finally, I wrote another Python program that creates a dataset csv file. Making a high quality dataset takes a lot of work, but the tools I have developed help to reduce the workload.
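The duplicate-removal step described above relies on file hashing. The author's duplicate delete program is not shown, so the following is only a minimal sketch of the idea: the function names, the choice of SHA-256, and the rule of keeping the first file in sorted order are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_duplicates(storage_dir: str) -> list[str]:
    """Delete byte-identical files in storage_dir, keeping the first of each.

    Hypothetical helper, not the author's program. Files are visited in
    sorted name order, so with zero-padded sequential names the
    lowest-numbered copy is the one that survives.
    Returns the names of the deleted files.
    """
    seen: set[str] = set()
    deleted: list[str] = []
    for path in sorted(Path(storage_dir).iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            path.unlink()  # same bytes as an earlier file: remove it
            deleted.append(path.name)
        else:
            seen.add(digest)
    return deleted
```

Hashing the file bytes only catches exact copies. Two files holding the same picture at different resolutions or compression levels hash differently and would both survive this pass.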
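The partitioning that make_class performs could look roughly like the sketch below. The description does not state the split ratios or how files are chosen for each split, so the fractions, the seeded shuffle, and the function signature here are assumptions rather than the author's implementation. Only the directory names (train, test, valid, storage) come from the description.

```python
import random
import shutil
from pathlib import Path


def make_class(dataset_dir: str, class_name: str,
               test_frac: float = 0.1, valid_frac: float = 0.1,
               seed: int = 0) -> dict[str, int]:
    """Move images from <dataset_dir>/storage into per-class split folders.

    The split fractions and the seeded shuffle are assumptions; the
    original tool's ratios and selection rule are not stated.
    Returns the number of files moved into each split.
    """
    root = Path(dataset_dir)
    files = sorted(p for p in (root / "storage").iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # reproducible shuffle

    n_test = int(len(files) * test_frac)
    n_valid = int(len(files) * valid_frac)
    splits = {
        "test": files[:n_test],
        "valid": files[n_test:n_test + n_valid],
        "train": files[n_test + n_valid:],
    }

    counts: dict[str, int] = {}
    for split, members in splits.items():
        # Create e.g. train/tiger, test/tiger, valid/tiger as needed.
        class_dir = root / split / class_name
        class_dir.mkdir(parents=True, exist_ok=True)
        for path in members:
            shutil.move(str(path), str(class_dir / path.name))
        counts[split] = len(members)
    return counts
```

Because each file is moved exactly once, no image can end up in more than one split, which is the property the duplicate check is meant to protect.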
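The final step, building a dataset csv file, is not described in detail. As one possible sketch, the code below walks the train, test and valid directories and writes one row per image. The column names (filepath, label, split), the default file name, and the function name are hypothetical and may not match the CSV shipped with the dataset.

```python
import csv
from pathlib import Path


def write_dataset_csv(dataset_dir: str, csv_name: str = "dataset.csv") -> int:
    """Write one CSV row per image found under train/, test/ and valid/.

    The column names and output file name are hypothetical; the source
    does not describe the real CSV layout.
    Returns the number of rows written, excluding the header.
    """
    root = Path(dataset_dir)
    rows = []
    for split in ("train", "test", "valid"):
        split_dir = root / split
        if not split_dir.is_dir():
            continue
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            for image in sorted(p for p in class_dir.iterdir() if p.is_file()):
                # Store paths relative to the dataset root, with forward slashes.
                rel = image.relative_to(root).as_posix()
                rows.append((rel, class_dir.name, split))

    with (root / csv_name).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(("filepath", "label", "split"))
        writer.writerows(rows)
    return len(rows)
```

Storing relative paths keeps the CSV usable after the dataset directory is moved or unzipped somewhere else.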

Provider:
www.kaggle.com