student/birds_400

Name: student/birds_400
Creator: student
Published: 2022-04-18 03:15:55
License: No description available

Source: Hugging Face. Updated: 2022-04-18. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/student/birds_400


Description:

400-Species Bird Image Classification Dataset

This dataset covers 400 bird species, with 58,388 training images, 2,000 test images (5 per species), and 2,000 validation images (5 per species). All images are 224×224×3 color images in JPEG format, collected via web searches by species name. It is a high-quality dataset: each image contains exactly one bird, and the bird typically occupies at least 50% of the pixels. As a result, even a moderately complex model can reach training and test accuracies in the high-90% range.

The dataset includes a training set, a test set, and a validation set, each containing 400 subdirectories, one per species. This layout is convenient for Keras' ImageDataGenerator, whose flow_from_directory method can create the training, test, and validation data generators.

The dataset also includes a Bird Species.csv file. The file is described as having three columns, of which two are named here: the "File Path" column holds the path of each image file, and the "Label" column holds the class name (bird species) of that image. Reading the file with `pd.read_csv("Bird Species.csv")` yields a pandas DataFrame that can be split into train_df, test_df, and val_df to create your own training, test, and validation partitions.

Note: the test and validation images in this dataset were hand-selected as the "best" images, so a model will likely score higher on them than on test and validation sets you create yourself. Scores on your own splits give a more accurate picture of model performance on unseen images.
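Creating your own split from the CSV, as the description suggests, could look roughly like the sketch below. It uses the standard-library csv module rather than pandas so that it runs without dependencies. The column headers "File Path" and "Label" are taken from the description and may not match the header strings in the real file. The sample rows, species names, split fractions, and the helper names make_sample_csv and stratified_split are all hypothetical.

```python
import csv
import io
import random
from collections import defaultdict


def make_sample_csv(species=("SPECIES_A", "SPECIES_B"), per_species=10):
    """Build a small in-memory stand-in for Bird Species.csv (hypothetical rows)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["File Path", "Label"])  # assumed header names
    for name in species:
        for i in range(1, per_species + 1):
            writer.writerow([f"train/{name}/{i:03d}.jpg", name])
    return buf.getvalue()


def stratified_split(rows, test_frac=0.1, val_frac=0.1, seed=0):
    """Split rows label by label so every species appears in each subset."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["Label"]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test, val = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_frac))
        n_val = max(1, round(len(group) * val_frac))
        test.extend(group[:n_test])
        val.extend(group[n_test:n_test + n_val])
        train.extend(group[n_test + n_val:])
    return train, test, val


rows = list(csv.DictReader(io.StringIO(make_sample_csv())))
train_rows, test_rows, val_rows = stratified_split(rows)
print(len(train_rows), len(test_rows), len(val_rows))  # 16 2 2
```

Splitting per label rather than over the whole file keeps rare species from vanishing from the test or validation subset, which matters here because the number of training images varies by species.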
After the image files for a species were downloaded, a Python duplicate image detector program that I developed was used to find duplicates. All detected duplicates were removed so that no image appears in more than one of the training, test, and validation sets. The images were then cropped so that the bird occupies at least 50% of the pixels and resized to 224×224×3 in JPEG format. Cropping ensures that the images carry enough information for a CNN to produce a highly accurate classifier. Even a moderately robust model should achieve training, validation, and test accuracies in the high-90% range. Because the dataset is large, I recommend trying a model and image size of 150×150×3 to reduce training time.

All files are numbered sequentially starting from one for each species. Test images are named 1.jpg through 5.jpg, and the same holds for validation images. Training images are also numbered sequentially, with zero-padding: for example 001.jpg, 002.jpg, ..., 010.jpg, 011.jpg, ..., 099.jpg, 100.jpg, and so on. The zero-padding preserves file order when the files are used with Python file functions and Keras' flow_from_directory.

The training set is imbalanced, with a different number of files per species, though every species has at least 120 training images. This imbalance did not affect the classifier I built, which achieved over 98% accuracy on the test set. A more significant imbalance is the ratio of male to female images: about 85% of the images show males and 15% show females. Males are typically far more diversely colored, while females of a species are usually plain, so male and female images of the same species can look entirely different. Almost all test and validation images show males, so a classifier may not perform well on images of females.
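The point about zero-padded file names can be checked with a short sketch. It compares plain string sorting, which is what a directory listing followed by sorted() gives, on padded and unpadded names. The specific numbers are chosen only for illustration.

```python
numbers = (1, 2, 10, 11, 99, 100)

# Zero-padded names, as used for the training images.
padded = [f"{n:03d}.jpg" for n in numbers]
# The same numbers without padding, for comparison.
unpadded = [f"{n}.jpg" for n in numbers]

sorted_padded = sorted(padded)      # lexicographic order equals numeric order
sorted_unpadded = sorted(unpadded)  # lexicographic order scrambles the sequence

print(sorted_padded)
# ['001.jpg', '002.jpg', '010.jpg', '011.jpg', '099.jpg', '100.jpg']
print(sorted_unpadded)
# ['1.jpg', '10.jpg', '100.jpg', '11.jpg', '2.jpg', '99.jpg']

# Without padding, a numeric sort key restores the intended order.
numeric_order = sorted(unpadded, key=lambda name: int(name.split(".")[0]))
```

The test and validation images (1.jpg through 5.jpg) are unpadded, but with only single-digit names their string order and numeric order coincide.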

Provider:

student

Original Information Summary

Dataset Overview

Dataset Name

Birds 400: Species Image Classification

Dataset Composition

Training set: 58,388 images
Test set: 2,000 images
Validation set: 2,000 images

Image Specifications

Format: JPEG (.jpg)
Size: 224 × 224 × 3 (color)

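Dataset Characteristics

Each image contains a single bird, which typically occupies at least 50% of the pixels.
Images were collected via web searches by species name, then de-duplicated and cropped.
The training set is imbalanced, but every species has at least 120 training images.
The sex ratio is imbalanced: about 85% of images show males and 15% show females.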
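One common way to compensate for an uneven number of training images per species is inverse-frequency class weighting. The sketch below is an assumption added for illustration, not something the dataset author reports using; the author states that the imbalance did not hurt their classifier. The label names and counts are hypothetical. The formula matches the "balanced" heuristic used by scikit-learn: n_samples / (n_classes * count).

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class that is larger for rarer classes."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label list: one species at the stated minimum of 120 images,
# another with more.
labels = ["SPECIES_A"] * 120 + ["SPECIES_B"] * 180
weights = inverse_frequency_weights(labels)
print(weights["SPECIES_A"])  # 1.25
```

This addresses only the per-species imbalance. The male/female imbalance cannot be reweighted this way, because the description mentions no sex label for the images.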

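Usage Recommendations

Try a model and image size of 150 × 150 × 3 to reduce training time.
Because the test and validation images are hand-selected "best" images, accuracy scores on them may be high and may overstate performance on unseen images.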
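To get a rough sense of what the smaller image size saves, the arithmetic below compares the number of input values per image at the two sizes. It is only a back-of-the-envelope sketch: actual training time depends on the model architecture, hardware, and data pipeline, none of which are specified here. The helper name input_values is hypothetical.

```python
def input_values(height, width, channels=3):
    """Number of scalar values in one image tensor."""
    return height * width * channels


full = input_values(224, 224)     # native size of the dataset images
reduced = input_values(150, 150)  # suggested smaller training size
ratio = reduced / full

print(full, reduced, round(ratio, 3))  # 150528 67500 0.448
```

Each 150 × 150 image carries a little under half the values of a 224 × 224 image, so layers whose cost scales with spatial size do correspondingly less work per image.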

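Dataset Structure

Each set contains 400 subdirectories, one per bird species.
A Bird Species.csv file is included, with columns that include "File Path" and "Label".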
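Given one subdirectory per species, the number of images per class can be tallied by walking a split directory. The sketch below assumes that layout and builds a tiny throwaway tree in a temporary directory to demonstrate it. The directory name "train", the species names, and the helper names count_images_per_class and build_demo_tree are hypothetical.

```python
import tempfile
from pathlib import Path


def count_images_per_class(split_dir):
    """Map each species subdirectory name to its number of .jpg files."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(
            1 for p in class_dir.iterdir() if p.suffix.lower() == ".jpg"
        )
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


def build_demo_tree(root, counts):
    """Create empty placeholder .jpg files, one folder per species."""
    for name, n in counts.items():
        class_dir = Path(root) / name
        class_dir.mkdir(parents=True)
        for i in range(1, n + 1):
            (class_dir / f"{i:03d}.jpg").touch()


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(Path(tmp) / "train", {"SPECIES_A": 3, "SPECIES_B": 5})
    demo_counts = count_images_per_class(Path(tmp) / "train")

print(demo_counts)  # {'SPECIES_A': 3, 'SPECIES_B': 5}
```

Run against the real training directory, the same function would show how uneven the per-species counts are and let you confirm the stated minimum of 120 images per species.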

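Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

The Birds 400 dataset contains images of 400 bird species, 62,388 in total (58,388 training, 2,000 test, 2,000 validation), all in 224×224×3 JPEG format. The dataset has been de-duplicated and cropped so that the bird occupies at least 50% of the pixels in each image, which makes it well suited to image classification tasks. Note, however, that images of males far outnumber images of females.

The above content was collected and summarized by 遇见数据集.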
