sri-um/PureForest-NL
Source: Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sri-um/PureForest-NL
Resource description:
---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '10'
          '2': '12'
          '3': '13'
          '4': '14'
          '5': '15'
          '6': '2'
          '7': '6'
          '8': '9'
  - name: DG
    dtype: int64
  splits:
  - name: train
    num_bytes: 4359731302
    num_examples: 31584
  - name: validation
    num_bytes: 936742270
    num_examples: 6932
  - name: test
    num_bytes: 1578307060
    num_examples: 10653
  download_size: 6876558734
  dataset_size: 6874780632
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
tags:
- climate
size_categories:
- 10K<n<100K
---
The images and their labels come from: https://huggingface.co/datasets/IGNF/PureForest
The researchers provide images for 18 different tree species.
Not all of these 18 species occur in the Netherlands, however. According to the national forest inventory, we only have:
Abies alba
Castanea sativa
Fagus sylvatica
Larix decidua
Picea abies
Pinus sylvestris
Pseudotsuga menziesii
Quercus robur
Quercus petraea
Quercus rubra
Robinia pseudoacacia
I manually download the imagery folders for these species from the link above. In total this amounts to roughly 50,000 images (49,169 across the train, validation and test splits listed in the metadata).
The original file structure is like this:
Imagery-Abies-Alba\
\train\
\val\
\test\
For most species I don't need to change this. For the three Quercus species, however, I pull all the train (val, test) images into the train (val, test) split of Quercus robur and then drop the species-level distinction, so that the folder represents the genus Quercus. Manual checking confirms that the total number of images is preserved.
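The Quercus merge described above could be scripted roughly as follows. This is only a sketch: the folder names `Imagery-Quercus-robur`, `Imagery-Quercus-petraea` and `Imagery-Quercus-rubra` are assumed from the `Imagery-Abies-Alba` pattern, and the helper names `count_images` and `merge_species` are made up for illustration. The count check at the end mirrors the manual check mentioned in the text.

```python
import shutil
from pathlib import Path
from typing import Iterable

SPLITS = ("train", "val", "test")


def count_images(species_dir: Path) -> int:
    """Count the files in all split folders of one species directory."""
    total = 0
    for split in SPLITS:
        split_dir = species_dir / split
        if split_dir.is_dir():
            total += sum(1 for p in split_dir.iterdir() if p.is_file())
    return total


def merge_species(root: Path, sources: Iterable[str], target: str) -> int:
    """Move every image of the source species into the matching split of target.

    Returns the number of images in target afterwards and raises if the
    total number of images changed during the merge.
    """
    sources = list(sources)
    expected = count_images(root / target) + sum(count_images(root / s) for s in sources)
    for source in sources:
        for split in SPLITS:
            src_dir = root / source / split
            if not src_dir.is_dir():
                continue
            dest_dir = root / target / split
            dest_dir.mkdir(parents=True, exist_ok=True)
            # Materialise the listing first, because files are moved while looping.
            for img in sorted(src_dir.iterdir()):
                if not img.is_file():
                    continue
                if (dest_dir / img.name).exists():
                    raise FileExistsError(f"Name clash: {dest_dir / img.name}")
                shutil.move(str(img), str(dest_dir / img.name))
    merged = count_images(root / target)
    if merged != expected:
        raise RuntimeError(f"Expected {expected} images after merge, found {merged}")
    return merged
```

With the assumed folder names, the call would look like `merge_species(root, ["Imagery-Quercus-petraea", "Imagery-Quercus-rubra"], "Imagery-Quercus-robur")`. Raising on a name clash rather than overwriting keeps the image total honest.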
A Hugging Face image dataset expects the data in the following layout:
/train/label/images
I manually change the folder structure by copying the files into a new, separate folder.
Before doing this step, we need to convert the labels from strings to integers, using the same label map as when we made the original Dominant Genus splits in the Dutch dataset.
The original set is as follows:
label_map = {
    'Quercus': 0,
    'Alnus': 1,
    'Pinus': 2,
    'Betula': 3,
    'Acer': 4,
    'Fraxinus': 5,
    'Fagus': 6,
    'Populus': 7,
    'Salix': 8,
    'Larix': 9,
    'Pseudotsuga': 10,
    'Sorbus': 11,
    'Picea': 12,
    'Castanea': 13,
    'Abies': 14,
    'Robinia': 15,
    'Ulmus': 16,
    'Carpinus': 16,
    'Prunus': 16,
    'Tilia': 16,
    'Aesculus': 16,
    'Tsuga': 16,
}
In the case of the French dataset we are left with:
label_map = {
    'Quercus': 0,
    'Pinus': 2,
    'Fagus': 6,
    'Larix': 9,
    'Pseudotsuga': 10,
    'Picea': 12,
    'Castanea': 13,
    'Abies': 14,
    'Robinia': 15,
}
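The relabelling and the folder restructuring were done by hand, but the two steps could be sketched in code like this. Several things here are assumptions for illustration: that the genus is the second dash-separated field of a folder name such as `Imagery-Abies-Alba`, that the source split `val` is renamed to `validation` to match the split names in the metadata, and the helper names `genus_of` and `copy_to_imagefolder`.

```python
import shutil
from pathlib import Path

# Label map of the Dutch dataset, copied from the text above.
ORIGINAL_LABEL_MAP = {
    "Quercus": 0, "Alnus": 1, "Pinus": 2, "Betula": 3, "Acer": 4, "Fraxinus": 5,
    "Fagus": 6, "Populus": 7, "Salix": 8, "Larix": 9, "Pseudotsuga": 10,
    "Sorbus": 11, "Picea": 12, "Castanea": 13, "Abies": 14, "Robinia": 15,
    "Ulmus": 16, "Carpinus": 16, "Prunus": 16, "Tilia": 16, "Aesculus": 16,
    "Tsuga": 16,
}

# Genera that remain in the French subset; their indices are reused unchanged.
FRENCH_GENERA = ["Quercus", "Pinus", "Fagus", "Larix", "Pseudotsuga",
                 "Picea", "Castanea", "Abies", "Robinia"]
french_label_map = {genus: ORIGINAL_LABEL_MAP[genus] for genus in FRENCH_GENERA}

# Assumption: source folders use train/val/test, the target uses the metadata names.
SPLIT_NAMES = {"train": "train", "val": "validation", "test": "test"}


def genus_of(folder_name: str) -> str:
    """Assumes names like 'Imagery-Abies-Alba': the genus is the second field."""
    return folder_name.split("-")[1]


def copy_to_imagefolder(src_root: Path, out_root: Path) -> int:
    """Copy <species>/<split>/<image> to <out_root>/<split>/<label>/<image>.

    Returns the number of copied files.
    """
    copied = 0
    for species_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        label = french_label_map[genus_of(species_dir.name)]
        for src_split, dst_split in SPLIT_NAMES.items():
            split_dir = species_dir / src_split
            if not split_dir.is_dir():
                continue
            dest = out_root / dst_split / str(label)
            dest.mkdir(parents=True, exist_ok=True)
            for img in sorted(split_dir.iterdir()):
                if img.is_file():
                    shutil.copy2(img, dest / img.name)
                    copied += 1
    return copied
```

Deriving the French map from the original one, instead of typing it out again, keeps the two maps consistent by construction.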
We have one last change to make before we push our data to Hugging Face: the images in the French data are 250 x 250 pixels, so we need to resize them to 224 x 224 first.
The script to do this in place is:
from pathlib import Path

from PIL import Image

data_dir = Path("../../data/PureForest_NL")
images = sorted(data_dir.rglob("*.tiff"))
print(f"Found {len(images)} images")

for img_path in images:
    # Resize while the source file is open, then overwrite it once it is closed.
    with Image.open(img_path) as img:
        resized = img.resize((224, 224), Image.BICUBIC)
    resized.save(img_path)
Once the data is prepared in the correct format, loading it into a dataset object is as easy as:
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="../../data/PureForest_NL")
There is a lot going on here, so it's best to check the work properly by running some good tests :)
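One possible test is to count the images per split and label folder on disk and compare the totals with the split sizes listed in the metadata. The sketch below assumes the prepared folder uses the split names `train`, `validation` and `test`. The function names `summarize` and `find_problems` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Split sizes from the dataset metadata and label folders from the French label map.
EXPECTED_SPLIT_SIZES = {"train": 31584, "validation": 6932, "test": 10653}
EXPECTED_LABELS = {"0", "2", "6", "9", "10", "12", "13", "14", "15"}


def summarize(data_dir: Path) -> dict:
    """Return {split: Counter({label_folder: number_of_images})}."""
    summary = {}
    for split_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        counts = Counter()
        for label_dir in split_dir.iterdir():
            if label_dir.is_dir():
                counts[label_dir.name] = sum(1 for f in label_dir.iterdir() if f.is_file())
        summary[split_dir.name] = counts
    return summary


def find_problems(summary: dict, expected_sizes: dict, expected_labels: set) -> list:
    """List human-readable mismatches; an empty list means all checks passed."""
    problems = []
    for split, expected in expected_sizes.items():
        counts = summary.get(split, Counter())
        total = sum(counts.values())
        if total != expected:
            problems.append(f"{split}: found {total} images, expected {expected}")
        unexpected = sorted(set(counts) - set(expected_labels))
        if unexpected:
            problems.append(f"{split}: unexpected label folders {unexpected}")
    return problems
```

Calling `find_problems(summarize(data_dir), EXPECTED_SPLIT_SIZES, EXPECTED_LABELS)` on the prepared folder should then return an empty list if the counts match the metadata.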
I've named the folders with their label numbers from the original label map, but since the French dataset contains only a subset of the classes, the resulting label distribution is not contiguous.
Hugging Face (HF) handles this by renumbering the labels into a contiguous range, in this case 0-8, while storing how it converted between the integer labels and the string class names.
I will force the classifier head to have 16 classes (same as for the Dutch data), so it is important that the labels here use the same indices as in the Dutch label map. Only then are the right weights trained for each class.
Luckily, I can easily access the string representation of HF's label mapping using the `int2str` method and apply it to all the rows using the `map` method.
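A minimal sketch of that conversion, under some assumptions: `HF_NAMES` is copied from the `class_label` names in the metadata (their order is consistent with folder names sorted as strings), and the restored label is written to a column named `DG`, matching the `DG` feature in the metadata. The helper names are made up for illustration, and `add_dg_column` assumes it receives the `DatasetDict` returned by `load_dataset`.

```python
# Class names as stored by HF, in index order (taken from the metadata above).
HF_NAMES = ["0", "10", "12", "13", "14", "15", "2", "6", "9"]


def original_label(hf_label: int, names=HF_NAMES) -> int:
    """Turn HF's contiguous index (0-8) back into the original label number."""
    return int(names[hf_label])


def add_dg_column(dataset_dict):
    """Add a 'DG' column holding the original label to every split.

    Assumes `dataset_dict` is a datasets.DatasetDict whose 'label' feature is a
    ClassLabel; ClassLabel.int2str returns the folder name as a string.
    """
    label_feature = dataset_dict["train"].features["label"]
    return dataset_dict.map(
        lambda example: {"DG": int(label_feature.int2str(example["label"]))}
    )
```

For example, HF index 1 corresponds to the folder `10` (Pseudotsuga) and index 6 to the folder `2` (Pinus).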
Provider:
sri-um



