
sri-um/PureForest-NL

Source: Hugging Face. Updated: 2026-03-26. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sri-um/PureForest-NL
Description:
---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '10'
          '2': '12'
          '3': '13'
          '4': '14'
          '5': '15'
          '6': '2'
          '7': '6'
          '8': '9'
  - name: DG
    dtype: int64
  splits:
  - name: train
    num_bytes: 4359731302
    num_examples: 31584
  - name: validation
    num_bytes: 936742270
    num_examples: 6932
  - name: test
    num_bytes: 1578307060
    num_examples: 10653
  download_size: 6876558734
  dataset_size: 6874780632
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
tags:
- climate
size_categories:
- 10K<n<100K
---

The data source for the images and their labels is: https://huggingface.co/datasets/IGNF/PureForest

The researchers provide images for 18 different tree species. Not all of these 18 species occur in the Netherlands, though. According to the national forest inventory, we only have:

- Abies alba
- Castanea sativa
- Fagus sylvatica
- Larix decidua
- Picea abies
- Pinus sylvestris
- Pseudotsuga menziesii
- Quercus robur
- Quercus petraea
- Quercus rubra
- Robinia pseudoacacia

I manually downloaded the imagery folders for these species from the link above. In total this amounts to roughly 50,000 images (49,169 across the three splits listed in the metadata).

The original file structure looks like this:

    Imagery-Abies-Alba\
        \train\
        \val\
        \test\

For most of the folders I don't need to change this. For the three Quercus species, however, I pull all the train (val, test) images into the train (val, test) split of Quercus robur, and then drop the robur distinction so that the merged class is simply Quercus.
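The Quercus merge described above could be scripted rather than done by hand. The following is a minimal sketch, not the procedure actually used for this dataset: the function name `merge_species` and the folder names `Imagery-Quercus-Robur`, `Imagery-Quercus-Petraea` and `Imagery-Quercus-Rubra` are assumptions modelled on the `Imagery-Abies-Alba` pattern. It returns the number of files moved per split, so the totals can be compared before and after.

```python
import shutil
from pathlib import Path


def merge_species(
    root: Path,
    sources: list[str],
    target: str,
    splits: tuple[str, ...] = ("train", "val", "test"),
) -> dict[str, int]:
    """Move all files from each source species folder into the target species
    folder, split by split. Returns the number of files moved per split."""
    moved: dict[str, int] = {}
    for split in splits:
        dest = root / target / split
        dest.mkdir(parents=True, exist_ok=True)
        count = 0
        for species in sources:
            src = root / species / split
            if not src.is_dir():
                continue  # this species has no folder for this split
            # sorted() materialises the listing before any file is moved
            for path in sorted(src.iterdir()):
                if not path.is_file():
                    continue
                new_path = dest / path.name
                if new_path.exists():
                    # Refuse to overwrite: a name clash would silently lose an image
                    raise FileExistsError(f"{new_path} already exists")
                shutil.move(str(path), str(new_path))
                count += 1
        moved[split] = count
    return moved
```

A call such as `merge_species(Path("data"), ["Imagery-Quercus-Petraea", "Imagery-Quercus-Rubra"], "Imagery-Quercus-Robur")` would then leave all Quercus images in one folder per split, with `data` standing in for wherever the folders were downloaded.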
Manual checking confirms that the total number of images is preserved.

A Hugging Face dataset expects the data laid out as follows:

    /train/label/images

I manually changed the folder structure by copy-pasting into a new, separate folder. Before doing this step, we need to convert the labels from strings to integers, using the same label map as when we made the original Dominant Genus splits in the Dutch dataset. The original map is as follows:

    label_map = {
        'Quercus': 0, 'Alnus': 1, 'Pinus': 2, 'Betula': 3,
        'Acer': 4, 'Fraxinus': 5, 'Fagus': 6, 'Populus': 7,
        'Salix': 8, 'Larix': 9, 'Pseudotsuga': 10, 'Sorbus': 11,
        'Picea': 12, 'Castanea': 13, 'Abies': 14, 'Robinia': 15,
        'Ulmus': 16, 'Carpinus': 16, 'Prunus': 16, 'Tilia': 16,
        'Aesculus': 16, 'Tsuga': 16
    }

In the case of the French dataset we are left with:

    label_map = {
        'Quercus': 0, 'Pinus': 2, 'Fagus': 6, 'Larix': 9,
        'Pseudotsuga': 10, 'Picea': 12, 'Castanea': 13,
        'Abies': 14, 'Robinia': 15
    }

We have one last change to make before we push our data to Hugging Face: the images from the French data come in 250 x 250 pixels, and we need to resize them to 224 x 224 first. The script to do this in place is:

    from pathlib import Path
    from PIL import Image

    data_dir = Path("../../data/PureForest_NL")
    images = list(data_dir.rglob("*.tiff"))
    print(f"Found {len(images)} images")

    for img_path in images:
        # Resize while the source file is open, then save once it is closed
        with Image.open(img_path) as img:
            resized = img.resize((224, 224), Image.BICUBIC)
        # Overwrite the original file in place
        resized.save(img_path)

Once the data is prepared in the correct format, loading it into a dataset object is as easy as pointing `data_dir` at the prepared folder:

    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir="/path/to/PureForest_NL")

There is a lot going on here, so it's best to check the work properly by running some good tests :)

I've named the folders with their label numbers based on the original label map, but since the French dataset only contains a subset of the classes, the resulting label IDs are not contiguous. Hugging Face (HF) takes care of this by renumbering the labels into a contiguous range, in this case 0-8, while storing how it converts between integer labels and the string class names. I will force the classifier head to have 16 classes (same as for the Dutch dataset), so it's important that I use the same original label IDs here. Only that way can I train the right set of weights. Luckily, I can easily access the string representation of HF's label mapping using the `int2str` method and apply it to all rows using the `map` method.
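The conversion back to the original label IDs can be sketched as follows. This is an illustration rather than the exact code used for this dataset: `HF_CLASS_NAMES` is copied from the `class_label` names in the metadata, the local `int2str` function is a stand-in for the `ClassLabel.int2str` method so that the sketch runs without the `datasets` library, and writing the result to a `DG` field is an assumption based on the `DG` column listed in the metadata.

```python
# Class names from the dataset metadata: position = HF class index,
# value = folder name = label ID in the original label map.
HF_CLASS_NAMES = ["0", "10", "12", "13", "14", "15", "2", "6", "9"]


def int2str(index: int) -> str:
    """Stand-in for ClassLabel.int2str: HF class index -> class name."""
    return HF_CLASS_NAMES[index]


def add_original_label(example: dict) -> dict:
    """Add a `DG` field holding the original label ID for one row.
    (Assumption: `DG` is the column meant to carry that ID.)"""
    example["DG"] = int(int2str(example["label"]))
    return example


# Toy rows standing in for dataset rows, one per HF class index
rows = [{"label": i} for i in range(len(HF_CLASS_NAMES))]
restored = [add_original_label(dict(row)) for row in rows]

for row in restored:
    print(row["label"], "->", row["DG"])
```

The names sort as strings, not numbers, which is why HF index 1 corresponds to label 10 and index 6 to label 2. With the real dataset, the stand-in would be replaced by `dataset["train"].features["label"].int2str`, and `add_original_label` passed to `dataset.map`.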
Provider:
sri-um