---
license: cc-by-4.0
task_categories:
- text-to-image
language:
- en
tags:
- climate
size_categories:
- 100K<n<1M
---
# Dataset Card for LAION-EO
## Dataset Description
- **Point of Contact:** Mikolaj Czerkawski, mikolaj.czerkawski@esa.int
### Dataset Summary
This dataset is a subset of LAION-5B consisting of images that are likely to be satellite images. The procedure for acquiring and filtering the data is described in https://arxiv.org/abs/2309.15535.
|Version|Number of Samples|
|:---|:---|
| 0 | 24,933 |
| 1 | 112,985 |
## Dataset Structure
Each version of the dataset contains a .csv metadata file that includes the URLs of the images and can be easily filtered. Note that the linked images may be copyrighted.
### Data Fields
|Field|Description|
|:---|:---|
|**source**| Index of the anchor sample |
|**url**| Link to the image |
|**filename**| Locally saved unique filename |
|**id**| Original ID |
|**fast_similarity**| Fast similarity to the anchor image computed with https://github.com/rom1504/clip-retrieval |
|**caption**| Text caption |
|**image_similarity**| CLIP similarity to the original anchor image |
|**text_similarity**| CLIP similarity to the text "a satellite image" |
|**height**| Height of the image at url |
|**width**| Width of the image at url |
|**lang**| Language predicted using https://huggingface.co/papluca/xlm-roberta-base-language-detection |
|**lang_score**| A measure of confidence in the predicted language |
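
As an illustration of how the fields above can be used for filtering, here is a minimal sketch with pandas. The inline CSV is a made-up stand-in for the real metadata file (its rows, URLs, and values are invented), the helper name `filter_metadata` is hypothetical, and the thresholds are arbitrary choices rather than values recommended by the authors.

```python
import io

import pandas as pd

# Tiny stand-in for the metadata CSV. Every value below is invented for
# illustration; only the column names come from the Data Fields table.
SAMPLE_CSV = """source,url,filename,id,fast_similarity,caption,image_similarity,text_similarity,height,width,lang,lang_score
0,https://example.com/a.jpg,a.jpg,101,0.91,Satellite view of a river delta,0.88,0.27,512,512,en,0.99
0,https://example.com/b.jpg,b.jpg,102,0.85,Vue satellite d'une ville,0.81,0.24,256,256,fr,0.97
1,https://example.com/c.jpg,c.jpg,103,0.70,Aerial photo of farmland,0.66,0.18,1024,768,en,0.95
1,https://example.com/d.jpg,d.jpg,104,0.93,Landsat image of a glacier,0.90,0.29,128,128,en,0.60
"""


def filter_metadata(df, min_text_similarity=0.2, min_side=256,
                    lang="en", min_lang_score=0.9):
    """Keep rows that pass all thresholds (threshold values are assumptions)."""
    mask = (
        (df["text_similarity"] >= min_text_similarity)  # close to "a satellite image"
        & (df["height"] >= min_side)                     # drop very small images
        & (df["width"] >= min_side)
        & (df["lang"] == lang)                           # predicted caption language
        & (df["lang_score"] >= min_lang_score)           # confident language prediction
    )
    return df[mask].reset_index(drop=True)


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
filtered = filter_metadata(df)
print(filtered[["filename", "caption", "text_similarity"]])
```

To apply the same filter to the real metadata, replace the `io.StringIO(SAMPLE_CSV)` argument with the path to the downloaded .csv file for the chosen version.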
### Example Samples

### Data Splits
No official splitting of the dataset is used.
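Users who need a split have to define their own. The sketch below shows one possible approach, not one prescribed by the authors: it assigns each row to a split by hashing its `source` value, on the assumption that samples retrieved from the same anchor may resemble one another and are better kept together. The function name `assign_split`, the 20% validation fraction, and the toy rows are all hypothetical.

```python
import hashlib

import pandas as pd


def assign_split(source, val_fraction=0.2):
    """Map an anchor index to 'train' or 'val' using a stable hash."""
    digest = hashlib.sha256(str(source).encode("utf-8")).hexdigest()
    # The first 8 hex digits give an integer in [0, 2**32); scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_fraction else "train"


# Toy metadata with invented values; only the column names follow the card.
df = pd.DataFrame({
    "source": [0, 0, 1, 1, 2, 3],
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
})

# Rows sharing a source always land in the same split.
df["split"] = df["source"].map(assign_split)
print(df)
```

Because the assignment depends only on the `source` value, the split is reproducible across runs and machines without storing a random seed.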
## Dataset Creation
The creation of the prototype version is described in https://arxiv.org/abs/2309.15535.
### Curation Rationale
Extraction of samples in LAION-5B relevant to Earth observation tasks.
### Source Data
Samples from the existing LAION-5B dataset (https://laion.ai/blog/laion-5b/).
### Discussion of Biases
The dataset only contains satellite images that were openly uploaded online, which introduces a heavy bias towards satellite images used for communicating ideas on the internet.
### Citation Information
The workshop paper presented at the DataComp workshop during ICCV 2023 is available at https://arxiv.org/abs/2309.15535.
```bibtex
@inproceedings{LAION_EO,
title={From LAION-5B to LAION-EO: Filtering Billions of Images Using Anchor Datasets for Satellite Image Extraction},
author={Mikolaj Czerkawski and Alistair Francis},
year={2023},
eprint={2309.15535},
archivePrefix={arXiv},
primaryClass={cs.CV},
booktitle={"Towards the Next Generation of Computer Vision Datasets: DataComp Track" Workshop at the IEEE/CVF International Conference on Computer Vision (ICCV)}
}
```
### License
We distribute the metadata dataset (the parquet files) under the Creative Commons CC BY 4.0 license, which requires attribution but otherwise poses no particular restriction. The images themselves remain under their own copyright.
### Contributions
Design and Curation: Mikolaj Czerkawski