CNES ALCD Open water masks
Mendeley Data | Updated 2024-03-27 | Indexed 2024-06-28
Download link:
https://zenodo.org/record/4657020
Description:
"CNES ALCD Open water masks" is a reference dataset of water masks based on Sentinel-2 (L1C) images. The generation of this dataset was funded by CNES under the SWOT-Downstream programme.
### Generation method
The dataset was generated with the Active Learning for Cloud Detection (ALCD) software developed by CNES/Cesbio, which can generate any kind of reference mask from satellite images. Each reference image takes between 1 and 2 hours of work:
- manually create reference points on the image (water, land, cloud, snow...);
- run the training (based on the OTB Random Forest) and the prediction with ALCD;
- add new reference points in the most problematic areas;
- repeat training and prediction as many times as necessary (usually 3-5 iterations);
- finally, manually correct the persistent errors.
### Dataset format (raw masks)
The dataset contains 26 files (scenes) at 10 m resolution, each covering 110 km x 110 km. The pixels of the scene files (GeoTIFF) are coded as follows:
- 0 = non-water observation (e.g. land, snow)
- 1 = open water observation
- 255 = no data (e.g. clouds)
File names follow the format `T{tile}_{YYYYMMDD}_{site}_{season}.tif`, where:
- tile = reference Sentinel-2 tile (Cesbio post)
- YYYYMMDD = date of the Sentinel-2 acquisition
- site = name of the site
- season = summer or winter
Examples:
`T30TXQ_20180201_Bordeaux_winter.tif`
`T30UXU_20180708_Bretagne_summer.tif`
### Dataset format (inland masks)
A version of this dataset without coastal/ocean waters, called "inland masks", is intended to characterize inland waters only. It was produced using the coastlines of the GSHHG layers (https://www.soest.hawaii.edu/pwessel/gshhg/) at H level, eroded by 400 m towards the continent. As a result, every pixel seaward of the GSHHG coastline, and every land pixel within 400 m of it, is set to "no data" (value = 255). Pixel values and file naming follow the same rules as in the "raw masks" version.
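The file naming rule and pixel codes described above, which apply to both the raw and the inland masks, can be handled with a few lines of Python. The following is a minimal sketch using only the standard library, not part of the dataset tooling. The function names `parse_mask_name` and `water_fraction` are hypothetical, and the tile pattern (two digits followed by three letters) is an assumption based on the two example file names. Reading the GeoTIFF itself would need a raster library and is left out; the pixel values are assumed to be already available as a flat sequence of integers.

```python
import re
from collections import Counter

# Pixel codes as stated in the dataset description.
NON_WATER, WATER, NO_DATA = 0, 1, 255

# Pattern for T{tile}_{YYYYMMDD}_{site}_{season}.tif
# Assumption: the tile is two digits plus three capital letters (e.g. 30TXQ).
NAME_RE = re.compile(
    r"^T(?P<tile>[0-9]{2}[A-Z]{3})"
    r"_(?P<date>[0-9]{8})"
    r"_(?P<site>.+)"
    r"_(?P<season>summer|winter)\.tif$"
)


def parse_mask_name(filename):
    """Split a mask file name into tile, date, site and season."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected mask file name: {filename!r}")
    return match.groupdict()


def water_fraction(pixels):
    """Share of valid pixels (0 or 1) that are open water.

    Pixels coded 255 (no data) are ignored. Returns None when the
    sequence contains no valid pixel at all.
    """
    counts = Counter(pixels)
    valid = counts[NON_WATER] + counts[WATER]
    return counts[WATER] / valid if valid else None
```

For instance, `parse_mask_name("T30UXU_20180708_Bretagne_summer.tif")["site"]` gives `"Bretagne"`. Excluding the 255 pixels from the denominator matters more for the inland masks, where the whole ocean side of a scene is coded as no data.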
### Generation process (white book)
A short description of an efficient use of Active Learning for Cloud Detection (ALCD) for surface water detection and extraction.
#### Input data
Sentinel-2 bands and derived layers:
- B2: Blue
- B3: Green
- B4: Red
- B8: NIR
- B11: SWIR1
- B12: SWIR2
- MNDWI (Modified Normalized Difference Water Index)
- Slope: derived from SRTM.
#### Image visual analysis
It is recommended to inspect the image visually to identify the different types of water bodies and land covers present in the scene.
#### Choosing samples
So far, only the `water_1pixel.shp` and `land_1pixel.shp` layers have been used to hold the samples. More precisely, they identify the individual pixels that will be used to train the algorithm. Although these layers require a larger number of sampling points, they make the selection more precise and give greater control over the input.
During the first iteration:
- Choose around 15 sampling points representing the land cover diversity present in the scene (different types of water, turbidity, etc.).
- Prefer "pure" pixels.
#### Further recommendations
- Try not to exceed 7 iterations.
  - More iterations mean more samples, which increases the chance of introducing false samples and thus compromising the classification quality.
  - The aim is to find the best compromise between omission and commission.
  - The goal is not a perfect classification but the classification that minimizes the post-processing time.
- Adding new samples may degrade the classification. Using a dark shadow as "not water", even though the pixel is spectrally very close to a "water" pixel, will disturb the model and lead to a classification of lower quality than the previous iteration. If the result of an iteration is significantly worse than the previous one, it may be wiser to restart from that previous iteration rather than continue with the problematic added samples.
- As said previously, avoid adding any type of shadow to the "not water" class. Such pixels reduce the model's efficiency at extracting water. During production, it is easier to stick to one type of error, usually commissions, and to make sure that all water bodies are correctly classified. Falsely detected shadows can be corrected easily during the post-processing step.
- More generally, it is easier to deal with commissions than with omissions.
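The recommendations above weigh omission against commission without saying how either is measured. The following minimal sketch, which is an illustration and not part of ALCD, assumes the usual remote-sensing definitions for the water class: omission is the share of reference water pixels that the classification missed, and commission is the share of predicted water pixels that are not water in the reference. The function name `water_errors` is hypothetical, and both masks are assumed to be flat sequences using the dataset's pixel codes (0, 1, 255).

```python
# Pixel codes as stated in the dataset description.
NON_WATER, WATER, NO_DATA = 0, 1, 255


def water_errors(predicted, reference):
    """Return (omission_rate, commission_rate) for the water class.

    omission   = missed water pixels / reference water pixels
    commission = falsely detected water pixels / predicted water pixels
    Pixels coded 255 in either mask are skipped. A rate is None when
    its denominator is zero.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of pixels")

    hits = misses = false_alarms = 0
    for pred, ref in zip(predicted, reference):
        if pred == NO_DATA or ref == NO_DATA:
            continue
        if ref == WATER and pred == WATER:
            hits += 1
        elif ref == WATER:
            misses += 1          # water in the reference, not detected
        elif pred == WATER:
            false_alarms += 1    # detected as water, e.g. a dark shadow

    ref_water = hits + misses
    pred_water = hits + false_alarms
    omission = misses / ref_water if ref_water else None
    commission = false_alarms / pred_water if pred_water else None
    return omission, commission
```

Under these definitions, the advice to prefer commissions amounts to keeping the omission rate close to zero while accepting a higher commission rate, since falsely detected shadows are removed during post-processing.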
Created:
2023-06-28
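Dataset introduction
Background overview
This is a reference dataset of open water masks based on Sentinel-2 images. It contains 26 scenes at 10 m resolution, each covering 110 km x 110 km, and provides two mask formats, raw and inland. The dataset was generated with the ALCD software, which combines manual annotation with machine learning, and is suited to research on water body detection and extraction.
The above content was collected and summarized by 遇见数据集.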



