CC-Foundation | Remote Sensing Image Dataset | Change Detection Dataset

Source: arXiv. Updated 2024-11-18. Indexed 2024-11-20.
Remote sensing imagery
Change detection
Download link:
https://github.com/Meize0729/CCExpert
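Resource overview:
CC-Foundation is a high-quality, diverse dataset created by Beihang University for remote sensing image change captioning. It contains 200,000 pairs of multi-temporal remote sensing images and 1.2 million natural-language descriptions, covering a wide range of scenes and change types. It was built by refining several open-source datasets, extending change detection datasets, and adding annotations from domain experts, which keeps the data diverse and challenging. The dataset is mainly applied to environmental monitoring and disaster management, where precise and detailed change descriptions help in monitoring and understanding dynamic changes on the Earth's surface.
Provider:
Beihang University
Created:
2024-11-18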
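Summary of original information

CCExpert Dataset Overview

Dataset Introduction

The CCExpert dataset is a large-scale dataset for remote sensing change captioning, named the "CC-Foundation Dataset". It is intended to strengthen multimodal large language models (MLLMs) on the remote sensing change captioning task.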

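Dataset Download

Part of the dataset has been open-sourced and can be downloaded through the repository listed under "Download link" above.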

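Data Processing Steps

  1. Download and extract the dataset: download the dataset from the link above and extract the archive.
  2. Generate the JSON files: run the add_sbsolute_path_to_all_json.py script to create a subfolder containing all JSON data files, converting image paths from relative to absolute.
  3. Update the YAML file: run the add_CC_Foundation_local_absolute_path_to_yaml.py script to add the absolute path of CC-Foundation to the template YAML file so that the corresponding JSON annotation files can be found.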
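
The path conversion in step 2 can be illustrated with a minimal sketch. This is not the repository's script: it assumes each JSON file holds a list of records with an "image" field (a string or a list of relative paths), and the function names and the "json_abs" output folder are hypothetical.

```python
import json
from pathlib import Path


def absolutize(records, dataset_root, key="image"):
    """Return a copy of records with relative paths under `key` made absolute."""
    root = Path(dataset_root).resolve()

    def fix(p):
        path = Path(p)
        # Paths that are already absolute are left untouched.
        return str(path if path.is_absolute() else root / path)

    out = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input list is not mutated
        value = rec.get(key)
        if isinstance(value, str):
            rec[key] = fix(value)
        elif isinstance(value, list):
            rec[key] = [fix(v) for v in value]
        out.append(rec)
    return out


def convert_folder(dataset_root, out_subdir="json_abs"):
    """Rewrite every *.json under dataset_root into dataset_root/out_subdir.

    Assumption: file names are unique across subfolders; files with the
    same name would overwrite each other in the output folder.
    """
    root = Path(dataset_root).resolve()
    out_dir = root / out_subdir
    out_dir.mkdir(exist_ok=True)
    written = []
    for src in sorted(root.rglob("*.json")):
        if out_dir in src.parents:
            continue  # skip files produced by an earlier run
        records = json.loads(src.read_text(encoding="utf-8"))
        dst = out_dir / src.name
        dst.write_text(
            json.dumps(absolutize(records, root), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        written.append(dst)
    return written
```

Calling convert_folder on the extracted dataset directory would leave the original files in place and write the converted copies to the new subfolder, which keeps the step repeatable.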

Dataset License

When using this dataset, comply with the license of each constituent dataset.

AI-compiled summary
Dataset introduction
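Construction method
The construction of CC-Foundation reflects a focus on diverse, high-quality data. First, it integrates several open-source change captioning datasets, such as CLEVR-Change, ImageEdit-Request, Spot-the-diff, stvchrono, Vismin, and LEVIR-CC, whose annotations were further refined with a large language model (such as GPT-4o) to improve accuracy and variety of expression. Second, starting from change detection datasets (such as ChangeSim and SYSU-CD), change masks are used as prompts to generate detailed change descriptions through multi-turn dialogue. Finally, the SECOND dataset, which contains image pairs with many kinds of semantic change, is added with fine-grained annotations from domain experts to increase diversity and difficulty.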
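Features
The defining features of CC-Foundation are its scale, diversity, and quality. It contains 200,000 image pairs and 1.2 million annotations (about six per pair on average), spanning domains from natural images to remote sensing images. By integrating and refining multiple open-source datasets and combining large language models with expert annotation, it reaches a large scale while keeping annotation quality and diversity high, which makes it well suited to supporting remote sensing image change captioning.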
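Usage
CC-Foundation is mainly used for continued pre-training of multimodal large language models (MLLMs) on remote sensing image change captioning. The model is trained in multiple stages so that the difference-aware module is deeply integrated with the pre-trained MLLM. In the first stage, only the difference capture and injection module is trained, while the image encoder and the large language model are frozen. In the second stage, all model parameters are unfrozen to improve the language model's understanding of image features and its text generation. In the third stage, the model is trained on domain-specific data so that it performs as well as possible in practical applications. With this three-stage strategy, training on CC-Foundation can substantially improve performance on remote sensing image change captioning.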
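
The freeze and unfreeze schedule described above can be written down as a small lookup, shown here as a sketch rather than the authors' training code. The group names (image_encoder, difference_module, language_model) are hypothetical identifiers, and the stage 3 row is an assumption: the description says only that stage 3 trains on domain-specific data, not which parameters are frozen.

```python
# Which parameter groups are trainable in each stage.
# Group names are hypothetical; the text names the components, not their identifiers.
STAGES = {
    1: {"image_encoder": False, "difference_module": True, "language_model": False},
    2: {"image_encoder": True, "difference_module": True, "language_model": True},
    # Assumption: nothing is re-frozen for domain-specific training.
    3: {"image_encoder": True, "difference_module": True, "language_model": True},
}


def plan_for(param_names, stage):
    """Map each parameter name to True (trainable) or False (frozen).

    A parameter belongs to the group named by the part of its name
    before the first dot, e.g. "language_model.layers.0.weight".
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    flags = STAGES[stage]
    plan = {}
    for name in param_names:
        group = name.split(".", 1)[0]
        if group not in flags:
            raise KeyError(f"parameter {name!r} is not in a known group")
        plan[name] = flags[group]
    return plan
```

In a PyTorch setup, one way to apply such a plan would be to loop over model.named_parameters() and call requires_grad_(plan[name]) on each parameter before building the optimizer for that stage.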
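Background and challenges
Background overview
CC-Foundation was created by a research team at Beihang University to advance remote sensing image change captioning (RSICC). It contains 200,000 image pairs and 1.2 million descriptions covering many kinds of land-surface change, such as buildings appearing or disappearing. Its core research question is how to use the long-sequence understanding and reasoning abilities of multimodal large language models (MLLMs) to generate natural-language descriptions that detail the changes between multi-temporal remote sensing images. Beyond supplying rich data for the RSICC task, the dataset has substantially improved model performance in this area and has advanced the use of remote sensing change analysis in environmental monitoring and disaster management.
Current challenges
Building CC-Foundation involved several challenges. First, data from multiple open-source datasets had to be integrated and refined while keeping quality and diversity high. Second, when GPT-4o is used to generate change descriptions, the generated text must be accurate and match the actual changes, which is technically difficult. Third, construction requires substantial compute and time, so the dataset had to be built and refined efficiently under limited resources. Finally, an effective training strategy is needed so that models make full use of the information in the dataset and perform better in practical applications.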
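Common scenarios
Typical use cases
The typical use of CC-Foundation in remote sensing image change captioning is generating natural-language descriptions of the differences between multi-temporal remote sensing images, detailing the category, location, and dynamics (such as appearance or disappearance) of the changed objects. By providing high-quality, diverse image pairs with matching descriptions, the dataset supports pre-training and fine-tuning of multimodal large language models (MLLMs) on this task, improving their long-sequence understanding and reasoning.
Academic problems addressed
CC-Foundation addresses the lack of comprehensive data for multimodal large language models on remote sensing image change captioning. With 200,000 image pairs and 1.2 million descriptions, it strengthens the model's foundational capability and avoids the damage to the model's intrinsic knowledge and the performance limits that insufficient data can cause. This improves generalization and gives academic research in related fields a rich data resource.
Derived related work
Work derived from CC-Foundation includes, among others, improved architecture designs for multimodal large language models, optimization of the difference-aware integration module, and research on the three-stage progressive training strategy. These efforts improve model performance on remote sensing image change captioning and also promote multimodal learning in other cross-modal tasks, such as image captioning and visual question answering.
The above content was collected and summarized by AI.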