eve-esa/corpus

Hugging Face | Updated: 2026-04-16 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/eve-esa/corpus
Description:

EVE-Corpus is a large-scale, cleaned, and anonymized text corpus of Earth Observation (EO) documents formatted in Markdown. It is designed to support EO research and the training of domain-specific large language models (LLMs). The corpus contains 270k Markdown files with a total size of 15 GB, sourced from peer-reviewed journals, EO websites, and scientific repositories. It covers topics such as satellite missions, remote sensing techniques, climate and atmospheric science, land, ocean, and cryosphere monitoring, environmental modelling, and geospatial analytics. All documents have been deduplicated, cleaned, and anonymized to remove personal information. The dataset includes 4.2B tokens from more than 30 EO-related sources.
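
The headline figures in the description (270k files, 15 GB, 4.2B tokens) imply some per-document averages that are useful when budgeting storage or context length. The sketch below derives them. It assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the rounded figures as exact, so the results are only approximate; the function name `corpus_averages` is hypothetical and not part of any library.

```python
# Rounded figures quoted in the dataset description.
NUM_FILES = 270_000
TOTAL_BYTES = 15 * 10**9        # assumption: decimal GB (1 GB = 10^9 bytes)
TOTAL_TOKENS = 4_200_000_000


def corpus_averages(num_files: int, total_bytes: int, total_tokens: int) -> dict:
    """Return rough per-file and per-token averages for a text corpus."""
    return {
        "bytes_per_file": total_bytes / num_files,
        "tokens_per_file": total_tokens / num_files,
        "bytes_per_token": total_bytes / total_tokens,
    }


averages = corpus_averages(NUM_FILES, TOTAL_BYTES, TOTAL_TOKENS)
for name, value in averages.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the average document is roughly 56 KB and about 15.6k tokens long, at around 3.6 bytes per token. The description does not say which tokenizer produced the token count, so the last ratio is indicative only.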
Provider:
eve-esa