eve-esa/corpus
Source: Hugging Face · Updated 2026-04-16 · Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/eve-esa/corpus
Description:
EVE-Corpus is a large-scale, cleaned, and anonymized text corpus of Earth Observation (EO) documents formatted in Markdown. It is designed to support research in EO and domain-specific LLM training. The corpus contains 270k Markdown files with a total size of 15 GB, sourced from peer-reviewed journals, EO websites and scientific repositories. The dataset covers topics such as satellite missions, remote sensing techniques, climate and atmospheric science, land, ocean, and cryosphere monitoring, environmental modelling and geospatial analytics. All documents have been deduplicated, cleaned, and anonymized to remove personal information. The dataset includes 4.2B tokens from more than 30 EO-related sources.
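To put the headline figures in perspective, the short sketch below derives rough per-file and per-token averages from the totals quoted in the description. It is only back-of-the-envelope arithmetic: the inputs are the rounded published numbers, "15 GB" is assumed to mean decimal gigabytes, and the helper name `corpus_averages` is made up for this illustration.

```python
# Figures quoted in the description (rounded as published).
NUM_FILES = 270_000             # "270k Markdown files"
TOTAL_BYTES = 15 * 10**9        # "15 GB", assumed to be decimal gigabytes
TOTAL_TOKENS = 4_200_000_000    # "4.2B tokens"


def corpus_averages(num_files, total_bytes, total_tokens):
    """Return rough averages derived from the headline totals."""
    return {
        "tokens_per_file": total_tokens / num_files,
        "kilobytes_per_file": total_bytes / num_files / 1000,
        "bytes_per_token": total_bytes / total_tokens,
    }


averages = corpus_averages(NUM_FILES, TOTAL_BYTES, TOTAL_TOKENS)
for name, value in averages.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions, a typical file comes out at roughly 15.6k tokens and about 56 KB, or around 3.6 bytes per token. Because the inputs are rounded, these are order-of-magnitude estimates rather than measured statistics.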
Provider:
eve-esa



