GeoGPT-Research-Project/GeoGPT_Training_Data_from_Geoscience_Subset_of_CommonCrawl

Name: GeoGPT-Research-Project/GeoGPT_Training_Data_from_Geoscience_Subset_of_CommonCrawl
Creator: GeoGPT-Research-Project
Published: 2025-04-22 06:54:38
License: No description available

Source: Hugging Face. Updated: 2025-04-22. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/GeoGPT-Research-Project/GeoGPT_Training_Data_from_Geoscience_Subset_of_CommonCrawl

Resource overview:

This dataset is a geoscience-specific subset extracted from CommonCrawl for training GeoGPT. CommonCrawl is a free, open repository of web crawl data containing more than 250 billion web pages and is widely used to train large language models. The dataset contains 12,414,268 samples, each carrying metadata that traces it back to its source in CommonCrawl. It is intended primarily as a training corpus for large language models in support of geoscience research, and is designed for non-commercial research and educational use.
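Because each sample is described as traceable to its CommonCrawl source, the following is a minimal sketch of how such a trace might be turned into a fetch for a single archived record. The field names `warc_filename`, `offset`, and `length` are assumptions made for illustration; this listing does not state the dataset's actual metadata columns, so check the dataset card before relying on them. The sketch only builds the URL and the HTTP `Range` header and performs no network access.

```python
from dataclasses import dataclass

# Public HTTP endpoint that serves CommonCrawl archive files.
COMMONCRAWL_BASE_URL = "https://data.commoncrawl.org/"


@dataclass(frozen=True)
class CrawlPointer:
    """Hypothetical provenance fields; the dataset's real column names may differ."""
    warc_filename: str  # path of the WARC file within CommonCrawl
    offset: int         # byte offset where the record starts
    length: int         # size of the record in bytes


def build_range_request(pointer: CrawlPointer) -> tuple[str, dict[str, str]]:
    """Return the URL and HTTP headers that would fetch exactly one WARC record."""
    if pointer.offset < 0 or pointer.length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    url = COMMONCRAWL_BASE_URL + pointer.warc_filename.lstrip("/")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    last_byte = pointer.offset + pointer.length - 1
    return url, {"Range": f"bytes={pointer.offset}-{last_byte}"}


if __name__ == "__main__":
    # Made-up values for illustration only; not taken from the dataset.
    example = CrawlPointer("crawl-data/EXAMPLE/example.warc.gz", 1000, 500)
    url, headers = build_range_request(example)
    print(url)
    print(headers)
```

The returned URL and headers can be passed to any HTTP client; the response body would be a single gzip-compressed WARC record if the pointer values are valid.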

Provided by:

GeoGPT-Research-Project
