Meriem-DH/marine-dataset-cpt
Source: Hugging Face. Updated: 2026-04-09. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Meriem-DH/marine-dataset-cpt
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-generation
language:
- en
tags:
- ocean
- marine
- biology
pretty_name: Marine dataset
---
# Marine Biology - Continued Pre-Training Dataset
## Description
A corpus of Wikipedia articles covering marine biology and related domains, intended for continued pre-training (CPT) of language models on marine science knowledge.
## Content
Plain text articles scraped from Wikipedia across the following categories:
- Marine Biology
- Marine Ecology
- Ocean
- Coral Reefs
- Marine Mammals
- Oceanography
- Fisheries Science
- Marine Conservation
## Dataset Structure
| Split | Rows | Columns |
|-------|------|---------|
| train | 419 | title, text |
| test | 105 | title, text |
## Fields
- `title`: Wikipedia article title
- `text`: Clean plain text content of the article
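The split sizes and fields above can be checked after download. The following is a minimal sketch, assuming the dataset is loaded with the Hugging Face `datasets` library under the repository ID shown at the top of this card. The helper names `load_marine_dataset` and `check_split` are hypothetical and not part of the dataset.

```python
# Expected schema and row counts, copied from the tables in this card.
EXPECTED_COLUMNS = ("title", "text")
EXPECTED_ROWS = {"train": 419, "test": 105}


def load_marine_dataset(repo_id="Meriem-DH/marine-dataset-cpt"):
    """Download all splits from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the checker below works without the library installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id)


def check_split(name, rows):
    """Return a list of problems found in one split; an empty list means OK.

    `rows` is any sized iterable of dict-like records, e.g. a `datasets.Dataset`
    or a plain list of dicts.
    """
    problems = []
    expected = EXPECTED_ROWS.get(name)
    if expected is not None and len(rows) != expected:
        problems.append(f"{name}: expected {expected} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        for col in EXPECTED_COLUMNS:
            value = row.get(col)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{name}[{i}]: missing or empty '{col}'")
    return problems


# Typical use (requires network access, so it is left commented out):
# ds = load_marine_dataset()
# for split_name in ds:
#     print(split_name, check_split(split_name, ds[split_name]) or "ok")
```

The checker accepts plain lists of dicts as well, so it can be run on a small local sample before downloading anything.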
## Construction
1. Article links scraped via Wikipedia Category API
2. Content fetched using Wikipedia API with `explaintext=True`
3. Text cleaned (whitespace normalization)
4. Split: 80% train / 20% test (seed=42)
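The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the author's actual script. The request parameters are the standard MediaWiki API ones (`list=categorymembers`, `prop=extracts` with `explaintext`). The exact cleaning rules and the shuffling method are guesses, and all function names are hypothetical.

```python
import math
import random
import re


def category_members_params(category, limit=500):
    """Step 1: query parameters for listing the pages in one Wikipedia category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",  # skip subcategories and files
        "cmlimit": limit,
        "format": "json",
    }


def extract_params(title):
    """Step 2: query parameters for fetching one article as plain text."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "titles": title,
        "format": "json",
    }


def normalize_whitespace(text):
    """Step 3 (assumed rules): collapse spaces and limit blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\u00a0]+", " ", text)  # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep at most one blank line
    return text.strip()


def split_rows(rows, test_fraction=0.2, seed=42):
    """Step 4: seeded shuffle, then an 80/20 split returned as (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# The parameter dicts would be sent to https://en.wikipedia.org/w/api.php,
# e.g. with requests.get(url, params=...). With 524 articles, split_rows
# yields 419 training rows and 105 test rows, matching the table in this card.
```

Because the shuffle here uses Python's `random` module, it reproduces the split sizes but not necessarily the same assignment of articles to splits as the published dataset.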
## Intended Use
Intended for the continued pre-training phase that precedes instruction fine-tuning.
The model is trained on the raw article text so that it absorbs marine-domain knowledge
before learning to answer questions.
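One common way to feed raw text to a causal language model is to tokenize each article, join the articles with an end-of-sequence token, and cut the resulting stream into fixed-size blocks. The sketch below shows only that packing step. This card does not prescribe a training setup, so the function name `pack_into_blocks`, the block size, and the EOS token ID are illustrative assumptions, and the tokenizer is assumed to be supplied separately.

```python
def pack_into_blocks(tokenized_docs, block_size, eos_id):
    """Concatenate tokenized articles (EOS-separated) into fixed-size blocks.

    tokenized_docs: iterable of lists of token IDs, one list per article.
    Returns a list of blocks, each exactly `block_size` tokens long; the
    incomplete tail of the stream is dropped.
    """
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # mark the article boundary
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Toy example with made-up token IDs and eos_id=0:
blocks = pack_into_blocks([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

In practice the token IDs would come from the tokenizer of the model being trained, applied to the `text` field of the train split.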
## License
Wikipedia content is licensed under CC BY-SA 4.0.
Provider:
Meriem-DH



