
Meriem-DH/marine-dataset-cpt

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Meriem-DH/marine-dataset-cpt
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- ocean
- marine
- biology
pretty_name: Marine dataset
---

# Marine Biology - Continued Pre-Training Dataset

## Description

A corpus of Wikipedia articles covering marine biology and related domains, intended for continued pre-training (CPT) of language models on marine science knowledge.

## Content

Plain text articles scraped from Wikipedia across the following categories:

- Marine Biology
- Marine Ecology
- Ocean
- Coral Reefs
- Marine Mammals
- Oceanography
- Fisheries Science
- Marine Conservation

## Dataset Structure

| Split | Rows | Columns |
|-------|------|---------|
| train | 419 | title, text |
| test | 105 | title, text |

## Fields

- `title`: Wikipedia article title
- `text`: Clean plain text content of the article

## Construction

1. Article links scraped via Wikipedia Category API
2. Content fetched using Wikipedia API with `explaintext=True`
3. Text cleaned (whitespace normalization)
4. Split: 80% train / 20% test (seed=42)

## Intended Use

Continued pre-training phase before instruction fine-tuning. Feed raw text to the model so it absorbs marine domain knowledge before learning to answer questions.

## License

Wikipedia content is licensed under CC BY-SA 4.0.
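
The cleaning and splitting steps listed under Construction can be illustrated with a minimal sketch. The dataset card does not publish the actual build script, so everything here is an assumption: the helper names `clean_text` and `split_rows` are hypothetical, the whitespace rules are one plausible reading of "whitespace normalization", and the shuffle uses Python's standard `random` module rather than whatever tool the author used. Only the 80/20 ratio and `seed=42` come from the card.

```python
import random
import re


def clean_text(raw: str) -> str:
    """Hypothetical whitespace normalization (the card does not specify the rules)."""
    text = raw.replace("\r\n", "\n")           # unify line endings
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)       # drop spaces hugging a newline
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep at most one blank line
    return text.strip()


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then cut off the test portion.

    The ratio and seed follow the card; the shuffle method is an assumption,
    so the resulting row assignment will not match the published splits.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Placeholder rows with the same two fields as the dataset.
    rows = [
        {"title": f"Article {i}", "text": clean_text(f"  Body   {i} \n\n\n\n end ")}
        for i in range(524)
    ]
    train, test = split_rows(rows)
    print(len(train), len(test))
```

With 524 rows in total (419 + 105 in the table above), rounding 20% gives 105 test rows and leaves 419 for training, which is consistent with the published split sizes.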
Provider:
Meriem-DH