Meriem-DH/marine-dataset-qa
收藏Hugging Face2026-04-09 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/Meriem-DH/marine-dataset-qa
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: instruction
dtype: string
- name: response
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 73357
num_examples: 439
- name: test
num_bytes: 18458
num_examples: 109
download_size: 57119
dataset_size: 91815
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: test
path: data/test-*
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- ocean
- marine_biology
- biology
- climate
pretty_name: Marine Dataset Q/A
---
# Marine Biology - Instruction Fine-Tuning Dataset (Q&A)
## Description
A question-answer dataset on marine biology topics, generated from Wikipedia
articles using the Groq API (LLaMA 3.3 70B). Intended for supervised
fine-tuning (SFT) of language models to answer marine science questions.
## Content
Q&A pairs generated from Wikipedia articles across the following categories:
- Marine Biology
- Marine Ecology
- Ocean
- Coral Reefs
- Marine Mammals
- Oceanography
- Fisheries Science
- Marine Conservation
## Dataset Structure
| Split | Rows | Columns |
|-------|------|---------|
| train | 439 | instruction, response, source |
| test | 109 | instruction, response, source |
## Fields
- `instruction`: Question about a marine biology topic
- `response`: Answer generated from the Wikipedia article
- `source`: Title of the Wikipedia article used to generate the pair
## Construction
1. Article links scraped via Wikipedia Category API
2. Content fetched using Wikipedia API with `explaintext=True`
3. Q&A pairs generated via Groq API (llama-3.3-70b-versatile, n=3 per article)
4. Split: 80% train / 20% test (seed=42)
5. Articles used for Q&A are distinct from those used for CPT (no overlap)
## Intended Use
Instruction fine-tuning after continued pre-training on the CPT dataset.
Teaches the model to respond in a chatbot format on marine biology topics.
## Related Dataset
- [marine-biology-cpt](https://huggingface.co/datasets/Meriem-DH/marine-biology-cpt)
## License
Q&A pairs generated from Wikipedia content licensed under CC BY-SA 4.0.
Generated content by LLaMA 3.3 70B via Groq API.
提供机构:
Meriem-DH



