five

Meriem-DH/marine-dataset-qa

收藏
Hugging Face2026-04-09 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/Meriem-DH/marine-dataset-qa
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: instruction dtype: string - name: response dtype: string - name: source dtype: string splits: - name: train num_bytes: 73357 num_examples: 439 - name: test num_bytes: 18458 num_examples: 109 download_size: 57119 dataset_size: 91815 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* license: cc-by-4.0 task_categories: - text-generation language: - en tags: - ocean - marine_biology - biology - climate pretty_name: Marine Dataset Q/A --- # Marine Biology - Instruction Fine-Tuning Dataset (Q&A) ## Description A question-answer dataset on marine biology topics, generated from Wikipedia articles using the Groq API (LLaMA 3.3 70B). Intended for supervised fine-tuning (SFT) of language models to answer marine science questions. ## Content Q&A pairs generated from Wikipedia articles across the following categories: - Marine Biology - Marine Ecology - Ocean - Coral Reefs - Marine Mammals - Oceanography - Fisheries Science - Marine Conservation ## Dataset Structure | Split | Rows | Columns | |-------|------|---------| | train | 439 | instruction, response, source | | test | 109 | instruction, response, source | ## Fields - `instruction`: Question about a marine biology topic - `response`: Answer generated from the Wikipedia article - `source`: Title of the Wikipedia article used to generate the pair ## Construction 1. Article links scraped via Wikipedia Category API 2. Content fetched using Wikipedia API with `explaintext=True` 3. Q&A pairs generated via Groq API (llama-3.3-70b-versatile, n=3 per article) 4. Split: 80% train / 20% test (seed=42) 5. Articles used for Q&A are distinct from those used for CPT (no overlap) ## Intended Use Instruction fine-tuning after continued pre-training on the CPT dataset. Teaches the model to respond in a chatbot format on marine biology topics. ## Related Dataset - [marine-biology-cpt](https://huggingface.co/datasets/Meriem-DH/marine-biology-cpt) ## License Q&A pairs generated from Wikipedia content licensed under CC BY-SA 4.0. Generated content by LLaMA 3.3 70B via Groq API.
提供机构:
Meriem-DH
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作