
Tamalmajumder/Spectra_training_data

Source: Hugging Face | Updated: 2026-04-18 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Tamalmajumder/Spectra_training_data
Resource description:
---
license: mit
task_categories:
- reinforcement-learning
- visual-question-answering
language:
- en
tags:
- math
- physics
- biology
- chemistry
- geography
- open-world
size_categories:
- 1K<n<10K
---

# Spectra: Multimodal VQA Training Data (Science + Open-World)

## Dataset Summary

This is a multimodal question-answering training dataset for vision-language models (VLMs), sampled from four source datasets:

- **TQA**: Graduate-level science questions
- **OKVQA**: Open-world knowledge questions
- **ScienceQA**: Graduate-level physics, mathematics, and geography
- **AI2D**: Science questions across PCMB (Physics, Chemistry, Math, Biology)

The goal is to provide balanced, contamination-controlled training data for improving reasoning and knowledge generalization across both **scientific** and **open-world** domains.

## Dataset Composition

| Dataset   | Subjects                 | Train | Test | Validation |
|-----------|--------------------------|-------|------|------------|
| TQA       | Graduate-level science   | 1000  | 200  | 100        |
| OKVQA     | Open-world knowledge     | 1000  | 200  | 100        |
| ScienceQA | Physics, Math, Geography | 1000  | 200  | 100        |
| AI2D      | Science (PCMB)           | 1000  | 200  | 100        |
| **TOTAL** | Science + Open-world     | 4000  | 800  | 400        |

## Data Collection & Sampling Strategy

- For each source dataset, a subset was **randomly sampled**:
  - **1000 training samples**
  - **200 test samples**
  - **100 validation samples**
- A **uniqueness enforcement loop** was applied during sampling. It:
  - Ensures no duplicate questions across splits
  - Prevents overlap between the train, validation, and test sets
  - Minimizes the risk of **data contamination**
- Sampling was performed independently per dataset while maintaining global uniqueness constraints.

---

## License

Please refer to the original datasets (TQA, OKVQA, ScienceQA, AI2D) for licensing terms.

## Citation

If you use this dataset, please cite the original sources accordingly.
Dataset Links:

- OKVQA Dataset: https://huggingface.co/datasets/HuggingFaceM4/A-OKVQA
- SCIENCE-QA Dataset: https://huggingface.co/datasets/derek-thomas/ScienceQA
- TQA Dataset: https://huggingface.co/datasets/yyyyifan/TQA
- AI2D Dataset: https://huggingface.co/datasets/lmms-lab/ai2d
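
The README describes a uniqueness enforcement loop but does not include the sampling code. The following is a minimal sketch of how such a loop might look, not the author's implementation. It assumes each source is a list of question strings and that two questions count as duplicates when they match after lowercasing and collapsing whitespace. The names `normalize` and `sample_unique_splits` are hypothetical.

```python
import random


def normalize(question: str) -> str:
    """Assumed duplicate key: lowercase with whitespace collapsed."""
    return " ".join(question.lower().split())


def sample_unique_splits(sources, sizes, seed=0):
    """Sample disjoint splits per source while enforcing global uniqueness.

    sources: dict mapping source name -> list of question strings
    sizes:   dict mapping split name -> samples to draw per source
    Returns a dict: source name -> split name -> list of questions.
    """
    rng = random.Random(seed)
    seen = set()  # keys already assigned to any split of any source
    result = {}
    for name, questions in sources.items():
        pool = list(questions)
        rng.shuffle(pool)  # random sampling, done independently per source
        candidates = iter(pool)
        splits = {}
        for split, target in sizes.items():
            picked = []
            # Uniqueness loop: draw until the split is full, skipping
            # any question that was already used elsewhere.
            while len(picked) < target:
                try:
                    question = next(candidates)
                except StopIteration:
                    raise ValueError(
                        f"{name}: not enough unique questions for {split}"
                    ) from None
                key = normalize(question)
                if key in seen:
                    continue
                seen.add(key)
                picked.append(question)
            splits[split] = picked
        result[name] = splits
    return result
```

With `sizes = {"train": 1000, "test": 200, "validation": 100}` and the four sources, this would yield the 4000/800/400 totals shown in the composition table. Because the `seen` set is shared across sources and splits, a question can appear at most once in the whole collection, which is one way to read "global uniqueness constraints".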
Provided by:
Tamalmajumder