SPIKE-QA: A 50K-sample English dataset for SLMs

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/14584015

Resource description:

SPIKE-QA is a human-indicated QA dataset generated by the GPT4o-small model; the author collected and merged the data with a Python script. It contains 50,236 question-and-answer pairs. The samples carry no time information: each is a single, independent question with its answer (zero-shot). Topics range from basic science such as physics, chemistry, and math to complex generation problems and everyday chat.

The dataset is distributed as a set of Excel tables, each with two columns named "Question" and "Answer". The file SPIKE-QA.csv is the complete dataset in CSV form. The data was collected by prompting GPT to return its output as a list of tuples, like lis=[("Question1", "Answer1"),("Question2", "Answer2"),...], which was then transformed with a Python script.

The dataset is probably too small to pre-train an LLM from scratch, so it appears suited mainly to parameter tuning, although paraphrasing the samples might be one way to turn it into a more useful resource. It could also be used for model evaluation, given the diversity and varying length of its samples. Its most important property is accessibility: because it is a CSV file, it is easy for beginners to practice with.

Copyright reserved by the author (ORCID: 0009-0002-1449-2803). An alternative DOI for this dataset is 10.34740/kaggle/dsv/10346351.
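The tuple-list-to-CSV step mentioned in the description can be sketched as follows. This is a minimal illustration using only the Python standard library, not the author's actual script: the sample pairs and the helper names `pairs_to_csv` and `csv_to_pairs` are hypothetical, and only the column names "Question" and "Answer" come from the description.

```python
import csv
import io

# Hypothetical sample in the tuple-list shape the description mentions.
lis = [
    ("What is 2 + 2?", "4"),
    ("Name a noble gas.", "Helium, for example."),
]


def pairs_to_csv(pairs, out):
    """Write (question, answer) tuples as a two-column CSV."""
    writer = csv.writer(out)
    writer.writerow(["Question", "Answer"])  # column names from the description
    writer.writerows(pairs)


def csv_to_pairs(src):
    """Read a Question/Answer CSV back into a list of tuples."""
    reader = csv.DictReader(src)
    return [(row["Question"], row["Answer"]) for row in reader]


# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
pairs_to_csv(lis, buffer)
buffer.seek(0)
round_trip = csv_to_pairs(buffer)
```

To read the real file the same way, one would pass `open("SPIKE-QA.csv", newline="", encoding="utf-8")` to `csv_to_pairs`, assuming the file is UTF-8 encoded and uses those two column headers.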

Created:

2025-01-09
