SPIKE-QA: A 50K-sample English dataset for SLMs
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://zenodo.org/record/14584015
Description:
SPIKE-QA is a human-prompted QA dataset generated by the GPT4o-small model; the author collected and merged the outputs with a Python script. It contains 50,236 question-and-answer pairs. The samples carry no time information: each one is a single, independent question with its answer (zero-shot).
The topics range from basic science such as physics, chemistry, and math to complex generation problems and everyday chat. The dataset is provided as a set of Excel tables, each with two columns named "Question" and "Answer". The file SPIKE-QA.csv is the complete dataset in CSV form. The data were collected by prompting GPT to return its output as a list of tuple pairs, like lis=[("Question1", "Answer1"),("Question2", "Answer2"),...], and then transforming that list with a Python script.
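The author's conversion script is not published in this listing, so the following is only a minimal sketch of how a tuple list in the format above could be turned into "Question"/"Answer" rows. The function name pairs_to_csv and the choice to skip malformed items are assumptions made for illustration, not the author's method.

```python
import ast
import csv
import io


def pairs_to_csv(raw_text, out_file):
    """Parse text like 'lis=[("Q1", "A1"), ...]' and write Question/Answer rows.

    Hypothetical helper: the real script used by the dataset author is unknown.
    Returns the number of pairs written.
    """
    # Keep only the list literal, starting at the first '['.
    # This also drops an optional 'lis=' prefix.
    literal = raw_text[raw_text.index("["):].strip()
    # literal_eval only accepts Python literals, so it never executes code.
    pairs = ast.literal_eval(literal)

    writer = csv.writer(out_file)
    writer.writerow(["Question", "Answer"])
    count = 0
    for item in pairs:
        # Assumption: silently skip anything that is not a 2-tuple.
        if isinstance(item, tuple) and len(item) == 2:
            writer.writerow([str(item[0]).strip(), str(item[1]).strip()])
            count += 1
    return count


# Example with made-up sample text (not taken from the dataset):
buffer = io.StringIO()
written = pairs_to_csv('lis=[("What is 2+2?", "4"), ("Hi", "Hello, there")]', buffer)
```

Using ast.literal_eval rather than eval is a deliberate choice here: model output is untrusted text, and literal_eval raises an error instead of running anything that is not a plain literal.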
The dataset is probably too small to pre-train an LLM from scratch and appears suited mainly to fine-tuning, although paraphrasing the samples might be one way to turn it into a more useful resource. It could also be used for model evaluation, thanks to its diversity and the varying length of its samples. Most importantly, it is accessible: the dataset is a single CSV file, which makes it easy for beginners to practice with.
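As a rough sketch of the beginner workflow described above, the following standard-library code loads a CSV with "Question" and "Answer" columns, summarizes how answer lengths vary, and holds out a fraction for evaluation. The helper names, the 10% hold-out, the fixed seed, and the UTF-8 encoding in the usage comment are all assumptions, not properties documented for the dataset.

```python
import csv
import random
import statistics


def load_pairs(csv_file):
    """Read (question, answer) pairs from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [(row["Question"], row["Answer"]) for row in reader]


def answer_length_summary(pairs):
    """Summarize answer lengths, counted in whitespace-separated words."""
    lengths = [len(answer.split()) for _, answer in pairs]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


def split_pairs(pairs, eval_fraction=0.1, seed=0):
    """Shuffle a copy of the pairs and hold out a fraction for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Usage, assuming SPIKE-QA.csv sits in the working directory and is UTF-8:
# with open("SPIKE-QA.csv", newline="", encoding="utf-8") as f:
#     pairs = load_pairs(f)
# print(answer_length_summary(pairs))
# train_pairs, eval_pairs = split_pairs(pairs)
```

Because every sample is an independent pair, a simple random split is a reasonable starting point; no grouping by conversation or time is needed.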
Copyright reserved by the author (ORCID: 0009-0002-1449-2803). An alternative DOI for this dataset is 10.34740/kaggle/dsv/10346351.
Created:
2025-01-09



