
SPIKE-QA: A 50K size English dataset for SLM

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14584015
Description:
SPIKE-QA is a human-prompted QA dataset generated by the GPT4o-small model; the author collected and merged the outputs with a Python script. It contains 50,236 question-answer pairs. Each sample is a single, independent question and answer with no time information (zero-shot). Topics range from basic science (physics, chemistry, math) to complex generation problems and everyday chat. The dataset is distributed as a set of Excel tables, each with two columns named "Question" and "Answer"; the file SPIKE-QA.csv holds the complete dataset in CSV form. The data was collected by prompting GPT to return its output as a list of tuple pairs, such as lis=[("Question1", "Answer1"),("Question2", "Answer2"),...], which was then converted with a Python script. The dataset is probably too small to pre-train an LLM from scratch and seems suited only to parameter tuning, although paraphrasing the samples might be one way to turn it into a more useful resource. It could also be used for model evaluation, given the diversity and varying length of its samples. Above all, the dataset is accessible: it is a CSV file, which makes it easy for beginners to practice with. Copyright reserved by the author (ORCID: 0009-0002-1449-2803). An alternative DOI for this dataset is 10.34740/kaggle/dsv/10346351.
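The conversion described above, from a list of ("Question", "Answer") tuples to a two-column CSV, might look roughly like the sketch below. This is not the author's actual script: the function names `pairs_to_csv` and `read_pairs`, the use of `ast.literal_eval` to parse the model response, and the sample pairs are all assumptions made for illustration. Only the column names "Question" and "Answer" come from the description.

```python
import ast
import csv
import io


def pairs_to_csv(raw_output: str) -> str:
    """Convert a response shaped like a Python list of (question, answer)
    tuples into CSV text with "Question" and "Answer" columns.

    Assumption: the response is a valid Python literal, so it can be
    parsed safely with ast.literal_eval instead of eval.
    """
    pairs = ast.literal_eval(raw_output)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Question", "Answer"])  # column names from the description
    for question, answer in pairs:
        writer.writerow([question, answer])
    return buffer.getvalue()


def read_pairs(csv_text: str) -> list[dict[str, str]]:
    """Read the CSV text back into a list of {"Question": ..., "Answer": ...} rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical sample response, not taken from the dataset.
raw = '[("What is 2 + 2?", "4"), ("Name a noble gas.", "Helium, for example.")]'
csv_text = pairs_to_csv(raw)
rows = read_pairs(csv_text)
```

To read the published file instead of in-memory text, the same `csv.DictReader` call can be pointed at a file opened with `open("SPIKE-QA.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.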
Created:
2025-01-09