pavanmantha/arxiv-papers-qa
Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/pavanmantha/arxiv-papers-qa
Resource description:
---
license: mit
task_categories:
- question-answering
---
This ML Q&A dataset contains 43,713 samples, each with three fields: question, context (title + abstract), and answer.
It is built from the original dataset aalksii/ml-arxiv-papers, which contains the titles and abstracts of ML arXiv papers.
To create the question-answer pairs, the gpt-3.5-turbo API was called with the following prompt:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Only the first piece is an f-string, so the literal JSON braces below need no escaping.
    {"role": "user", "content": (
        f'Title: "{title}". Abstract: "{abstract}". '
        "Given the above content, ask an AI-related question based on "
        "the title and generate an answer based on the abstract in JSON format "
        '{"question":"What/How xxx ?", "answer": "xxx." }. '
        "The question should be less than 25 tokens and "
        "the answer between 100 to 200 tokens."
    )},
]
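The prompt asks the model to reply with a JSON object, so each reply has to be turned back into a question/answer pair. Below is a minimal sketch of that step, assuming the reply is the raw text returned by the chat API. The helper names `build_messages` and `parse_qa` are hypothetical and not taken from the original pipeline, and no API call is made here.

```python
import json

# Fixed part of the user prompt, copied from the card.
PROMPT_TAIL = (
    "Given the above content, ask an AI-related question based on "
    "the title and generate an answer based on the abstract in JSON format "
    '{"question":"What/How xxx ?", "answer": "xxx." }. '
    "The question should be less than 25 tokens and "
    "the answer between 100 to 200 tokens."
)


def build_messages(title, abstract):
    """Build the chat messages for one paper (hypothetical helper)."""
    user = f'Title: "{title}". Abstract: "{abstract}". ' + PROMPT_TAIL
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
    ]


def parse_qa(reply_text):
    """Return (question, answer) from a JSON reply, or None if it is unusable."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question, answer = data.get("question"), data.get("answer")
    if not isinstance(question, str) or not isinstance(answer, str):
        return None
    return question.strip(), answer.strip()
```

Returning `None` for malformed replies keeps the caller simple: anything that does not parse into two strings can be skipped or retried.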
After processing with the API, the low-quality Q&A samples are excluded.
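The card does not state which criteria were used to exclude low-quality samples. One plausible check, shown here purely as an assumption, is to enforce the length limits already given in the prompt, counting whitespace-separated words as a rough stand-in for tokens. The function name `within_prompt_limits` is hypothetical.

```python
def within_prompt_limits(question, answer, max_q=25, min_a=100, max_a=200):
    """Rough length check; word counts only approximate token counts (assumption)."""
    q_len = len(question.split())
    a_len = len(answer.split())
    # The prompt asks for a question under 25 tokens and an answer of 100 to 200 tokens.
    return 0 < q_len < max_q and min_a <= a_len <= max_a
```

A real tokenizer would give different counts than word splitting, so the thresholds would need adjusting before using a check like this in practice.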
Good news: Llama-2-7B-Chat has already been fine-tuned on this dataset! The checkpoint is available at https://huggingface.co/hanyueshf/llama-2-7b-chat-ml-qa.
Acknowledgement:
Thanks to Xinyu Wang (王欣宇) and Linze Li (李林泽) for their efforts in creating this dataset.
Provider:
pavanmantha



