pavanmantha/arxiv-papers-qa

Name: pavanmantha/arxiv-papers-qa
Creator: pavanmantha
Published: 2026-04-16 00:57:22
License: MIT

Source: Hugging Face. Updated: 2026-04-16. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/pavanmantha/arxiv-papers-qa

Description:

---
license: mit
task_categories:
- question-answering
---

This ML Q&A dataset contains 43,713 samples, each with three fields: question, context (title + abstract), and answer. It is built from the original dataset aalksii/ml-arxiv-papers, which contains the titles and abstracts of ML arXiv papers.

To create the question-answer pairs, the gpt-3.5-turbo API is called with the following prompt:

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Title: \"{title}\". Abstract: \"{abstract}\". Given the above content, ask an AI-related question based on the title and generate an answer based on the abstract in JSON format {{\"question\": \"What/How xxx ?\", \"answer\": \"xxx.\"}}. The question should be less than 25 tokens and the answer between 100 to 200 tokens."},
    ]

After the API responses are processed, low-quality Q&A samples are excluded.

Good news: Llama-2-7B-Chat has already been fine-tuned on this dataset! The checkpoint is available at https://huggingface.co/hanyueshf/llama-2-7b-chat-ml-qa.

Acknowledgement: Thanks to Xinyu Wang (王欣宇) and Linze Li (李林泽) for their efforts in creating this dataset.
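The generation step described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' pipeline: the helper names `build_messages` and `parse_reply` are hypothetical, the card does not say how the context field joins title and abstract (a single space is assumed here), and the card does not define its low-quality filter, so the check below only rejects replies that are not valid JSON or that lack a non-empty question and answer. The API call itself is left out; `parse_reply` takes the reply text as a plain string.

```python
import json

# Prompt text taken from the dataset card. Doubled braces keep the JSON
# example literal when str.format fills in the title and abstract.
PROMPT_TEMPLATE = (
    'Title: "{title}". Abstract: "{abstract}". Given the above content, '
    "ask an AI-related question based on the title and generate an answer "
    "based on the abstract in JSON format "
    '{{"question": "What/How xxx ?", "answer": "xxx."}}. '
    "The question should be less than 25 tokens and the answer between "
    "100 to 200 tokens."
)


def build_messages(title, abstract):
    """Return the chat messages for one paper (hypothetical helper)."""
    user_content = PROMPT_TEMPLATE.format(title=title, abstract=abstract)
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_content},
    ]


def parse_reply(reply_text, title, abstract):
    """Turn a raw model reply into one sample, or None if it is unusable.

    Assumed filter: the card does not spell out its quality criteria.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question = data.get("question")
    answer = data.get("answer")
    if not (isinstance(question, str) and isinstance(answer, str)):
        return None
    if not question.strip() or not answer.strip():
        return None
    return {
        "question": question.strip(),
        # Assumed layout: the card only says "title + abstract".
        "context": f"{title} {abstract}",
        "answer": answer.strip(),
    }
```

Each paper would go through `build_messages`, the resulting list would be sent to the chat API, and the reply text would be passed to `parse_reply`; samples that come back as `None` are dropped.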


Provider:

pavanmantha
