
pavanmantha/arxiv-papers-qa

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/pavanmantha/arxiv-papers-qa
Description:
---
license: mit
task_categories:
- question-answering
---

This ML Q&A dataset contains 43,713 samples, each with three fields: question, context (title + abstract), and answer. It is built from the original dataset aalksii/ml-arxiv-papers, which contains the titles and abstracts of ML arXiv papers.

To create question-answer pairs, the gpt-3.5-turbo API is called with the following prompt:

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Title: \"{title}\". Abstract: \"{abstract}\". Given the above content, ask an AI-related question based on the title and generate an answer based on the abstract in JSON format {{\"question\":\" What/How xxx ?\", \"answer\": \"xxx.\" }}. The question should be less than 25 tokens and the answer between 100 to 200 tokens."}
    ]

After processing with the API, low-quality Q&A samples are excluded.

Good news: Llama-2-7B-Chat has already been fine-tuned on this dataset, and the checkpoint is available at https://huggingface.co/hanyueshf/llama-2-7b-chat-ml-qa.

Acknowledgement: Thanks to Xinyu Wang (王欣宇) and Linze Li (李林泽) for their efforts in creating this dataset.
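
The prompt construction described above, together with parsing of the model's JSON reply, might be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `build_messages` and `parse_qa` are hypothetical, the actual API call is omitted, and the card does not say how low-quality samples were identified, so the sketch only rejects replies that are not well-formed JSON with non-empty `question` and `answer` strings. It uses `str.format` with doubled braces so that the literal JSON example in the prompt survives templating.

```python
import json

# Prompt wording taken from the dataset card; doubled braces keep the
# literal JSON example intact when str.format fills in the placeholders.
PROMPT_TEMPLATE = (
    'Title: "{title}". Abstract: "{abstract}". '
    "Given the above content, ask an AI-related question based on the title "
    "and generate an answer based on the abstract in JSON format "
    '{{"question":" What/How xxx ?", "answer": "xxx." }}. '
    "The question should be less than 25 tokens and the answer "
    "between 100 to 200 tokens."
)


def build_messages(title, abstract):
    """Build the chat messages for one paper (hypothetical helper)."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": PROMPT_TEMPLATE.format(title=title, abstract=abstract),
        },
    ]


def parse_qa(raw):
    """Parse a model reply into a question/answer dict, or return None.

    Assumption: a reply is unusable if it is not a JSON object with
    non-empty string fields "question" and "answer". The card's real
    quality filter is not documented.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question = data.get("question")
    answer = data.get("answer")
    if not isinstance(question, str) or not isinstance(answer, str):
        return None
    question, answer = question.strip(), answer.strip()
    if not question or not answer:
        return None
    return {"question": question, "answer": answer}
```

In use, `build_messages(title, abstract)` would be passed as the `messages` argument of a chat completion request, and `parse_qa` would be applied to the text of the reply; samples for which it returns `None` would be dropped before any further quality checks.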

Provider:
pavanmantha