paper_persi_chat

ModelScope community · updated 2025-07-24 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/ai-forever/paper_persi_chat
Description:
# PaperPersiChat Dataset

Dataset for the paper [PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management](https://aclanthology.org/2023.sigdial-1.54/)

# Dataset creation

To construct the dataset, we used part of the [Semantic Scholar Open Research Corpus](https://github.com/allenai/s2orc) as the main source of scientific publications, namely its Computer Science section. We constructed dialogues over segments of the papers, where each segment is a combination of several sections of the paper that have the same type.

### Davinci dialogues

First, we tried to reproduce a dialogue between people discussing a particular segment of the paper. In the first utterance, the first speaker introduces the section with a brief summary. Since the davinci model is capable of processing complex instructions, we selected it as the base model. We used the following prompt, concatenated with the segment text, as the model input:

`Generate a dialogue between you and another person based on the following paper. You have access to the paper. In the first utterance you should write a short summary. The other person sees only your summary and asks four (4) questions, separated by your answers.`

In this way, we collected 3588 raw outputs, which were then parsed into a summary and dialogue turns. All of these summaries were used to train the summarization component. We then filtered out outputs that could not be parsed, short dialogues, and dialogues with an inconsistent structure (including an incorrect order of speakers). This left a set of 2817 dialogues, which were used to train the models of the QA session module.

### ChatGPT dialogues

To construct higher-quality dialogues, and to account for the fact that a real person sees only summaries, we used two ChatGPT models talking to each other. The first acted as a bot and the second as a real person. The summarization model trained on the davinci outputs was used to construct the input of the second model. The prompts were the following:

1. Bot-like model. "You should briefly answer the questions on the following text. If there is no answer in the given text, then you must answer that there is not enough information. Your answers should be brief." + full context
2. Person-like model. "You should be asking short questions about an article you can't see. You only see the following summary. Your task is to ask clarifying dependent questions in order to understand the source text. You can ask only single short question at each turn." + summary produced by our summarizer

We carried out four dialogue turns between these two models for each segment. In this case no postprocessing parsing is required, since each model generates only one utterance at each step. We collected 8787 dialogues in total.

# Dataset structure

We share the resulting dataset as two json files consisting of instances with the structure shown in the following example:

```json
{
  "text": "Table 1 and Table 2 describe...",
  "dialogue": "What is the improvement achieved...",
  "meta_segments": [
    {"id": "ffa_15", "title": "Model", "section_type": "methodology"},
    {"id": "ffa_16", "title": "Comparison To Other Models", "section_type": "methodology"}
  ],
  "meta_paper": {
    "title": "Correcting Forecasts with Multifactor Neural Attention",
    "paper_id": "ffa"
  },
  "parsed_dialogue": {
    "summary": "This paper presents a multifactor attention approach...",
    "turns": [
      {"speaker": "person", "text": "What is the improvement achieved..."},
      {"speaker": "bot", "text": "The proposed approach achieves..."},
      ...
    ]
  }
}
```

The fields are:

- "text" is the entire input context.
- "dialogue" is the raw Davinci output, or the dialogue constructed by the two ChatGPT models with utterances joined by '\n' tokens.
- "meta_segments" and "meta_paper" hold additional meta information about the segments and the paper (including scipdf parsing results).
- "parsed_dialogue" contains the resulting postprocessed dialogue. Its summary is the one produced by the summarization module in the case of ChatGPT, or the summary generated by the model itself in the case of Davinci.

# Citation

If you find this dataset helpful, feel free to cite our publication [PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management](https://aclanthology.org/2023.sigdial-1.54/):

```
@inproceedings{chernyavskiy-etal-2023-paperpersichat,
    title = "{P}aper{P}ersi{C}hat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management",
    author = "Chernyavskiy, Alexander and Bregeda, Max and Nikiforova, Maria",
    booktitle = "Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sigdial-1.54",
    pages = "584--587",
}
```
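# Reading a record (illustrative)

The record layout described under "Dataset structure" can be explored with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the helper names `speakers_alternate` and `describe` are hypothetical, the sample record is a shortened copy of the example from the dataset card rather than a real row, and the strict person/bot alternation check is only an assumption about what a consistent speaker order looks like.

```python
import json

# Shortened copy of the documented example record; values are placeholders,
# not a real row from the dataset files.
SAMPLE = """
{
  "text": "Table 1 and Table 2 describe...",
  "dialogue": "What is the improvement achieved...",
  "meta_segments": [
    {"id": "ffa_15", "title": "Model", "section_type": "methodology"},
    {"id": "ffa_16", "title": "Comparison To Other Models", "section_type": "methodology"}
  ],
  "meta_paper": {
    "title": "Correcting Forecasts with Multifactor Neural Attention",
    "paper_id": "ffa"
  },
  "parsed_dialogue": {
    "summary": "This paper presents a multifactor attention approach...",
    "turns": [
      {"speaker": "person", "text": "What is the improvement achieved..."},
      {"speaker": "bot", "text": "The proposed approach achieves..."}
    ]
  }
}
"""


def speakers_alternate(turns, first="person"):
    """Return True if speakers strictly alternate, starting with `first`.

    Assumption: a well-formed dialogue reads person, bot, person, bot, ...
    """
    order = ("person", "bot") if first == "person" else ("bot", "person")
    return all(turn["speaker"] == order[i % 2] for i, turn in enumerate(turns))


def describe(record):
    """Collect a few summary facts about one dataset record."""
    parsed = record["parsed_dialogue"]
    return {
        "paper_id": record["meta_paper"]["paper_id"],
        "section_types": sorted({seg["section_type"] for seg in record["meta_segments"]}),
        "num_turns": len(parsed["turns"]),
        "well_ordered": speakers_alternate(parsed["turns"]),
    }


record = json.loads(SAMPLE)
info = describe(record)
print(info)
```

For the real files, the same `describe` call could be applied to each record after loading the json file; the file names are not given here, so they are left out of the sketch.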

Provider:
maas
Created:
2025-05-26