
ai-forever/paper_persi_chat

Hugging Face, updated 2023-10-04, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ai-forever/paper_persi_chat
Resource description:
---
license: mit
task_categories:
- text-generation
- summarization
- conversational
- question-answering
language:
- en
size_categories:
- 10K<n<100K
---

# PaperPersiChat Dataset

Dataset for the paper [PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management](https://aclanthology.org/2023.sigdial-1.54/)

# Dataset creation

To construct the dataset, we used part of the Semantic Scholar Open Research Corpus (https://github.com/allenai/s2orc), namely the Computer Science section, as the main source of scientific publications. We constructed dialogues over segments of the papers, where each segment is a combination of several sections of the paper that have the same type.

### Davinci dialogues

First, we tried to reproduce the dialogue of people discussing a particular segment of the paper. In the first utterance, the first speaker should introduce the section by providing a brief summary. Since the davinci model is capable of processing complex instructions, we selected it as the base model. We used the following prompt, concatenated with the segment text, as the model input:

`Generate a dialogue between you and another person based on the following paper. You have access to the paper. In the first utterance you should write a short summary. The other person sees only your summary and asks four (4) questions, separated by your answers.`

In this way, we collected 3588 raw outputs, which were then parsed into a summary and dialogue turns. All of these summaries were used to train the summarization component. We then filtered out unparsed outputs, short dialogues, and dialogues with inconsistent structure (including an incorrect order of speakers in the utterances). This left a set of 2817 dialogues, which were used to train the models of the QA session module.
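
The structural filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each dialogue has already been parsed into a list of turns with `speaker` and `text` keys (as in the `parsed_dialogue` field of the released files) and that a valid dialogue opens with the `person` speaker. The function names and the `min_turns` threshold are hypothetical.

```python
from typing import Dict, List

Turn = Dict[str, str]

# Assumed speaker order: the person asks, then the bot answers.
EXPECTED_ORDER = ("person", "bot")


def has_consistent_structure(turns: List[Turn], min_turns: int = 4) -> bool:
    """Return True if the dialogue is long enough and speakers strictly alternate.

    `min_turns` is a hypothetical threshold; the source does not state
    what counts as a "short" dialogue.
    """
    if len(turns) < min_turns:
        return False
    for index, turn in enumerate(turns):
        # Reject a wrong speaker order (e.g. two questions in a row).
        if turn.get("speaker") != EXPECTED_ORDER[index % 2]:
            return False
        # Reject empty utterances left over from a failed parse.
        if not turn.get("text", "").strip():
            return False
    return True


def filter_dialogues(dialogues: List[List[Turn]]) -> List[List[Turn]]:
    """Keep only the dialogues that pass the structural check."""
    return [turns for turns in dialogues if has_consistent_structure(turns)]
```

A dialogue with two consecutive `person` turns, or with fewer than `min_turns` utterances, is dropped by `filter_dialogues`.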
### ChatGPT dialogues

To construct higher-quality dialogues, and to account for the fact that a real person sees only summaries, we used two ChatGPT models talking to each other. The first acted as a bot and the second as a real person. Here, we used the summarization model trained on the davinci outputs to construct the inputs of the second model. The prompts used are the following:

1. Bot-like model. "You should briefly answer the questions on the following text. If there is no answer in the given text, then you must answer that there is not enough information. Your answers should be brief." + full context
2. Person-like model. "You should be asking short questions about an article you can't see. You only see the following summary. Your task is to ask clarifying dependent questions in order to understand the source text. You can ask only single short question at each turn." + summary produced by our summarizer.

We carried out four dialogue turns between these two models for each segment. In this case, no postprocessing or parsing is required, since each model generates only one utterance at each step. We collected 8787 dialogues in total.

# Dataset structure

We share the resulting dataset as two JSON files consisting of instances with the structure shown in the following example:

```json
{
  "text": "Table 1 and Table 2 describe...",
  "dialogue": "What is the improvement achieved...",
  "meta_segments": [
    {"id": "ffa_15", "title": "Model", "section_type": "methodology"},
    {"id": "ffa_16", "title": "Comparison To Other Models", "section_type": "methodology"}
  ],
  "meta_paper": {
    "title": "Correcting Forecasts with Multifactor Neural Attention",
    "paper_id": "ffa"
  },
  "parsed_dialogue": {
    "summary": "This paper presents a multifactor attention approach...",
    "turns": [
      {"speaker": "person", "text": "What is the improvement achieved..."},
      {"speaker": "bot", "text": "The proposed approach achieves..."},
      ...
    ]
  }
}
```

Here, "text" is the entire input context, and "dialogue" is either the raw Davinci output or the dialogue constructed by the two ChatGPT models, with utterances joined by '\n' tokens. "meta_segments" and "meta_paper" give additional meta information about the segments (including scipdf parsing results). The "parsed_dialogue" field contains the resulting postprocessed dialogue. Its summary is the one produced by the summarization module in the case of ChatGPT, or the generated summary in the case of Davinci.

# Citation

If you find this dataset helpful, feel free to cite our publication [PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management](https://aclanthology.org/2023.sigdial-1.54/):

```
@inproceedings{chernyavskiy-etal-2023-paperpersichat,
    title = "{P}aper{P}ersi{C}hat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management",
    author = "Chernyavskiy, Alexander and Bregeda, Max and Nikiforova, Maria",
    booktitle = "Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sigdial-1.54",
    pages = "584--587",
}
```
Provided by:
ai-forever
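Summary of original information

Dataset name

PaperPersiChat Dataset

Dataset purpose

Used to train a chatbot that discusses scientific papers; supports text generation, summarization, conversational, and question-answering tasks.

Dataset language

English (en)

Dataset size

Between 10,000 and 100,000 instances.

Dataset creation method

  • Source: the Computer Science section of the Semantic Scholar Open Research Corpus is used as the main data source.
  • Dialogue construction: dialogues are built over paper segments, where each segment combines several sections of the same type.
  • Davinci dialogues: the Davinci model generates a dialogue in which one speaker first gives a summary and the other asks questions. 3588 raw outputs were collected, and 2817 dialogues remained for training after filtering.
  • ChatGPT dialogues: two ChatGPT models talk to each other to produce higher-quality dialogues. 8787 dialogues were collected in total.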
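
The two-model exchange can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' pipeline: `person_model` and `bot_model` are hypothetical callables standing in for the actual ChatGPT calls, one "turn" is taken to mean one question plus one answer, and the way the prompt and text are concatenated is an illustrative choice. The two prompt strings are quoted from the dataset card.

```python
from typing import Callable, Dict, List

BOT_PROMPT = (
    "You should briefly answer the questions on the following text. "
    "If there is no answer in the given text, then you must answer that "
    "there is not enough information. Your answers should be brief."
)
PERSON_PROMPT = (
    "You should be asking short questions about an article you can't see. "
    "You only see the following summary. Your task is to ask clarifying "
    "dependent questions in order to understand the source text. "
    "You can ask only single short question at each turn."
)

Turn = Dict[str, str]
# Hypothetical interface: (system prompt, dialogue history) -> next utterance.
ChatFn = Callable[[str, List[Turn]], str]


def simulate_dialogue(context: str, summary: str, person_model: ChatFn,
                      bot_model: ChatFn, n_exchanges: int = 4) -> Dict:
    """Alternate a person-like and a bot-like model for n_exchanges rounds."""
    turns: List[Turn] = []
    for _ in range(n_exchanges):
        # The person-like model sees only the summary, never the full text.
        question = person_model(PERSON_PROMPT + " " + summary, turns)
        turns.append({"speaker": "person", "text": question})
        # The bot-like model answers from the full segment text.
        answer = bot_model(BOT_PROMPT + " " + context, turns)
        turns.append({"speaker": "bot", "text": answer})
    return {
        "dialogue": "\n".join(turn["text"] for turn in turns),
        "parsed_dialogue": {"summary": summary, "turns": turns},
    }


# Stub models for a dry run without any API access.
def fake_person(system_prompt: str, history: List[Turn]) -> str:
    return f"Question {len(history) // 2 + 1}?"


def fake_bot(system_prompt: str, history: List[Turn]) -> str:
    return f"Answer to: {history[-1]['text']}"
```

Calling `simulate_dialogue("full segment text", "short summary", fake_person, fake_bot)` yields eight alternating utterances. Because each model emits exactly one utterance per step, no parsing of the output is needed.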

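Dataset structure

The dataset is shared as two JSON files; each instance has the following structure:

  • text: the entire input context.
  • dialogue: the dialogue generated by Davinci or by the two ChatGPT models.
  • meta_segments: meta information about the segments, including ID, title, and section type.
  • meta_paper: meta information about the paper, including title and ID.
  • parsed_dialogue: the postprocessed dialogue, containing the summary and the dialogue turns.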
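
Reading these fields might look like the sketch below. It assumes that each file holds a single JSON array of instances, which the listing does not confirm, and the helper names are hypothetical. The real file names are not given here, so the path is left to the caller.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def load_instances(path: str) -> List[Dict]:
    """Load one dataset file, assuming it is a JSON array of instances."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of instances")
    return data


def iter_qa_pairs(instance: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) pairs from the parsed dialogue turns."""
    turns = instance["parsed_dialogue"]["turns"]
    # Pair each even-indexed turn with the turn that follows it.
    for question, answer in zip(turns[::2], turns[1::2]):
        if question["speaker"] == "person" and answer["speaker"] == "bot":
            yield question["text"], answer["text"]
```

Each instance then gives access to the paper title via `instance["meta_paper"]["title"]`, to the summary via `instance["parsed_dialogue"]["summary"]`, and to question-answer pairs through `iter_qa_pairs(instance)`.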

Citation

If you use this dataset, please cite the following publication:

@inproceedings{chernyavskiy-etal-2023-paperpersichat,
    title = "{P}aper{P}ersi{C}hat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management",
    author = "Chernyavskiy, Alexander and Bregeda, Max and Nikiforova, Maria",
    booktitle = "Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sigdial-1.54",
    pages = "584--587",
}
