ai-forever/paper_persi_chat
Source: Hugging Face. Updated 2023-10-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ai-forever/paper_persi_chat
Resource description:
---
license: mit
task_categories:
- text-generation
- summarization
- conversational
- question-answering
language:
- en
size_categories:
- 10K<n<100K
---
# PaperPersiChat Dataset
Dataset for the paper [PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management](https://aclanthology.org/2023.sigdial-1.54/)
# Dataset creation
To construct the dataset, we used part of the Semantic Scholar Open Research Corpus (S2ORC, https://github.com/allenai/s2orc) as the main source of scientific publications, namely the Computer Science section. We constructed dialogues over segments of the papers, where each segment combines several sections of the paper that have the same type.
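The segment construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes that only consecutive sections of the same type are merged, that each section carries a hypothetical `text` key, and that sections are joined by a newline. The `ffa_14` entry and the `"introduction"` type are made-up toy values.

```python
from itertools import groupby


def build_segments(sections):
    """Merge consecutive sections that share a section_type into one segment.

    Assumption: each section is a dict with "id", "title", "section_type"
    and a hypothetical "text" key holding the section body.
    """
    segments = []
    for section_type, group in groupby(sections, key=lambda s: s["section_type"]):
        group = list(group)
        segments.append({
            "section_type": section_type,
            # Same shape as the "meta_segments" field of the dataset.
            "meta_segments": [
                {"id": s["id"], "title": s["title"], "section_type": s["section_type"]}
                for s in group
            ],
            # Assumption: section bodies are joined with a newline.
            "text": "\n".join(s["text"] for s in group),
        })
    return segments


# Toy input; the first entry and all "text" values are invented for illustration.
sections = [
    {"id": "ffa_14", "title": "Introduction", "section_type": "introduction",
     "text": "Intro text."},
    {"id": "ffa_15", "title": "Model", "section_type": "methodology",
     "text": "Model text."},
    {"id": "ffa_16", "title": "Comparison To Other Models",
     "section_type": "methodology", "text": "Comparison text."},
]
segments = build_segments(sections)
```

With this toy input, the two methodology sections end up in a single segment, while the introduction forms its own.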
### Davinci dialogues
First, we tried to reproduce the dialogue of people discussing a particular segment of the paper. As the first utterance, the first speaker should introduce the section by providing a brief summary. Since the davinci model is capable of processing complex instructions, we selected it as the base model. We used the following prompt concatenated with the segment text as the model input:
`Generate a dialogue between you and another person based on the following paper. You have access to the paper. In the first utterance you should write a short summary. The other person sees only your summary and asks four (4) questions, separated by your answers.`
In this way, we collected 3588 raw outputs, which were then parsed into a summary and dialogue turns. All of these summaries were used to train the summarization component. We then filtered out unparsed outputs, short dialogues, and dialogues with inconsistent structure (including an incorrect order of speakers). This yielded a set of 2817 dialogues that were used to train the models of the QA session module.
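A filter of the kind described above might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: the `min_turns` threshold is hypothetical, and the expected speaker order (person first, then bot, alternating) is inferred from the example record in this card.

```python
def is_consistent(parsed, min_turns=4):
    """Return True if a parsed dialogue has a summary and well-ordered turns.

    `parsed` follows the "parsed_dialogue" layout: {"summary": str, "turns": [...]}.
    `min_turns` is a hypothetical threshold for "short" dialogues.
    """
    if not parsed or not parsed.get("summary"):
        return False  # treat as an unparsed output
    turns = parsed.get("turns") or []
    if len(turns) < min_turns or len(turns) % 2 != 0:
        return False  # too short, or a question left without an answer
    expected = ("person", "bot")  # assumed speaker order
    for i, turn in enumerate(turns):
        if turn.get("speaker") != expected[i % 2]:
            return False  # incorrect order of speakers
        if not (turn.get("text") or "").strip():
            return False  # empty utterance
    return True


good = {
    "summary": "S",
    "turns": [
        {"speaker": "person", "text": "Q1?"},
        {"speaker": "bot", "text": "A1."},
        {"speaker": "person", "text": "Q2?"},
        {"speaker": "bot", "text": "A2."},
    ],
}
# Same turns, but with the first two swapped so the bot speaks first.
bad_order = {"summary": "S", "turns": [good["turns"][1], good["turns"][0]] + good["turns"][2:]}
kept = [d for d in (good, bad_order, None) if is_consistent(d)]
```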
### ChatGPT dialogues
To construct higher-quality dialogues, and to account for the fact that a real person sees only summaries, we used two ChatGPT models talking to each other. The first acted as a bot and the second as a real person. Here, we used the summarization model trained on the davinci outputs to construct the inputs of the second model. The prompts used are the following:
1. Bot-like model. "You should briefly answer the questions on the following text. If there is no answer in the given text, then you must answer that there is not enough information. Your answers should be brief." + full context
2. Person-like model. "You should be asking short questions about an article you can't see. You only see the following summary. Your task is to ask clarifying dependent questions in order to understand the source text. You can ask only single short question at each turn." + summary produced by our summarizer.
We carried out four dialogue turns between these two models for each segment. In this case, postprocessing parsing is not required, since each model generates only one utterance at each step. We collected 8787 dialogues in total.
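The two-model exchange can be sketched as a simple loop. The prompts are copied from the list above; everything else is an assumption for illustration: `ask_person` and `ask_bot` are hypothetical callables standing in for the actual ChatGPT calls, the prompt and its context are joined by a newline, and "four dialogue turns" is read here as four question-answer exchanges.

```python
BOT_PROMPT = (
    "You should briefly answer the questions on the following text. "
    "If there is no answer in the given text, then you must answer that "
    "there is not enough information. Your answers should be brief."
)
PERSON_PROMPT = (
    "You should be asking short questions about an article you can't see. "
    "You only see the following summary. Your task is to ask clarifying "
    "dependent questions in order to understand the source text. "
    "You can ask only single short question at each turn."
)


def simulate_dialogue(context, summary, ask_person, ask_bot, n_turns=4):
    """Alternate between a person-like and a bot-like model.

    ask_person / ask_bot are hypothetical callables: (prompt, history) -> str.
    """
    turns = []
    for _ in range(n_turns):
        # The person-like model sees only the summary.
        question = ask_person(PERSON_PROMPT + "\n" + summary, turns)
        turns.append({"speaker": "person", "text": question})
        # The bot-like model sees the full segment text.
        answer = ask_bot(BOT_PROMPT + "\n" + context, turns)
        turns.append({"speaker": "bot", "text": answer})
    return turns


# Stub models so the sketch runs without any API access.
def fake_person(prompt, history):
    return f"Question {len(history) // 2 + 1}?"


def fake_bot(prompt, history):
    return "There is not enough information."


turns = simulate_dialogue("Full segment text.", "Short summary.", fake_person, fake_bot)
raw_dialogue = "\n".join(t["text"] for t in turns)
```

Because each call produces exactly one utterance, the turns can be stored directly with no parsing step, which matches the remark above about postprocessing.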
# Dataset structure
We share the resulting dataset via two JSON files consisting of instances with the structure shown in the following example:
```json
{
  "text": "Table 1 and Table 2 describe...",
  "dialogue": "What is the improvement achieved...",
  "meta_segments": [
    {"id": "ffa_15", "title": "Model", "section_type": "methodology"},
    {"id": "ffa_16", "title": "Comparison To Other Models", "section_type": "methodology"}
  ],
  "meta_paper": {
    "title": "Correcting Forecasts with Multifactor Neural Attention",
    "paper_id": "ffa"
  },
  "parsed_dialogue": {
    "summary": "This paper presents a multifactor attention approach...",
    "turns": [
      {"speaker": "person", "text": "What is the improvement achieved..."},
      {"speaker": "bot", "text": "The proposed approach achieves..."},
      ...
    ]
  }
}
```
Here, "text" is the entire input context, "dialogue" is either the raw Davinci output or the dialogue constructed by the two ChatGPT models with utterances joined by '\n' tokens, and "meta_segments" and "meta_paper" hold additional meta information about the segments and the paper (including scipdf parsing results). The "parsed_dialogue" field contains the resulting postprocessed dialogue. Its summary is the one produced by the summarization module in the case of ChatGPT, or the summary generated by the model itself in the case of Davinci.
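As a rough illustration of how these fields can be consumed, the sketch below parses one record with the standard `json` module and pairs each question with its answer. The record is the example above with the elided turns dropped; the helper name `qa_pairs` is hypothetical, and the layout of the two shared files (for instance, whether each holds a single JSON list) is not specified here, so the record is parsed from an inline string instead.

```python
import json

# One record in the documented structure (texts are truncated as in the example).
record_json = """
{
  "text": "Table 1 and Table 2 describe...",
  "dialogue": "What is the improvement achieved...",
  "meta_segments": [
    {"id": "ffa_15", "title": "Model", "section_type": "methodology"},
    {"id": "ffa_16", "title": "Comparison To Other Models", "section_type": "methodology"}
  ],
  "meta_paper": {
    "title": "Correcting Forecasts with Multifactor Neural Attention",
    "paper_id": "ffa"
  },
  "parsed_dialogue": {
    "summary": "This paper presents a multifactor attention approach...",
    "turns": [
      {"speaker": "person", "text": "What is the improvement achieved..."},
      {"speaker": "bot", "text": "The proposed approach achieves..."}
    ]
  }
}
"""
record = json.loads(record_json)


def qa_pairs(record):
    """Return (question, answer) pairs from the postprocessed dialogue."""
    turns = record["parsed_dialogue"]["turns"]
    pairs = []
    # Walk the turns two at a time: a person question followed by a bot answer.
    for question, answer in zip(turns[0::2], turns[1::2]):
        if question["speaker"] == "person" and answer["speaker"] == "bot":
            pairs.append((question["text"], answer["text"]))
    return pairs


pairs = qa_pairs(record)
section_types = {seg["section_type"] for seg in record["meta_segments"]}
```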
# Citation
If you find this dataset helpful, feel free to cite our publication [PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management](https://aclanthology.org/2023.sigdial-1.54/):
```
@inproceedings{chernyavskiy-etal-2023-paperpersichat,
title = "{P}aper{P}ersi{C}hat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management",
author = "Chernyavskiy, Alexander and
Bregeda, Max and
Nikiforova, Maria",
booktitle = "Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue",
month = sep,
year = "2023",
address = "Prague, Czechia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.sigdial-1.54",
pages = "584--587",
}
```
Provided by:
ai-forever
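Summary of the original information
Dataset name
PaperPersiChat Dataset
Dataset purpose
Training chatbots that discuss scientific papers; supports text generation, summarization, conversational, and question-answering tasks.
Dataset language
English (en)
Dataset size
Between 10,000 and 100,000 instances.
Dataset creation method
- Source: the Computer Science section of the Semantic Scholar Open Research Corpus is used as the main source of data.
- Dialogue construction: segments for the dialogues are built by combining sections of the same type within a paper.
- Davinci dialogues: dialogues are generated with the Davinci model, where one speaker first gives a summary and the other asks questions; 3588 raw outputs were collected, and 2817 dialogues remained for training after filtering.
- ChatGPT dialogues: two ChatGPT models interact to produce higher-quality dialogues; 8787 dialogues were collected in total.
Dataset structure
The dataset is shared via two JSON files; each instance has the following structure:
- text: the entire input context.
- dialogue: the dialogue generated by Davinci or by the two ChatGPT models.
- meta_segments: meta information about the segments, including ID, title, and section type.
- meta_paper: meta information about the paper, including title and ID.
- parsed_dialogue: the postprocessed dialogue, containing the summary and the dialogue turns.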



