Normative Documents Interactive Question Answering Dataset (NDIQAD)
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/b64ggb36ht
Description:
This dataset consists of questions and answers based on selected normative documents. It includes 12 normative documents from different universities and banks, such as the Study Rules of Mendel University in Brno, the MIT Term Regulations and Examination Policies, and HSBC's Terms and Conditions for Personal Line of Credit (PLOC). These documents were manually annotated with 1767 questions by 15 annotators. The average document length is 14 pages. On average, the dataset has 12.8 questions per page and 1.1 questions per paragraph, with 33% of paragraphs covered.
Each question is a single sentence. Each answer is an exact span of the document text. The search scope of a question is always one whole document. Each question-answer pair is also accompanied by a path that leads through the document's headings from the document's root to the section containing the answer. This structural information enables testing of interactive question answering, in which the QA system asks supplementary questions to narrow down the candidate answers by disambiguating the document section.
The dataset contains the original documents in PDF together with the questions and answers in CSV files. The CSV files use a semicolon [;] as the separator and, optionally, double quotes ["] to enclose strings. A double quote that appears inside a quoted text is doubled [""]. Each row has four attributes, as in the example dataset item below.
Document: Study Rules of Mendel University in Brno
Path: Study Rules / 2 Study in Bachelor’s and Master’s Degree Programs / 2.11 Study interruption
Question: How can I interrupt my study?
Answer: Student’s study may be interrupted at the student’s request or ex officio.
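The CSV conventions described above map directly onto Python's standard csv module. The following is a minimal sketch, not the authors' loader: the column order (Document; Path; Question; Answer), the absence of a header row, the inline SAMPLE text and the helper name read_items are all assumptions made for illustration. The path is split on " / ", which would misbehave for a heading that itself contains that sequence.

```python
import csv
import io

# Hypothetical inline sample. The assumed column order is
# Document; Path; Question; Answer (not confirmed by the dataset description).
SAMPLE = (
    'Study Rules of Mendel University in Brno;'
    '"Study Rules / 2 Study in Bachelor’s and Master’s Degree Programs / 2.11 Study interruption";'
    'How can I interrupt my study?;'
    '"Student’s study may be interrupted at the student’s request or ex officio."\n'
    # Second, invented row that exercises doubled quotes and an embedded semicolon.
    'Example document;"Root / 1 ""Quoted"" heading";Example question?;"An answer; with a semicolon."\n'
)

FIELDS = ("document", "path", "question", "answer")


def read_items(stream):
    """Parse semicolon-separated rows into dicts with the path split into headings."""
    reader = csv.reader(stream, delimiter=";", quotechar='"', doublequote=True)
    items = []
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed rows
        item = dict(zip(FIELDS, row))
        # Turn "A / B / C" into ["A", "B", "C"] (root first, answer section last).
        item["path"] = [part.strip() for part in item["path"].split(" / ")]
        items.append(item)
    return items


items = read_items(io.StringIO(SAMPLE))
for item in items:
    print(item["path"][-1], "->", item["question"])
```

For a real file, the same function could be called on a handle opened with newline="" as the csv module recommends; the file encoding would need to be checked, since it is not stated in the description.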
Every question appears twice in the dataset. The first version is the original question as written by the annotator. The second version, marked "optimized", contains the same question with first-person words ("I", "me", "my") replaced by the actor of the document (a student or a client). This replacement is done automatically and can improve QA performance.
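The description does not say how the automatic replacement is implemented. As a rough illustration only, a naive regular-expression version might look like the sketch below; the function name optimize_question, the default actor string and the possessive handling are assumptions, and the dataset's actual "optimized" questions may differ.

```python
import re

# Whole-word match for the three first-person forms named in the description.
FIRST_PERSON = re.compile(r"\b(I|[Mm]e|[Mm]y)\b")


def optimize_question(question, actor="the student"):
    """Replace "I", "me" and "my" with the document's actor (naive sketch)."""
    replacements = {"i": actor, "me": actor, "my": actor + "'s"}

    def substitute(match):
        text = replacements[match.group(0).lower()]
        # Keep the sentence capitalized when the replaced word opens it.
        if match.start() == 0:
            text = text[0].upper() + text[1:]
        return text

    return FIRST_PERSON.sub(substitute, question)


print(optimize_question("How can I interrupt my study?"))
# prints: How can the student interrupt the student's study?
```

This sketch ignores contractions such as "I'm" and verb agreement, so it should be read as a starting point rather than a faithful reproduction.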
This dataset is unique mainly because of the type of documents it contains. Our focus is on normative documents with strict, consistent formatting, numbered headings and paragraphs, low use of pronouns, etc. Unlike the majority of other research in this field, we do not focus on national laws; instead, we focus on documents used to support internal processes in large organisations. We found that current QA methods do not perform well on this kind of document. Hence, we have introduced this dataset to address these issues, and we have also presented a new approach to QA on normative documents; see https://www.sciencedirect.com/science/article/pii/S0957417421016158?dgcid=coauthor
Created:
2021-12-20



