scherrmann/adhoc_quad

Name: scherrmann/adhoc_quad
Creator: scherrmann
Published: 2023-11-16 09:24:41
License: not specified

Source: Hugging Face; updated 2023-11-16; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/scherrmann/adhoc_quad

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 10365360
    num_examples: 6659
  - name: validation
    num_bytes: 1157605
    num_examples: 748
  download_size: 3088466
  dataset_size: 11522965
---

# Dataset Card for "adhoc_quad"

## Dataset Summary

The German Ad-Hoc Question Answering Dataset (AdHocQuAD) is a reading comprehension dataset for German financial texts. It is a machine-generated dataset, where ChatGPT (version 3.5) is used to ask questions about a set of German ad-hoc announcements. The answer to every question is a segment of text, or span, from the corresponding reading passage.

## Supported Tasks and Leaderboards

extractive-qa, closed-domain-qa, open-domain-qa, text-retrieval: This dataset is intended to be used for open-domain-qa, but can also be used for information retrieval tasks.

## Languages

The texts in the dataset are in German (de).

# Dataset Structure

## Data Instances

A sample from the training set is provided below:

    {
        "context": "This is a test context with eight words.",
        "id": "1",
        "question": "How many words contains the context?",
        "answers": {
            "answer_start": [28],
            "text": ["eight"]
        }
    }

## Data Fields

- id: a string feature.
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - text: a string feature.
  - answer_start: an int32 feature.

# Additional Information

## Details on the Generation of the Ad-Hoc QuAD Database

To construct the ad-hoc QuAD database, I use 9,132 German ad-hoc announcements as context strings. Announcements exceeding 15 sentences are truncated to ensure compatibility with BERT's input limitations in subsequent applications.
After that, questions and matching answers that refer to the given ad-hoc announcements have to be identified. Since manual generation of questions and answers is both resource-intensive and time-consuming, I employ OpenAI's ChatGPT model (gpt-3.5-turbo). In a first step, I ask ChatGPT to generate three suitable questions for a given announcement. The prompt looks as follows:

    Create three questions for the following text.
    It should be possible to answer the question with a substring of the input text.
    The questions should ask for different aspects of the input.
    The questions should be in German.

    Text: <<context>>
    Question:

To create an extractive QuAD task, the model must be instructed such that every question can be answered with a substring of the provided announcement. This prevents the model from generating open-ended questions or questions that require external knowledge not present in the announcement. Additionally, the model is directed to address different aspects of the announcement to minimize question redundancy. Notably, although the context strings are in German, ChatGPT occasionally formulates questions in English. To counteract this, the prompt explicitly requires questions to be posed in German. This methodology yields 9,132 unique context-question pairs.

In a second step, I use ChatGPT again to extract the substring that answers the question for a given context string. The respective prompt is:

    You have given a text and a question to that text. Find the answer as a substring of the input text.
    It is crucial that the answer is contained exactly as a substring in the input text, even if this implies that the answer is not a full sentence.

    Example:
    Text: 'Herr Müller ist 37 Jahre alt.'
    Question: 'Wie alt ist Herr Müller?'
    Answer: '37 Jahre'

    Text: <<context>>
    Question: <<question>>
    Answer:

Evaluations of this extraction step revealed a recurrent issue: ChatGPT frequently transformed the substring into a complete sentence, thereby compromising the extractive nature of the resulting database. Emphasizing the necessity for extractive answers, coupled with a demonstrative example, markedly improved the outcomes. However, 1,725 of the responses generated by ChatGPT are not substrings of the context, leading to a final ad-hoc QuAD database size of 7,407 (6,659 training and 748 validation examples).

The code for creating the dataset can be found [here](https://github.com/FinTexIFB/AdHocQuAD).

## Dataset Curators

The dataset was created by Moritz Scherrmann using ChatGPT 3.5 turbo.

## Citation Information

    @misc{scherrmann2023german,
          title={German FinBERT: A German Pre-trained Language Model},
          author={Moritz Scherrmann},
          year={2023},
          eprint={2311.08793},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

Provider:

scherrmann

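Summary of original information

Dataset card "adhoc_quad"

Dataset summary

The German Ad-Hoc Question Answering Dataset (AdHocQuAD) is a reading comprehension dataset for German financial texts. It is machine-generated: ChatGPT (version 3.5) is used to ask questions about a set of German ad-hoc announcements. The answer to every question is a segment of text, or span, from the corresponding reading passage.

Supported tasks and leaderboards

extractive-qa, closed-domain-qa, open-domain-qa, text-retrieval: The dataset is intended to be used for open-domain-qa, but can also be used for information retrieval tasks.

Languages

The texts in the dataset are in German (de).

Dataset structure

Data instances

A sample from the training set is provided below: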

{
    "context": "This is a test context with eight words.",
    "id": "1",
    "question": "How many words contains the context?",
    "answers": {
        "answer_start": [28],
        "text": ["eight"]
    }
}
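
The sample suggests that `answer_start` is a 0-based character offset into `context`. Under that assumption, a minimal sketch of a consistency check could look as follows; the helper name `answer_spans_match` is hypothetical and not part of the dataset or its tooling.

```python
def answer_spans_match(example: dict) -> bool:
    """Return True if every answer text sits at its stated offset in the context."""
    context = example["context"]
    answers = example["answers"]
    return all(
        # Assumption: answer_start counts characters from the start of the context.
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


sample = {
    "context": "This is a test context with eight words.",
    "id": "1",
    "question": "How many words contains the context?",
    "answers": {"answer_start": [28], "text": ["eight"]},
}

print(answer_spans_match(sample))  # True
```

A check of this kind is a quick way to confirm that a record is still extractive after any preprocessing that changes the context string.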

Data fields

id: a string feature.
context: a string feature.
question: a string feature.
answers: a dictionary feature containing:
    text: a string feature.
    answer_start: an int32 feature.
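
The field list can be turned into a small schema check. The sketch below assumes that `answers` holds two parallel lists, as in the sample record; `validate_record` and `_is_int32` are hypothetical helper names.

```python
def _is_int32(value) -> bool:
    """True for plain integers that fit into a signed 32-bit range."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and -2**31 <= value < 2**31
    )


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for field in ("id", "context", "question"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field}: expected a string")

    answers = record.get("answers")
    if not isinstance(answers, dict):
        problems.append("answers: expected a dictionary")
        return problems

    texts = answers.get("text")
    starts = answers.get("answer_start")
    if not (isinstance(texts, list) and all(isinstance(t, str) for t in texts)):
        problems.append("answers.text: expected a list of strings")
    if not (isinstance(starts, list) and all(_is_int32(s) for s in starts)):
        problems.append("answers.answer_start: expected a list of int32 values")
    # Assumption: text and answer_start are parallel lists of equal length.
    if isinstance(texts, list) and isinstance(starts, list) and len(texts) != len(starts):
        problems.append("answers: text and answer_start differ in length")
    return problems
```

With the Hugging Face `datasets` library installed, records to validate could be obtained with `load_dataset("scherrmann/adhoc_quad", split="train")`; that call needs network access and is not exercised here.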

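Additional information

Details on the generation of the Ad-Hoc QuAD database

To construct the Ad-Hoc QuAD database, I use 9,132 German ad-hoc announcements as context strings. Announcements exceeding 15 sentences are truncated to ensure compatibility with BERT's input limitations in subsequent applications.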
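
The truncation step can be sketched as below. The text does not say which sentence splitter was used, so the regular expression here is an assumption and deliberately naive: it will split incorrectly on German abbreviations and ordinal dates such as "31. Dezember". The function names are hypothetical.

```python
import re

MAX_SENTENCES = 15  # limit stated in the text


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def truncate_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> str:
    """Keep at most max_sentences sentences from the start of the text."""
    return " ".join(split_sentences(text)[:max_sentences])
```

A production pipeline would more likely rely on a sentence tokenizer trained for German; the cut-off logic stays the same.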

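Next, questions and matching answers that refer to the given ad-hoc announcements have to be identified. Since manual generation of questions and answers is both resource-intensive and time-consuming, I employ OpenAI's ChatGPT model (gpt-3.5-turbo).

In a first step, I ask ChatGPT to generate three suitable questions for a given announcement. The prompt looks as follows: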

Create three questions for the following text. 
It should be possible to answer the question with a substring of the input text. 
The questions should ask for different aspects of the input. 
The questions should be in German.

Text: <<context>>
Question:
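
Filling the question-generation prompt for one announcement could be sketched as follows. The template text is taken from the prompt above; the exact whitespace and the names `QUESTION_PROMPT` and `build_question_prompt` are assumptions, and no model call is made here.

```python
QUESTION_PROMPT = (
    "Create three questions for the following text. \n"
    "It should be possible to answer the question with a substring of the input text. \n"
    "The questions should ask for different aspects of the input. \n"
    "The questions should be in German.\n"
    "\n"
    "Text: <<context>>\n"
    "Question:"
)


def build_question_prompt(context: str) -> str:
    """Insert one announcement into the <<context>> placeholder."""
    return QUESTION_PROMPT.replace("<<context>>", context)


print(build_question_prompt("Herr Müller ist 37 Jahre alt."))
```

The resulting string would then be sent to the chat model as a single user message.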

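To create an extractive QuAD task, the model must be instructed such that every question can be answered with a substring of the provided announcement. This prevents the model from generating open-ended questions or questions that require external knowledge not present in the announcement. In addition, the model is directed to address different aspects of the announcement to reduce question redundancy. Although the context strings are in German, ChatGPT occasionally formulates questions in English. To counteract this, the prompt explicitly requires questions to be posed in German. This methodology yields 9,132 unique context-question pairs.

In a second step, I use ChatGPT again to extract the substring that answers the question for a given context string. The respective prompt is: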

You have given a text and a question to that text. Find the answer as a substring of the input text. 
It is crucial that the answer is contained exactly as a substring in the input text, even if this implies that the answer is not a full sentence. 

Example:
Text: Herr Müller ist 37 Jahre alt.
Question: Wie alt ist Herr Müller?
Answer: 37 Jahre

Text: <<context>>
Question: <<question>>
Answer:
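
The extraction step can be sketched with the model call abstracted away. In the sketch below, `ask` is a hypothetical callable that sends a prompt to a chat model and returns its reply; it stands in for whatever client the author used, which this page does not show. The placeholders are filled in a single pass so that text inside the context is never re-scanned for placeholders.

```python
import re
from typing import Callable

ANSWER_PROMPT = (
    "You have given a text and a question to that text. "
    "Find the answer as a substring of the input text. \n"
    "It is crucial that the answer is contained exactly as a substring in the "
    "input text, even if this implies that the answer is not a full sentence. \n"
    "\n"
    "Example:\n"
    "Text: Herr Müller ist 37 Jahre alt.\n"
    "Question: Wie alt ist Herr Müller?\n"
    "Answer: 37 Jahre\n"
    "\n"
    "Text: <<context>>\n"
    "Question: <<question>>\n"
    "Answer:"
)


def build_answer_prompt(context: str, question: str) -> str:
    """Fill both placeholders in one pass over the template."""
    values = {"context": context, "question": question}
    return re.sub(r"<<(\w+)>>", lambda m: values[m.group(1)], ANSWER_PROMPT)


def extract_answer(context: str, question: str, ask: Callable[[str], str]) -> str:
    """Ask the model for an answer span and trim surrounding whitespace."""
    return ask(build_answer_prompt(context, question)).strip()
```

Passing the model as a callable keeps the sketch testable without network access; a stub such as `lambda prompt: " 2 Euro\n"` is enough to exercise it.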

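Evaluations of this extraction step revealed a recurrent issue: ChatGPT frequently transformed the substring into a complete sentence, thereby compromising the extractive nature of the resulting database. Emphasizing the necessity for extractive answers, together with a demonstrative example, markedly improved the results. However, 1,725 of the answers generated by ChatGPT are not substrings of the context, leading to a final Ad-Hoc QuAD database size of 7,407.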
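
The substring filter can be sketched as follows. Taking the first occurrence of the answer as `answer_start` is an assumption, since the text does not say how offsets were assigned, and `to_record` is a hypothetical helper name. The closing assertion only restates the arithmetic given in the text.

```python
from typing import Optional


def to_record(example_id: str, context: str, question: str, answer: str) -> Optional[dict]:
    """Build a record if the answer is an exact substring of the context, else None."""
    start = context.find(answer) if answer else -1
    if start == -1:
        return None  # the model rephrased the answer, so the pair is discarded
    return {
        "id": example_id,
        "context": context,
        "question": question,
        # Assumption: the first occurrence of the answer defines answer_start.
        "answers": {"text": [answer], "answer_start": [start]},
    }


context = "Herr Müller ist 37 Jahre alt."
question = "Wie alt ist Herr Müller?"
kept = to_record("1", context, question, "37 Jahre")
dropped = to_record("2", context, question, "Er ist 37 Jahre alt.")

# 9,132 generated pairs minus 1,725 non-substring answers leaves 7,407 records.
assert 9_132 - 1_725 == 7_407
```

Here `kept` is a full record and `dropped` is `None`, because the second answer is a rephrased sentence rather than a span of the context.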

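The code for creating the dataset can be found at https://github.com/FinTexIFB/AdHocQuAD.

Dataset curators

The dataset was created by Moritz Scherrmann using ChatGPT 3.5 turbo.

Citation information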

@misc{scherrmann2023german,
      title={German FinBERT: A German Pre-trained Language Model}, 
      author={Moritz Scherrmann},
      year={2023},
      eprint={2311.08793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
