wirthual/dip-bundestag-qa
收藏Hugging Face2023-08-02 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/wirthual/dip-bundestag-qa
下载链接
链接失效反馈官方服务:
资源简介:
---
task_categories:
- question-answering
language:
- de
tags:
- government
- bundestag
size_categories:
- 10K<n<100K
---
# Dataset Card for Dataset Name
## Dataset Description
- **Homepage:** Quelle: https://dip.bundestag.de/
### Dataset Summary
Extracted the questions and answers from the DIP Service. All PDFs which are used are of type "Antwort".
In the current version the documents are between the following dates:
START_DATE = "2015-05-07"
END_DATE = "2023-07-09"
### Languages
German
## Dataset Structure
Each row of the dataset consists of the following fields: question, answer and document id.
The document id can be used to retrieve the meta data for the underlying PDF file by sending a request to the follwoing endpoint:
https://search.dip.bundestag.de/api/v1/swagger-ui/#/Drucksachen/getDrucksache
### Data Fields
Question
Answer
doc_id
### Data Splits
No split
## Dataset Creation
Download PDF, extract text, Classify entries based on font size, [dehyphenize](https://github.com/pd3f/dehyphen) text, build pairs when possible.
## Dataset Curation
At this point, no complex curation of the dataset was performed.
Answers which simply referred to other answers were filtered out by these regexes:
```
'^Auf die Antwort.*verwiesen.$'
'^Es wird auf die Antwort.*verwiesen.$'
```
#### Who are the source language producers?
https://dip.bundestag.de/
### Licensing Information
Quelle: Deutscher Bundestag/Bundesrat – DIP / "Bundestags-Drucksache"
For further detail see:
https://dip.bundestag.de/documents/nutzungsbedingungen_dip.pdf
提供机构:
wirthual
原始信息汇总
数据集卡片
数据集描述
数据集概述
从DIP服务中提取的问题和答案。所有使用的PDF文件类型为“Antwort”。当前版本中的文档日期范围为:
- 开始日期:2015-05-07
- 结束日期:2023-07-09
语言
德语
数据集结构
每行数据包含以下字段:
- 问题
- 答案
- 文档ID
文档ID可用于通过向以下端点发送请求来检索底层PDF文件的元数据:
- 端点:https://search.dip.bundestag.de/api/v1/swagger-ui/#/Drucksachen/getDrucksache
数据字段
- 问题
- 答案
- 文档ID
数据分割
无分割
数据集创建
下载PDF,提取文本,根据字体大小分类条目,去连字符文本,尽可能构建对。
数据集精选
目前,未执行复杂的数据集精选。通过以下正则表达式过滤掉仅引用其他答案的答案:
^Auf die Antwort.*verwiesen.$ ^Es wird auf die Antwort.*verwiesen.$
源语言生产者
https://dip.bundestag.de/
许可信息
来源:Deutscher Bundestag/Bundesrat – DIP / "Bundestags-Drucksache" 更多详情请参见: https://dip.bundestag.de/documents/nutzungsbedingungen_dip.pdf



