wirthual/dip-bundestag
Hugging Face | Updated: 2023-08-03 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/wirthual/dip-bundestag
Resource description:
---
language:
- de
tags:
- government
- bundestag
size_categories:
- 10K<n<100K
---
# Dataset Card for dip-bundestag
## Dataset Description
- **Homepage:** https://dip.bundestag.de/
### Dataset Summary
This dataset contains text extracted from the PDF documents published by the DIP service of the German Bundestag.
The current version covers documents dated within the following range:
START_DATE = "2015-05-07"
END_DATE = "2023-07-09"
[Figure: distribution over the document types]

### Languages
German
## Dataset Structure
Each row of the dataset consists of the following fields: `doc_id` and `text`.
The document ID can be used to retrieve the metadata for the underlying PDF file by sending a request to the following endpoint:
https://search.dip.bundestag.de/api/v1/swagger-ui/#/Drucksachen/getDrucksache
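A minimal sketch of such a metadata lookup, using only the Python standard library. The link above points to the Swagger UI page rather than to the endpoint itself, so the base URL, the `/drucksache/{id}` path and the `apikey` query parameter below are assumptions inferred from that page and should be checked against the API documentation. The helper names `drucksache_url` and `fetch_metadata` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: base URL inferred from the Swagger UI address above.
API_BASE = "https://search.dip.bundestag.de/api/v1"


def drucksache_url(doc_id, api_key):
    """Build the (assumed) metadata URL for a single Drucksache."""
    # Assumption: the API key is passed as the `apikey` query parameter.
    query = urlencode({"apikey": api_key})
    return f"{API_BASE}/drucksache/{doc_id}?{query}"


def fetch_metadata(doc_id, api_key, timeout=30):
    """Request the metadata for one document and parse the JSON response."""
    request = Request(
        drucksache_url(doc_id, api_key),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_metadata(row["doc_id"], api_key)` for a dataset row would then return the parsed metadata, provided a valid API key is available and the assumed path is correct.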
### Data Fields
- `index`
- `doc_id`
- `text`
### Data Splits
No split
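Since no splits are defined, the dataset can be loaded as a single table. The sketch below assumes the third-party `datasets` library is installed and that it exposes unsplit data under the default split name `train`; both should be verified before relying on them. The helpers `load_rows` and `summarize` are hypothetical names used for illustration only.

```python
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "wirthual/dip-bundestag"


def load_rows(repo_id: str = REPO_ID):
    """Load the full dataset (needs `pip install datasets` and network access)."""
    # Imported lazily so that the rest of the module works without the dependency.
    from datasets import load_dataset

    # Assumption: with no declared splits, all rows are exposed as "train".
    return load_dataset(repo_id, split="train")


def summarize(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield the document ID and text length for each row."""
    for row in rows:
        yield {"doc_id": row["doc_id"], "n_chars": len(row["text"])}
```

`summarize` accepts any iterable of row dictionaries, so it can be tried on a few hand-written rows before downloading the dataset with `load_rows()`.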
## Dataset Creation
Each PDF was downloaded, its text was extracted, and the extracted text was [dehyphenated](https://github.com/pd3f/dehyphen).
To avoid header and footer text, only text within the borders of the following rectangle was extracted:
<img src="https://huggingface.co/datasets/wirthual/dip-bundestag/resolve/main/rect_for_extraction.png" alt="Rectangle of extraction" style="width:400px;"/>
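The two processing ideas above, clipping to a rectangle and rejoining hyphenated words, can be sketched in plain Python. This is not the author's pipeline: the rectangle coordinates are hypothetical (the real ones are only given in the image), the `Word` type stands in for whatever a PDF library returns, and the regex-based `naive_dehyphenate` is a simplified substitute for the `dehyphen` library.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Word(NamedTuple):
    """A word with its bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


# Hypothetical clip rectangle (x0, y0, x1, y1); not the values used for the dataset.
CLIP: Tuple[float, float, float, float] = (50.0, 70.0, 545.0, 780.0)


def clip_words(words: Iterable[Word], clip=CLIP) -> List[Word]:
    """Keep only words whose bounding box lies fully inside the rectangle."""
    cx0, cy0, cx1, cy1 = clip
    return [
        w for w in words
        if w.x0 >= cx0 and w.y0 >= cy0 and w.x1 <= cx1 and w.y1 <= cy1
    ]


def naive_dehyphenate(lines: Iterable[str]) -> str:
    """Join lines, merging words split by a hyphen at a line end.

    Only merges when lowercase letters surround the break, so a
    compound such as "CDU-" followed by a new line is left untouched.
    """
    text = "\n".join(lines)
    return re.sub(r"(?<=[a-zäöüß])-\n(?=[a-zäöüß])", "", text)
```

A rule like this wrongly merges German constructions such as "Ein-" followed by "und Ausgang" on the next line, which is one reason to prefer a dedicated tool like `dehyphen` over a regex.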
## Dataset Curation
At this point, no complex curation of the dataset was performed.
#### Who are the source language producers?
https://dip.bundestag.de/
### Licensing Information
Quelle: Deutscher Bundestag/Bundesrat – DIP / "Bundestags-Drucksache"
For further details, see:
https://dip.bundestag.de/documents/nutzungsbedingungen_dip.pdf
Provider:
wirthual
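Summary of Original Information
Dataset Overview
Dataset Description
- Language: German
- Tags: government, bundestag
- Size category: 10K<n<100K
Dataset Summary
This dataset contains text extracted from PDF documents of the DIP service. The documents are dated between 2015-05-07 and 2023-07-09. [Figure: distribution over the document types]

Dataset Structure
- Each row contains the fields: doc_id, text
- Use of the document ID: it can be used to retrieve the metadata of the PDF file from the following endpoint: https://search.dip.bundestag.de/api/v1/swagger-ui/#/Drucksachen/getDrucksache
Data Fields
- index
- doc_id
- text
Data Splits
No split
Dataset Creation
- Processing steps: download PDF, extract text, dehyphenate text
- Text extraction region: the rectangle shown in the following image, chosen to avoid header and footer text: <img src="https://huggingface.co/datasets/wirthual/dip-bundestag/resolve/main/rect_for_extraction.png" alt="Rectangle of extraction" style="width:400px;"/>
Dataset Curation
At this point, no complex curation of the dataset was performed.
Source Language Producers
https://dip.bundestag.de/
Licensing Information
- Source: Deutscher Bundestag/Bundesrat – DIP / "Bundestags-Drucksache"
- Details: see https://dip.bundestag.de/documents/nutzungsbedingungen_dip.pdf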



