
wirthual/dip-bundestag

Hugging Face, updated 2023-08-03, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/wirthual/dip-bundestag
Resource description:
---
language:
- de
tags:
- government
- bundestag
size_categories:
- 10K<n<100K
---

# Dataset Card for dip-bundestag

## Dataset Description

- **Homepage:** Source: https://dip.bundestag.de/

### Dataset Summary

Text extracted from the PDF documents of the DIP service.

In the current version, the documents fall between the following dates:

- `START_DATE = "2015-05-07"`
- `END_DATE = "2023-07-09"`

Distribution over the document types:

![Document Type Distribution](https://huggingface.co/datasets/wirthual/dip-bundestag/resolve/main/distribution.png)

### Languages

German

## Dataset Structure

Each row of the dataset consists of the following fields: `doc_id` and `text`.

The document ID can be used to retrieve the metadata for the underlying PDF file by sending a request to the following endpoint: https://search.dip.bundestag.de/api/v1/swagger-ui/#/Drucksachen/getDrucksache

### Data Fields

- `index`
- `doc_id`
- `text`

### Data Splits

No split

## Dataset Creation

Download the PDF, extract the text, then [dehyphenate](https://github.com/pd3f/dehyphen) the text.

Text was extracted within the borders of the following rectangle to avoid header and footer text:

<img src="https://huggingface.co/datasets/wirthual/dip-bundestag/resolve/main/rect_for_extraction.png" alt="Rectangle of extraction" style="width:400px;"/>

## Dataset Curation

At this point, no complex curation of the dataset has been performed.

#### Who are the source language producers?

https://dip.bundestag.de/

### Licensing Information

Quelle: Deutscher Bundestag/Bundesrat – DIP / "Bundestags-Drucksache"

For further detail see: https://dip.bundestag.de/documents/nutzungsbedingungen_dip.pdf
Provider:
wirthual
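Summary of original information

Dataset overview

Dataset description

  • Language: German
  • Tags: government, bundestag
  • Size category: 10K<n<100K

Dataset summary

This dataset contains text extracted from PDF documents provided by the DIP service. The documents are dated between 2015-05-07 and 2023-07-09. The distribution over document types is shown in the "Document Type Distribution" figure of the dataset card.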

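Dataset structure

  • Fields in each row: doc_id, text
  • Purpose of the document ID: it can be used to retrieve the metadata of the underlying PDF file from the following endpoint: https://search.dip.bundestag.de/api/v1/swagger-ui/#/Drucksachen/getDrucksache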
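
The metadata lookup by document ID could look roughly like the sketch below. It is not an official client: the resource path `/drucksache/{id}` and the `apikey` query parameter are assumptions inferred from the `getDrucksache` operation named in the Swagger UI link, and `build_drucksache_url`, `fetch_metadata`, and the placeholder key are hypothetical names used only for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: API base derived from the Swagger UI URL given above.
API_BASE = "https://search.dip.bundestag.de/api/v1"


def build_drucksache_url(doc_id, api_key):
    """Build the metadata URL for one document (path and parameter name are assumed)."""
    query = urllib.parse.urlencode({"apikey": api_key})
    return f"{API_BASE}/drucksache/{doc_id}?{query}"


def fetch_metadata(doc_id, api_key, timeout=30):
    """Send the request and decode the JSON response body."""
    url = build_drucksache_url(doc_id, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_metadata(row["doc_id"], "YOUR_API_KEY")` for a dataset row would then return the decoded response, provided the assumed path and key handling match the live API.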

Data fields

  • index
  • doc_id
  • text

Data splits

No split
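
Because the dataset ships without predefined splits, users who need a held-out set have to create one themselves. The following is a minimal sketch of one possible approach, a deterministic split keyed on `doc_id`; the helper names `assign_split` and `split_rows` and the default `test_fraction` are hypothetical and not part of the dataset.

```python
import hashlib


def assign_split(doc_id, test_fraction=0.1):
    """Map a doc_id to "train" or "test" using a stable hash of the ID."""
    digest = hashlib.sha256(str(doc_id).encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"


def split_rows(rows, test_fraction=0.1):
    """Partition rows (dicts with doc_id and text) into train and test lists."""
    splits = {"train": [], "test": []}
    for row in rows:
        splits[assign_split(row["doc_id"], test_fraction)].append(row)
    return splits
```

The rows can be any iterable of dictionaries, for example those yielded after loading the dataset with the Hugging Face `datasets` library. Hashing the ID rather than sampling randomly keeps each document in the same split across runs and dataset versions.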

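Dataset creation

  • Processing steps: download the PDF, extract the text, dehyphenate the text
  • Text extraction region: text was extracted within the rectangle shown below to avoid header and footer text: <img src="https://huggingface.co/datasets/wirthual/dip-bundestag/resolve/main/rect_for_extraction.png" alt="Rectangle of extraction" style="width:400px;"/>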
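
The cropping and dehyphenation steps can be illustrated with a small sketch. Several things here are assumptions: the rectangle coordinates are placeholders (the real values appear only in the image), words are assumed to arrive as `(x0, y0, x1, y1, text)` tuples from some PDF extraction library, and the regex-based `naive_dehyphenate` is a simplified stand-in, not the method of the dehyphen library used to build the dataset.

```python
import re

# Hypothetical crop rectangle (x0, y0, x1, y1) in PDF points; the actual
# values are only shown in the image on the dataset card.
CROP_RECT = (50.0, 70.0, 545.0, 780.0)


def inside(rect, box):
    """Return True if the bounding box lies fully within the rectangle."""
    rx0, ry0, rx1, ry1 = rect
    x0, y0, x1, y1 = box
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def crop_words(words, rect=CROP_RECT):
    """Keep the text of words whose boxes fall inside the rectangle."""
    return [
        text
        for (x0, y0, x1, y1, text) in words
        if inside(rect, (x0, y0, x1, y1))
    ]


def naive_dehyphenate(text):
    """Join words that were split by a hyphen at a line break.

    Simplification: only joins when the continuation starts with a
    lowercase letter, so compounds such as "Nord-" / "Süd" stay untouched.
    """
    return re.sub(r"(\w)-\n([a-zäöüß])", r"\1\2", text)
```

Words whose boxes fall in the header or footer band are dropped by `crop_words`, and the remaining lines can be joined with newlines before being passed to `naive_dehyphenate`.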

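Dataset curation

At this point, no complex curation of the dataset has been performed.

Source language producers

https://dip.bundestag.de/

Licensing information

  • Source: Deutscher Bundestag/Bundesrat – DIP / "Bundestags-Drucksache"
  • Details: see https://dip.bundestag.de/documents/nutzungsbedingungen_dip.pdf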