German Parliamentary Corpus (GERPARCOR)
Source: arXiv. Updated: 2022-04-22. Indexed: 2024-06-21.
Download link:
https://github.com/texttechnologylab/GerParCor
Description:
The German Parliamentary Corpus (GERPARCOR) is a genre-specific corpus created at Goethe University Frankfurt. It consists mainly of historical parliamentary records from German-speaking countries, spanning three centuries and four countries. In addition to scanned parliamentary records, the corpus specifically includes records printed in Fraktur typeface that were converted to text with Tesseract OCR. All records were preprocessed with the spaCy 3 NLP pipeline and automatically annotated with metadata such as the session date. GERPARCOR is suited to a range of NLP tasks in the field of political communication and is intended to meet the need for uniform access to, and analysis of, parliamentary text data from the German-speaking area.
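The automatic session-date annotation mentioned above can be illustrated with a minimal sketch. It is not the method used by the GERPARCOR authors: the helper name `extract_session_date`, the month table, and the regular expression are assumptions made for illustration, and the sketch only handles long-form German dates such as "22. April 1920" as they might appear in a record header.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical helper for illustration; not taken from the GerParCor code base.
GERMAN_MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
    "juli": 7, "august": 8, "september": 9, "oktober": 10,
    "november": 11, "dezember": 12,
}

# Matches long-form dates such as "22. April 1920" or "3. März 1871".
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.\s*([A-Za-zäÄ]+)\s+(\d{4})\b")


def extract_session_date(text: str) -> Optional[date]:
    """Return the first valid German long-form date found in text, or None."""
    for day, month_name, year in DATE_PATTERN.findall(text):
        month = GERMAN_MONTHS.get(month_name.lower())
        if month is None:
            continue  # the word after the day number is not a month name
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible calendar date, e.g. "31. Februar 1900"
    return None


if __name__ == "__main__":
    header = "12. Sitzung. Berlin, Donnerstag den 22. April 1920."
    print(extract_session_date(header))
```

OCR output from Fraktur scans is often noisy, so a real pipeline would likely need fuzzier matching (abbreviated months, misrecognized characters) than this exact-match sketch provides.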
Provider:
Goethe University Frankfurt
Created:
2022-04-22



