
dataset_eastern_german_crisis_discourse_1976-86

Source: Mendeley Data. Updated: 2024-06-27. Indexed: 2024-06-27.
Download link:
https://zenodo.org/records/7690021
Description:
This dataset is suited to exploring and analysing the Eastern German crisis discourse from 1976 to 1986. It consists of important speeches contained in five volumes of the protocol of the party congress of the Socialist Unity Party of Germany (SED). The speeches were digitized by scanning them to PDF files and then converting those files to machine-readable text files with OCR software. The text files were then processed with TagAnt (v.2.0.4, Windows 10 64-bit), an annotation tool. According to the figures reported by AntConc (v.4.0.5, Windows 10 64-bit), the dataset contains four corpora: a main corpus with 184,750 tokens and 16,143 types, a 'corpus A' with 70,533 tokens and 11,964 types, a 'corpus B' with 65,757 tokens and 11,967 types, and a 'corpus C' with 48,460 tokens and 8,145 types.
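
The reported counts can be sanity-checked with a few lines of Python. The sketch below is only an illustration: it uses the token and type figures quoted in the description, checks whether the token counts of corpora A, B, and C add up to the main corpus, and computes a simple type-token ratio per corpus. The type-token ratio is not a measure mentioned by the dataset authors, and the names `CORPORA` and `type_token_ratio` are hypothetical.

```python
# Token and type counts exactly as quoted in the dataset description (AntConc output).
CORPORA = {
    "main": (184_750, 16_143),
    "A": (70_533, 11_964),
    "B": (65_757, 11_967),
    "C": (48_460, 8_145),
}


def type_token_ratio(tokens: int, types: int) -> float:
    """Return types divided by tokens (hypothetical helper, not part of the dataset)."""
    return types / tokens


# Sum of the token counts of the three sub-corpora.
sub_total = sum(CORPORA[name][0] for name in ("A", "B", "C"))

# Type-token ratio for each corpus, rounded for display.
ttr = {name: round(type_token_ratio(*counts), 4) for name, counts in CORPORA.items()}

print("A + B + C tokens equal main corpus tokens:", sub_total == CORPORA["main"][0])
for name, value in ttr.items():
    print(f"corpus {name}: type-token ratio = {value}")
```

The token counts of A, B, and C do sum to 184,750, which is consistent with the three sub-corpora partitioning the main corpus, although the description does not state this explicitly. Type counts are not additive across corpora, because the same word type can occur in more than one sub-corpus.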
Created:
2024-05-10