Knesset Corpus
Source: arXiv. Updated: 2024-05-28. Indexed: 2024-06-17.
Download link:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus
Description:
Knesset Corpus is a large-scale dataset created by the Department of Computer Science at the University of Haifa, Israel, together with other institutions. It contains over 30 million sentences (approximately 384 million tokens) from the proceedings of the Israeli parliament (the Knesset), covering all plenary and committee sessions held between 1998 and 2022. The text is accompanied by detailed metadata describing the social and political attributes of each speaker. To build the corpus, the data was extracted from the original Word and PDF files, organized, and then put through rigorous text processing and quality control. The dataset is intended mainly to support research in linguistics, political science, law, and communication studies, in particular analyses of how the style of political discussion has changed over time and how it differs by speaker gender.
Provider:
Department of Computer Science, University of Haifa, Israel
Created:
2024-05-28



