Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0
Source: SSH Open MarketPlace. Updated 2025-07-04; indexed 2025-07-05.
Download link:
https://marketplace.sshopencloud.eu/dataset/04BIwz
Description:
This historical parliamentary corpus contains the meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions.
The source data (scanned images of printed Stenographic Minutes) come from the [History of Slovenia - SIstory](https://www.sistory.si) portal. The images were processed with OCR and the results saved in PDF, DOCX and TXT formats. The documents are multilingual: each speech is in Serbo-Croatian or Slovenian, depending on the speaker. Serbo-Croatian is typeset either in the Cyrillic alphabet (Serbian) or in the Latin alphabet (Croatian).
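Because the two Serbo-Croatian variants are distinguished here by alphabet, a quick script check can help when inspecting the TXT output. The following is a minimal sketch using only the Python standard library; the function name `dominant_script` is hypothetical and is not part of the corpus tooling, which is not described at this level of detail.

```python
import unicodedata


def dominant_script(sentence: str) -> str:
    """Return 'cyrillic', 'latin', or 'unknown' by counting letters per script."""
    counts = {"cyrillic": 0, "latin": 0}
    for ch in sentence:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        # Unicode character names start with the script name,
        # e.g. "CYRILLIC SMALL LETTER A" or "LATIN SMALL LETTER S WITH CARON".
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if counts["cyrillic"] == counts["latin"]:
        return "unknown"  # tie, or no letters at all
    return max(counts, key=counts.get)


if __name__ == "__main__":
    print(dominant_script("Народна скупштина"))
    print(dominant_script("Narodna skupština"))
```

This only separates scripts. It cannot tell Croatian from Slovenian, since both use the Latin alphabet, so it is no substitute for a real language detector.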
The documents were automatically processed and the following data extracted: titles, agenda, attendees, start and end of the session, speakers, and comments. [Lingua](https://github.com/pemistahl/lingua-py) was used for language detection at the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using [CLASSLA](https://github.com/clarinsi/classla) for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script.
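Since Cyrillic word forms are paired with Latin-script lemmas, matching the two in a query may require romanising the word form first. The sketch below applies the standard one-to-one mapping between the Serbian Cyrillic and Latin alphabets. It is an illustration only: the description does not state how the Latin lemmas were produced, and `CYR_TO_LAT` and `to_latin` are hypothetical names.

```python
# Standard Serbian Cyrillic to Latin correspondences (30 letters, lowercase).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Romanise Serbian Cyrillic text; other characters pass through unchanged."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)
            continue
        lat = CYR_TO_LAT[lower]
        if ch != lower:
            # Uppercase input: capitalise only the first Latin letter,
            # so "Љ" becomes "Lj" as in ordinary title-case words.
            lat = lat.capitalize()
        out.append(lat)
    return "".join(out)


if __name__ == "__main__":
    print(to_latin("Краљевина Југославија"))
```

One known limitation of this sketch: in all-caps words the digraphs come out as "Lj", "Nj" and "Dž" rather than "LJ", "NJ" and "DŽ".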
The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.
Created:
2025-07-04



