Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

SSH Open MarketPlace, updated 2025-04-02, indexed 2025-04-05

Download link:

https://marketplace.sshopencloud.eu/dataset/E7bBq5

Resource description:

This historical parliamentary corpus contains the meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions. The source data (scanned images of printed Stenographic Minutes) come from the [History of Slovenia - SIstory](https://www.sistory.si) portal. The images were processed with OCR and the results saved as PDF, DOCX and TXT files. The documents are multilingual, in Serbo-Croatian or Slovenian depending on the speaker. Serbo-Croatian is typeset either in the Cyrillic (Serbian) or in the Latin (Croatian) alphabet. The documents were processed automatically to extract the following data: titles, agenda, attendees, start and end of the session, speakers, and comments. [Lingua](https://github.com/pemistahl/lingua-py) was used for sentence-level language detection. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using [CLASSLA](https://github.com/clarinsi/classla) for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script. The corpus is available for download from the CLARIN.SI repository and for online browsing through the noSketch Engine and KonText concordancers.
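The description notes that Serbo-Croatian sentences appear in either Cyrillic or Latin script, and that Cyrillic word forms carry Latin-script lemmas. The following is a minimal, standard-library-only sketch of those two script-related ideas. It is not the corpus's actual pipeline, which relied on Lingua and CLASSLA; the names `dominant_script` and `to_latin` and the mapping table are illustrative assumptions, and the table follows the usual one-to-one correspondence between Serbian Cyrillic and the Latin alphabet.

```python
import unicodedata

# Assumed mapping: Serbian Cyrillic -> Latin, lowercase only.
# Uppercase letters are handled in to_latin() by capitalising the result.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def dominant_script(sentence: str) -> str:
    """Return 'cyrillic', 'latin', or 'unknown' by counting letters per script."""
    counts = {"CYRILLIC": 0, "LATIN": 0}
    for ch in sentence:
        if ch.isalpha():
            # Unicode character names start with the script name,
            # e.g. 'CYRILLIC SMALL LETTER A' or 'LATIN SMALL LETTER S WITH CARON'.
            name = unicodedata.name(ch, "")
            for script in counts:
                if name.startswith(script):
                    counts[script] += 1
    if counts["CYRILLIC"] == 0 and counts["LATIN"] == 0:
        return "unknown"
    return "cyrillic" if counts["CYRILLIC"] > counts["LATIN"] else "latin"


def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic letters to Latin; leave other characters as-is."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            latin = CYR_TO_LAT[lower]
            # Preserve capitalisation: 'Љ' becomes 'Lj', 'Ж' becomes 'Ž'.
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            out.append(ch)
    return "".join(out)


print(dominant_script("Народна скупштина"))  # cyrillic
print(to_latin("Народна скупштина"))         # Narodna skupština
```

Script counting only separates Cyrillic from Latin text. It cannot tell Croatian from Slovenian, since both use the Latin alphabet, which is one reason a statistical detector such as Lingua is needed for the sentence-level language labels.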

Created:

2025-04-02