Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

SSH Open MarketPlace | Updated 2023-10-17 | Indexed 2024-08-03

Download link:

https://marketplace.sshopencloud.eu/dataset/E7bBq5

Description:

This historical parliamentary corpus contains the meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions. The source data (scanned images of the printed Stenographic Minutes) come from the [History of Slovenia - SIstory](https://www.sistory.si) portal. The images were processed with OCR and the results saved as PDF, DOCX and TXT files. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset either in the Cyrillic (Serbian) or in the Latin (Croatian) alphabet. The documents were processed automatically to extract the following data: titles, agenda, attendees, start and end of the session, speakers, and comments. [Lingua](https://github.com/pemistahl/lingua-py) was used for sentence-level language detection. Roughly 59% of the sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using [CLASSLA](https://github.com/clarinsi/classla) for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script. The corpus is available for download from the CLARIN.SI repository and for online browsing through the noSketch Engine and KonText concordancers.
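
The Cyrillic/Latin distinction described above can be illustrated with a small standard-library sketch. This is not the corpus authors' pipeline (they used Lingua for language detection); it only shows how a sentence's dominant script could be determined from Unicode character names. The function name `dominant_script` and the sample sentences are hypothetical, made up for illustration. Note that script alone cannot separate Croatian from Slovenian, since both use the Latin alphabet.

```python
import unicodedata
from collections import Counter


def dominant_script(sentence: str) -> str:
    """Return 'Cyrillic', 'Latin', or 'Unknown' based on the majority of letters.

    Hypothetical helper for illustration; not part of the yu1Parl tooling.
    """
    counts = Counter()
    for ch in sentence:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


# Made-up sample sentences, not taken from the corpus.
samples = [
    "Господо народни посланици, отварам седницу.",
    "Gospodo narodni zastupnici, otvaram sjednicu.",
    "1919",
]
for s in samples:
    print(dominant_script(s), "|", s)
```

A script check like this could serve only as a coarse first pass; telling Croatian from Slovenian, or spotting German and French sentences, requires an actual language identifier.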

Created:

2023-10-17