Ukrainian parliamentary corpus ParlaMint-UA 4.0.1
Source: hdl.handle.net (indexed 2025-01-16)
Download link:
http://hdl.handle.net/11356/1900
Description:
The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 is an extended version of the ParlaMint-UA 4.0 corpus, which is available as a collection of plain texts with TSV metadata of the speeches (http://hdl.handle.net/11356/1859) and as a collection of speeches with added automatic linguistic annotations (http://hdl.handle.net/11356/1860). Both are part of the “ParlaMint: Towards Comparable Parliamentary Corpora” project by CLARIN ERIC (https://www.clarin.eu/parlamint).
The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 contains plenary proceedings for the 4th, 5th, 6th, 7th, 8th and 9th terms of the Rada between 14 May 2002 and 10 November 2023. Tokens in Ukrainian comprise 94% and tokens in Russian comprise 6%.
The transcripts are grouped by date, with information on the term, session and meeting, and contain speeches marked with the speaker and their role (chair, regular speaker or guest). The speeches also contain marked-up transcriber comments, such as noise, applause and shouting. The corpus has extensive metadata on speakers: their name, year of birth (when available in open sources), gender, MP and minister status, and party affiliation (when known from open sources). It also has metadata on political parties, parliamentary factions and groups: their name, left-to-right political orientation (sourced from Wikipedia, or manually encoded when absent from Wikipedia) and coalition/opposition status.
The corpus is encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), as well as following the much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas.
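As a rough illustration of how speeches marked with speaker and role might be read from a TEI-encoded transcript, here is a minimal Python sketch. The XML fragment is a hypothetical, heavily simplified sample written for this example and is not taken from the corpus. The element and attribute choices (`u` with `who` and `ana`, `seg`, `kinesic`) are assumptions about how the ParlaMint guidelines represent utterances and transcriber comments, and should be checked against the guidelines and schemas linked above.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI-based encodings.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified fragment; speaker IDs and texts are made up.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="debateSection">
        <u who="#SpeakerA" ana="#chair">
          <seg>Шановні колеги, прошу уваги.</seg>
        </u>
        <u who="#SpeakerB" ana="#regular">
          <seg>Дякую. <kinesic type="applause"><desc>Оплески</desc></kinesic></seg>
          <seg>Прошу підтримати.</seg>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def seg_text(seg):
    """Return the spoken text of a segment, skipping transcriber comments."""
    parts = [seg.text or ""]
    for child in seg:
        # A child element is treated as a comment; only its tail is speech.
        parts.append(child.tail or "")
    # Collapse runs of whitespace left behind by the removed comments.
    return " ".join("".join(parts).split())


def extract_speeches(xml_string):
    """Collect speaker, role and text for every utterance in the document."""
    root = ET.fromstring(xml_string)
    speeches = []
    for u in root.iterfind(".//tei:u", NS):
        segs = [seg_text(s) for s in u.iterfind("tei:seg", NS)]
        speeches.append({
            "speaker": u.get("who", "").lstrip("#"),
            # Assumes a single role reference in the attribute.
            "role": u.get("ana", "").lstrip("#"),
            "text": " ".join(t for t in segs if t),
        })
    return speeches


for speech in extract_speeches(SAMPLE):
    print(speech["speaker"], speech["role"], speech["text"], sep=" | ")
```

The sketch drops every child element of a segment, which is a deliberate simplification: real files may carry several kinds of comment elements and more than one reference in the role attribute, so production code would need to handle those cases explicitly.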
The corpus comes in two versions. One version contains plain texts of plenary speeches. The other version contains the same plenary speeches with linguistic annotation: tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech tags, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities.
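The annotation layers listed above map naturally onto the ten-column CoNLL-U format defined by Universal Dependencies. The description here does not state which file formats the annotated version ships in, so the following Python sketch simply assumes a CoNLL-U export is at hand. The sample sentence and its analysis are made up for illustration and are not corpus data.

```python
# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hypothetical sample sentence; the analysis is illustrative only.
SAMPLE_CONLLU = """\
# text = Дякую за увагу.
1\tДякую\tдякувати\tVERB\t_\tMood=Ind|Number=Sing|Person=1|Tense=Pres\t0\troot\t_\t_
2\tза\tза\tADP\t_\tCase=Acc\t3\tcase\t_\t_
3\tувагу\tувага\tNOUN\t_\tCase=Acc|Gender=Fem|Number=Sing\t1\tobl\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_
"""


def parse_feats(feats):
    """Turn 'Case=Acc|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as '# text = ...'
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # ignore malformed lines in this sketch
        token = dict(zip(FIELDS, cols))
        if not token["id"].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        token["feats"] = parse_feats(token["feats"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


for token in parse_conllu(SAMPLE_CONLLU)[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

For serious work a maintained CoNLL-U library is a safer choice than a hand-written parser; the sketch only shows how the tokens, lemmas, part-of-speech tags, features and dependencies line up in the columns.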
Compared to ParlaMint-UA 4.0, ParlaMint-UA 4.0.1 doubles the time span: it now includes older data from 2002 to 2012 and more recent data from September to November 2023. Language identification between Ukrainian and Russian has been refined from the paragraph level to the sentence level, to advance research on code-switching in public discourse. The errors found in ParlaMint 4.0 have also been corrected.
Provider:
hdl.handle.net



