
Multilingual comparable corpora of parliamentary debates ParlaMint 3.0

Indexed 2025-01-09 from hdl.handle.net
Download link:
http://hdl.handle.net/11356/1486
Resource description:
ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora being between 9 and 125 million words in size. The corpora have extensive metadata, including aspects of the parliament and of the speakers (name, gender, MP status, party affiliation, party coalition/opposition); they are structured into time-stamped terms, sessions and meetings; and speeches are marked with the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are also marked for the subcorpus they belong to ("reference", until 2020-01-30; "covid", from 2020-01-31; and "war", from 2022-02-24). The corpora follow the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been encoded against the compatible but much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in this distribution). This entry contains the ParlaMint TEI-encoded corpora with the derived plain text versions of the corpora along with TSV metadata of the speeches. Also included is the 3.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists a linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1488.
Compared with the previous version 2.1, this version extends the corpus dates to (at least) mid-2022, does not contain the corpora for ES (Spain) and LT (Lithuania), and adds corpora for AT (Austria), BA (Bosnia and Herzegovina), ES-CT (Catalonia), ES-GA (Galicia), GR (Greece), NO (Norway), PT (Portugal), RS (Serbia), SE (Sweden), and UA (Ukraine). The TEI encoding of some details has also changed.
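
The subcorpus boundaries quoted in the description can be illustrated with a small date check. This is only a sketch: the function name `subcorpora_for` is hypothetical and not part of the ParlaMint distribution, and it assumes that a date on or after 2022-02-24 falls under both "covid" and "war", since the description gives no end date for "covid".

```python
from datetime import date

# Boundary dates quoted in the description.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora_for(day: date) -> list[str]:
    """Return the subcorpus labels whose stated date range covers `day`.

    Assumption: "covid" has no stated end date, so dates from
    2022-02-24 onward are labelled both "covid" and "war".
    """
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels
```

For the authoritative assignment, rely on the subcorpus marking shipped in the corpora themselves rather than on a date rule like this one.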

Provider:
hdl.handle.net