A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6626315
Description:
The dataset includes the following files:
1. tell_all_cleaned.csv: This is the file of the main dataset that includes 1,280,918 speech fragments of Greek parliament members in the order of the conversation that took place, exported from 5,355 parliamentary sitting record files, with a total volume of 2.12 GB. The speeches extend chronologically from July 1989 up to July 2020 and include the following information:
member_name: the name of the individual who spoke during a sitting.
sitting_date: the date the sitting took place.
parliamentary_period: the name and/or number of the parliamentary period that the speech took place in. A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions.
parliamentary_session: the name and/or number of the parliamentary session that the speech took place in. A session is defined as a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution. A session can fall into the following categories: regular, extraordinary or special. In the intervals between the sessions the parliament is in recess. A parliamentary session includes multiple parliamentary sittings.
parliamentary_sitting: the name and/or number of the parliamentary sitting that the speech took place in. A sitting is defined as a meeting of parliament members.
political_party: the political party of the speaker.
government: the government in force when the speech took place.
member_region: the electoral district the speaker belonged to.
roles: information about the parliamentary roles and/or government position of the speaker.
member_gender: the gender of the speaker.
speech: the speech that the individual gave during the parliamentary sitting.
2. wiki_data: A folder of modern Greek female and male names and surnames and their available grammatical cases crawled from the entries of the Wiktionary Greek names category (https://en.wiktionary.org/wiki/Category:Greek_names). We produced the grammatical cases of the missing grammatical entries according to the rules of the Greek grammar and saved the files in the same folder by adding to their filenames the string "_populated.json".
3. parl_members_activity_1989onwards_with_gender.csv: The Greek Parliament website provides a list of all the elected members of parliament since the fall of the military junta in Greece, in 1974. We collected and cleaned the data, added the gender and kept the elected members from 1989 onwards, matching the available parliament proceeding records. This dataset includes the full names of the members, the date range of their service, the political party they served, the electoral district they belonged to and their gender.
4. formatted_roles_gov_members_data.csv: By government members we mean individuals in ministerial or other government posts, regardless of whether they were elected to parliament. This information is available on the website of the Secretariat General for Legal and Parliamentary Affairs. The government members dataset includes the full names of the officials, the name of the role they were given, the date range of their service in each specific role and their gender.
5. governments_1989onwards.csv: A dataset of government information including the names of governments since 1989, their start and end dates, and a URL that points to the respective official government web page of each past government. The data is crawled from the website of the Secretariat General for Legal and Parliamentary Affairs.
6. extra_roles_manually_collected.csv: A dataset with manually collected information from Wikipedia about additional government or parliament posts such as Chairman of the Parliament, party leaders, opposition leaders and other information.
7. all_members_activity.csv: A dataset that merges all the information of the aforementioned files 3, 4, 5 and 6. Each row of the file includes the full name of the individual, the start and end date of their term of office, the political party and electoral district they belonged to, their gender, the parliamentary and/or government positions that they held along with start and end dates, and the name of the government that was in power during their term of office. An individual can change political parties or become an independent member of the parliament during a parliamentary period, and can therefore have more than one entry/row in the file.
8. freqs_for_semantic_shift_cleaned_data_decade1990.csv & freqs_for_semantic_shift_cleaned_data_decade2010.csv: Files of frequencies of words in the corpora of the decades 1990-1999 and 2010-2019.
9. compass_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Compass tool by Di Carlo et al. [1].
10. compass_fc_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Compass tool [1] in combination with the frequency cut-offs of the Gonen et al. approach [3]. For the frequency cut-offs, the files in bullet 8 are used.
11. procrustes_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Orthogonal Procrustes approach of Hamilton et al. [2].
12. nn_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Gonen et al. approach [3].
13. second_order_top100.csv: Top 100 most changed words between the decades 1990-1999 and 2010-2019, as computed with the use of the Second-Order Similarity approach by Hamilton et al. [4].
14. top100_minfreq50.xls: An .xls file for convenient viewing of the top 100 most changed words per approach with a minimum frequency of 50 occurrences, produced by merging the aforementioned files 9, 10, 11, 12 and 13.
15. freqs_for_semantic_shift_cleaned_data_period1997_2007.csv & freqs_for_semantic_shift_cleaned_data_period2008_2018.csv: Files of frequencies of words in the corpora of the decades before (1997_2007) and during (2008_2018) the Greek economic crisis.
16. semantic_shifts_dichotomy_crisis_compass_1997_2007_2008_2018_atleast50.csv: A file with the top 100 most changed words between the decades before (1997-2007) and during (2008-2018) the Greek economic crisis. The computations are implemented with the use of the Compass tool.
17. selected_topics_shift_per_period_compass.csv: The usage change of selected topics/words of generic political interest between pairs of consecutive parliamentary periods. The computations are implemented with the use of the Compass tool.
18. semantic_shifts_party_embeddings_per_period_merged_compass.csv: The usage change of selected political party names that have played an important role in recent political history, namely New Democracy (ND), the Panhellenic Socialist Movement (PASOK), the Coalition of the Radical Left - Progressive Alliance (SYRIZA), the Communist Party of Greece (KKE), the Coalition of the Left, of Movements and Ecology (SYN) and Golden Dawn (GD).
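As an illustration of working with the main file, the following is a minimal Python sketch, not part of the dataset's own tooling, that streams tell_all_cleaned.csv row by row instead of loading all 2.12 GB into memory, and counts speech fragments per value of one column. The column names come from the description above; the comma delimiter, the UTF-8 encoding, the helper names read_rows and count_fragments, and the sample rows are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Speeches can be long; raise the csv module's default per-field limit
# (131072 characters) so large "speech" cells do not raise csv.Error.
csv.field_size_limit(2**31 - 1)


def read_rows(path, encoding="utf-8"):
    """Yield one dict per speech fragment without loading the whole file.

    Assumes a comma-delimited file with a header row (not confirmed by
    the dataset description).
    """
    with open(path, newline="", encoding=encoding) as handle:
        yield from csv.DictReader(handle)


def count_fragments(rows, column="political_party"):
    """Count rows per distinct value of `column` (a documented field name)."""
    counts = Counter()
    for row in rows:
        counts[row[column]] += 1
    return counts


# Synthetic rows for demonstration only; the values are placeholders,
# not real records from the dataset.
SAMPLE_CSV = (
    "member_name,sitting_date,political_party,member_gender,speech\n"
    "speaker_a,date_1,party_x,female,first placeholder speech\n"
    "speaker_b,date_1,party_y,male,second placeholder speech\n"
    "speaker_a,date_2,party_x,female,third placeholder speech\n"
)

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(count_fragments(sample_rows))
print(count_fragments(sample_rows, column="member_gender"))
```

On the real file, count_fragments(read_rows("tell_all_cleaned.csv")) would run the same count in a single pass, provided the delimiter and encoding assumptions hold.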
-------------
Citations:
[1] Valerio Di Carlo, Federico Bianchi, and Matteo Palmonari. Training Temporal Word Embeddings with a Compass. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI’19, pages 6326–6334, 2019. doi: 10.1609/aaai.v33i01.33016326.
[2] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2016, pages 1489–1501, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1141. URL https://www.aclweb.org/anthology/P16-1141.
[3] Hila Gonen, Ganesh Jawahar, Djamé Seddah, and Yoav Goldberg. Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 538–555, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.51. URL https://aclanthology.org/2020.acl-main.51.
[4] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 2116–2121, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1229. URL https://www.aclweb.org/anthology/D16-1229.
-------------
Acknowledgments:
This work was supported by the European Union’s Horizon 2020 research and innovation program "FASTEN" under grant agreement No 825328 and the non-profit data journalism organization iMEdD.org.
Created:
2022-08-27



