UNESCO's Proceedings, 1945-2017: A Bilingual Digital Text Corpus

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14786688
Description:
The minutes of the meetings of the General Conference of UNESCO offer a rich resource for research on global themes in the humanities. UNESCO has published the minutes of these meetings (the “verbatim record”) since 1947 in a series called Records of the General Conference: Proceedings. UNESCO makes a portion of the Proceedings volumes available online in PDF form via the UNESDOC digital library. These files make it possible for users to read selected volumes, but they do not allow for full-text searching, much less any more sophisticated computational text analysis methods.
This corpus assembles the texts of the “verbatim record” section from all issues of Proceedings from 1947 to 2017, in English and/or French, generating a text corpus that is machine-readable, accessible, and reusable for digital text analysis. Proceedings was published in parallel English and French editions from 1947 to 1962. Since then, it has appeared in a single multilingual volume including interventions in UNESCO’s six official languages, four of which (Arabic, Chinese, Russian and Spanish) are translated into either English or French. We deploy a language-recognition algorithm to isolate the text sections in English and French, thus creating a single bilingual corpus of circa 21 million words that includes all interventions made at these meetings.
Our Proceedings package on GitHub includes: (1) the corpus, in both English and French; (2) code written to curate the corpus; (3) metadata files identifying each session and meeting; and (4) supplementary materials, such as documentation and quality control files. Our goal in creating this package has been to make this valuable source accessible for new forms of digital research. This corpus is, naturally, a preliminary version. Much work can still be done to fine-tune the language recognition and improve the quality of the corpus as a whole.
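The description above mentions a language-recognition step that isolates the English and French sections of the multilingual volumes. The project's actual algorithm is not specified here; the following is only a minimal sketch of the general idea, assuming a simple stopword-counting heuristic. The stopword lists and the function names `guess_language` and `keep_english_and_french` are hypothetical and do not come from the project's code.

```python
import re

# Hypothetical, deliberately tiny stopword lists (an assumption for illustration).
# A real pipeline would more likely use a trained language identifier.
EN_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "for", "with", "this"}
FR_STOPWORDS = {"le", "la", "les", "et", "des", "de", "du", "est", "que", "pour", "dans", "une"}


def guess_language(segment: str) -> str:
    """Return 'en', 'fr', or 'other' by comparing stopword counts."""
    # Split into runs of Unicode letters so accented French words stay whole.
    tokens = re.findall(r"[^\W\d_]+", segment.lower())
    en_hits = sum(token in EN_STOPWORDS for token in tokens)
    fr_hits = sum(token in FR_STOPWORDS for token in tokens)
    if en_hits == fr_hits:
        # No evidence, or a tie: do not assign the segment to either language.
        return "other"
    return "en" if en_hits > fr_hits else "fr"


def keep_english_and_french(segments):
    """Keep only segments labelled 'en' or 'fr', paired with their label."""
    labelled = ((guess_language(segment), segment) for segment in segments)
    return [(lang, segment) for lang, segment in labelled if lang in {"en", "fr"}]


if __name__ == "__main__":
    sample = [
        "The delegate of France spoke in favour of the resolution.",
        "Le délégué de la France a pris la parole pour appuyer la résolution.",
        "El delegado habló sobre el proyecto.",
    ]
    for lang, text in keep_english_and_french(sample):
        print(lang, text)
```

A heuristic this small will mislabel short segments and confuse related languages (Spanish shares words such as "de" and "la" with French), which is consistent with the note above that the language recognition can still be fine-tuned.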
The text of Proceedings is available in Open Access under the Attribution-ShareAlike 3.0 IGO (CC-BY-SA 3.0 IGO) license, in the context of UNESCO's open access publications policy. Our corpus is published under the most recent version of the same license: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). This corpus and related materials were developed as part of the research project "International Ideas at UNESCO: Digital Approaches to Global Conceptual History" (INIDUN), led by Benjamin G. Martin at Uppsala University and funded by a grant from the Swedish Research Council (Vetenskapsrådet dnr. 2019-03278), 2020-2024. For more information, see inidun.github.io, as well as the project repository on GitHub, which includes documentation and files related to the curating process.
Created:
2025-02-01