A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia (2024)

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/11061211

Resource description:

This is a collection of translated citation datasets extracted from the Multilingual Wikipedia February 2024 dumps. The same extraction and template harmonization pipeline was used as for English Wikipedia: https://zenodo.org/records/10782978. Note: Version 2 fixes an issue with the Italian and French datasets, which were corrupted (failed to upload in full) in the initial version.

In each language, Wikipedia authors can cite sources using language-specific or English templates. Our main effort in compiling these datasets was to assemble lists of citation templates for each language and convert the relevant fields into a common English template. We started with the known citation templates for each language (typically covering books, journals, web pages and news) and, in some cases, augmented these lists with additional frequently used templates (films, links, web archives, etc.), which we located using dictionaries of XML reference tags versus usage frequency. For the list of accepted templates, see our source code: https://github.com/albatros13/wikicite/tree/multilang (templates are listed in the __init__.py files of the wikiciteparser library).

A classification label ('news', 'book', 'journal' or 'other') is assigned to each citation by a deterministic rule-based classifier that analyses the available identifiers (see the code documentation for details).

Please note that the numbers reported for each language do not represent an overall estimate of the number of book and journal citations. We count only citations with DOI, PMID, PMC and ISBN identifiers assigned by authors (prior to the lookup process that augments citations with missing identifiers). The number of news citations depends on our list of 22,646 recognised news agency domains.
Language    Acronym  Link                                          Dump size  Citations  Books    Journals  News
German      de       https://dumps.wikimedia.org/dewiki/20240220/  6.7 GB     4,854,945  320,179  105,542   901,091
French      fr       https://dumps.wikimedia.org/frwiki/20240220/  5.9 GB     9,552,768  798,525  264,560   1,907,183
Russian     ru       https://dumps.wikimedia.org/ruwiki/20240220/  5.1 GB     7,437,100  420,828  130,470   1,370,665
Spanish     es       https://dumps.wikimedia.org/eswiki/20240220/  4.2 GB     6,918,442  522,910  213,767   1,699,396
Italian     it       https://dumps.wikimedia.org/itwiki/20240220/  3.6 GB     5,545,082  384,816  128,366   917,517
Polish      pl       https://dumps.wikimedia.org/plwiki/20240220/  2.4 GB     4,744,158  463,783  95,988    513,006
Portuguese  pt       https://dumps.wikimedia.org/ptwiki/20240220/  2.2 GB     4,775,025  243,593  142,216   1,176,140
Dutch       nl       https://dumps.wikimedia.org/nlwiki/20240220/  1.8 GB     566,549    27,074   12,706    114,110
Swedish     sv       https://dumps.wikimedia.org/svwiki/20240220/  1.5 GB     3,802,416  112,748  155,740   869,662
Catalan     ca       https://dumps.wikimedia.org/cawiki/20240220/  1.2 GB     2,239,714  261,779  105,125   423,241
Finnish     fi       https://dumps.wikimedia.org/fiwiki/20240220/  900.9 MB   1,697,731  209,556  12,068    286,420
Turkish     tr       https://dumps.wikimedia.org/trwiki/20240220   883.9 MB   1,993,177  85,079   56,202    339,122
Norwegian   no       https://dumps.wikimedia.org/nowiki/20240220   763.7 MB   796,500    43,314   12,373    151,780
Danish      da       https://dumps.wikimedia.org/dawiki/20240220   413.3 MB   437,239    23,303   7,522     70,760

These datasets can be augmented with identifiers located via the lookup process; the published files do not include them (there is no 'acquired_ID_list' field). If there is interest in augmented versions, see the source code for instructions or contact the authors for assistance with this task.

This research was supported in part by the University of Amsterdam Data Science Centre.
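
The rule-based classification described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the field names ("DOI", "PMID", "PMC", "ISBN", "URL"), the order in which the rules are checked, and the two-entry NEWS_DOMAINS set are all hypothetical. The actual rules and the full list of news domains are in the project's source code.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the much larger list of recognised news domains.
NEWS_DOMAINS = {"bbc.co.uk", "reuters.com"}


def classify_citation(citation: dict) -> str:
    """Return 'journal', 'book', 'news' or 'other' for one harmonized citation.

    The rule order (journal identifiers, then ISBN, then news domain) is an
    assumption made for this sketch.
    """
    # Any author-assigned DOI, PMID or PMC identifier marks a journal citation.
    if any(citation.get(key) for key in ("DOI", "PMID", "PMC")):
        return "journal"
    # An author-assigned ISBN marks a book citation.
    if citation.get("ISBN"):
        return "book"
    # Otherwise, compare the URL host against the news domain list.
    host = urlparse(citation.get("URL") or "").netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if any(host == d or host.endswith("." + d) for d in NEWS_DOMAINS):
        return "news"
    return "other"
```

Because the rules are deterministic and use only fields already present in the citation, the same input always yields the same label, which is consistent with counting only author-assigned identifiers.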

Created:

2024-05-13
