A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia (2024)
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11061211
Description:
This is a collection of translated citation datasets extracted from the Multilingual Wikipedia February 2024 dumps. The extraction and template harmonization pipeline is the same one used for English Wikipedia (https://zenodo.org/records/10782978).
Note: Version 2 fixes an issue with the Italian and French datasets, which were corrupted (failed to upload in full) in the initial version.
In each language, Wikipedia authors can cite sources using language-specific or English templates. Our main effort in compiling these datasets was to assemble lists of citation templates for each language and convert the relevant fields into a common English template. We started with the known citation templates for each language (typically covering books, journals, web pages and news) and, in some cases, augmented these lists with additional frequently used templates (films, links, webarchives, etc.), which we were able to locate via dictionaries of XML reference tags versus usage frequency. For the list of accepted templates, see our source code: https://github.com/albatros13/wikicite/tree/multilang (templates are listed in the __init__.py files of the wikiciteparser library).
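The field conversion described above can be pictured with a minimal sketch. The mapping below is purely illustrative: the names FIELD_MAP_DE and harmonize are hypothetical, and the German field names are an assumption chosen for the example. The actual template lists and mappings are the ones in the wikiciteparser source and may differ.

```python
# Hypothetical field mapping for a German citation template (assumption):
# the real mappings live in the __init__.py files of wikiciteparser.
FIELD_MAP_DE = {
    "autor": "author",
    "titel": "title",
    "verlag": "publisher",
    "datum": "date",
    "isbn": "isbn",
    "doi": "doi",
}


def harmonize(template_fields, field_map):
    """Rename language-specific field names to common English ones.

    Fields missing from the mapping are kept under their original
    (lowercased) name so that no information is silently dropped.
    """
    result = {}
    for name, value in template_fields.items():
        key = name.strip().lower()
        result[field_map.get(key, key)] = value.strip()
    return result


citation = {
    "Autor": "Max Mustermann",
    "Titel": "Beispielbuch",
    "ISBN": "978-3-16-148410-0",
}
harmonized = harmonize(citation, FIELD_MAP_DE)
print(harmonized)
```

Keeping unmapped fields rather than discarding them is a design choice of this sketch, not a documented behaviour of the pipeline.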
A classification label is assigned to each citation (either 'news', 'book', 'journal' or 'other') by a deterministic rule-based classifier that analyses the available identifiers (see the code documentation for details). Please note that the numbers in the table below do not represent an overall estimate of the number of book and journal citations. We count only citations with DOI, PMID, PMC and ISBN identifiers assigned by authors (prior to the lookup process that augments citations with missing identifiers). The number of news citations depends on our list of 22,646 recognised news agency domains.
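As a rough illustration of how an identifier-based rule classifier of this kind might work, here is a minimal sketch. The rule order (DOI, PMID or PMC implies 'journal', ISBN implies 'book', a recognised domain implies 'news') is an assumption, as are the names classify and NEWS_DOMAINS and the field keys 'doi', 'pmid', 'pmc', 'isbn' and 'url'. The authoritative rules are in the code documentation.

```python
from urllib.parse import urlparse

# Placeholder for the list of recognised news agency domains (assumption:
# the real list is far larger and is not reproduced here).
NEWS_DOMAINS = {"example-news.org"}


def classify(citation, news_domains=NEWS_DOMAINS):
    """Return 'journal', 'book', 'news' or 'other' for a harmonized citation.

    Assumed rule order, not confirmed by the dataset description:
    DOI/PMID/PMC -> journal, ISBN -> book, known news domain -> news.
    """
    if any(citation.get(key) for key in ("doi", "pmid", "pmc")):
        return "journal"
    if citation.get("isbn"):
        return "book"
    host = urlparse(citation.get("url") or "").netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in news_domains:
        return "news"
    return "other"


print(classify({"doi": "10.1000/xyz"}))
print(classify({"url": "https://www.example-news.org/story"}))
```

Because every rule is a plain lookup, the same citation always receives the same label, which is what makes such a classifier deterministic.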
Language    Acronym  Link                                          Dump size  Citations  Books    Journals  News
German      de       https://dumps.wikimedia.org/dewiki/20240220/  6.7 GB     4,854,945  320,179  105,542   901,091
French      fr       https://dumps.wikimedia.org/frwiki/20240220/  5.9 GB     9,552,768  798,525  264,560   1,907,183
Russian     ru       https://dumps.wikimedia.org/ruwiki/20240220/  5.1 GB     7,437,100  420,828  130,470   1,370,665
Spanish     es       https://dumps.wikimedia.org/eswiki/20240220/  4.2 GB     6,918,442  522,910  213,767   1,699,396
Italian     it       https://dumps.wikimedia.org/itwiki/20240220/  3.6 GB     5,545,082  384,816  128,366   917,517
Polish      pl       https://dumps.wikimedia.org/plwiki/20240220/  2.4 GB     4,744,158  463,783  95,988    513,006
Portuguese  pt       https://dumps.wikimedia.org/ptwiki/20240220/  2.2 GB     4,775,025  243,593  142,216   1,176,140
Dutch       nl       https://dumps.wikimedia.org/nlwiki/20240220/  1.8 GB     566,549    27,074   12,706    114,110
Swedish     sv       https://dumps.wikimedia.org/svwiki/20240220/  1.5 GB     3,802,416  112,748  155,740   869,662
Catalan     ca       https://dumps.wikimedia.org/cawiki/20240220/  1.2 GB     2,239,714  261,779  105,125   423,241
Finnish     fi       https://dumps.wikimedia.org/fiwiki/20240220/  900.9 MB   1,697,731  209,556  12,068    286,420
Turkish     tr       https://dumps.wikimedia.org/trwiki/20240220   883.9 MB   1,993,177  85,079   56,202    339,122
Norwegian   no       https://dumps.wikimedia.org/nowiki/20240220   763.7 MB   796,500    43,314   12,373    151,780
Danish      da       https://dumps.wikimedia.org/dawiki/20240220   413.3 MB   437,239    23,303   7,522     70,760
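To put the counts in proportion, one can compute the share of each class among all citations of a language. The sketch below does this for the German row. The helper parse_count is a hypothetical name, and it accepts counts written with either dots or commas as thousands separators. Since only citations with author-assigned identifiers are counted, the book and journal shares should be read as lower bounds rather than overall estimates.

```python
def parse_count(text):
    """Parse a count such as '4.854.945' or '4,854,945' into an int.

    Both dots and commas are treated as thousands separators, so this
    must not be used for decimal values such as dump sizes.
    """
    return int(text.replace(".", "").replace(",", ""))


# Counts from the German (de) row of the table.
row = {
    "citations": "4.854.945",
    "books": "320.179",
    "journals": "105.542",
    "news": "901.091",
}

counts = {label: parse_count(value) for label, value in row.items()}
shares = {
    label: counts[label] / counts["citations"]
    for label in ("books", "journals", "news")
}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```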
These datasets can be augmented with identifiers located via the lookup process; the published versions do not include them (there is no 'acquired_ID_list' field). If there is interest in augmented versions, see the source code for instructions or contact the authors for assistance with this task.
This research was supported in part by the University of Amsterdam Data Science Centre.
Created: 2024-05-13



