Frontiers in Psychology Corpus (2010–2021)
Source: DataCite Commons. Updated: 2026-03-20. Indexed: 2026-05-04.
Download link:
https://repod.icm.edu.pl/citation?persistentId=doi:10.18150/4LJ9WD
Resource description:
A comprehensive text corpus of 21,084 papers published in Frontiers in Psychology between 2010 and 2021, processed for computational linguistics and semantic analysis.

Dataset Overview

This corpus contains full-text articles from Frontiers in Psychology, converted from XML to plain text and preprocessed for natural language processing tasks.

Files and Structure

fpsyg_filtered.zip

Contains the filtered text corpus with the following preprocessing applied:
- XML to text conversion: Original XML documents converted to plain text format
- Sentence segmentation: Text segmented into individual sentences
- Boilerplate removal: Journal metadata, headers, footers, and other non-content elements removed using filter_text_corpus.py

fpsyg_tagged.zip.* (2 parts)

Contains the linguistically annotated corpus. This archive is split into multiple parts due to size constraints:
- fpsyg_tagged.zip.001
- fpsyg_tagged.zip.002

To extract: First combine the parts using 7-Zip:

    7z x fpsyg_tagged.zip.001

7-Zip will automatically detect and combine all parts, then extract the contents.

After extraction:
- File: fpsyg_filtered_tagged.conllu
- POS tags: Penn Treebank part-of-speech tags
- Dependency parsing: Universal Dependency (UD) tags
- Tagger: Processed using Stanza
- Format: CoNLL-U format

fpsyg_index.zip.* (6 parts)

Contains a compiled binary index for semantic mining. This archive is split into multiple parts:
- fpsyg_index.zip.001 through fpsyg_index.zip.006

To extract: First combine the parts using 7-Zip:

    7z x fpsyg_index.zip.001

7-Zip will automatically detect and combine all parts, then extract the contents.

After extraction:
- Compatible with ConceptSketch semantic mining software
- Usage: Point ConceptSketch to the extraction directory

Processing Pipeline

Original XML → Text Conversion → Sentence Segmentation → Boilerplate Removal → POS/UD Tagging (Stanza) → Final Corpus

License

This dataset is distributed under the Creative Commons Attribution License (CC BY), derived from the original papers' licensing terms.

Citation

If you use this dataset in your research, please cite:
- Dataset compiled by: Marcin Miłkowski
- Affiliation: Cognitive Metascience Lab & Center for AI in Society, Institute of Philosophy and Sociology, Polish Academy of Sciences

Requirements

- Stanza (for re-tagging or custom processing): https://github.com/stanfordnlp/stanza
- ConceptSketch (for semantic mining): https://github.com/cognitive-metascience/concept-sketch

Contact

For questions regarding this dataset, please contact the Cognitive Metascience Lab at the Institute of Philosophy and Sociology, Polish Academy of Sciences: https://cognitive-metascience.github.io
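
Reading the tagged corpus

The tagged file is described as CoNLL-U, which the format specification defines as ten tab-separated columns per token, with blank lines between sentences and `#` lines for comments. The following is a minimal sketch of a streaming reader that uses only the Python standard library. It is not part of the dataset's own tooling: the function name `read_conllu`, the field labels, and the sample sentence are all hypothetical. It also assumes that the Penn Treebank tags sit in the fifth column (XPOS), which is where Stanza's English models usually place them; this should be checked against the actual file.

```python
import io
from typing import Dict, Iterator, List, TextIO

# Column names follow the CoNLL-U specification (10 tab-separated fields).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(stream: TextIO) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in stream:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. "# text = ...") are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            # Skip malformed rows rather than guessing their layout.
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


# Hypothetical sample in CoNLL-U layout, used only to exercise the reader.
SAMPLE = (
    "# text = Memory fades.\n"
    "1\tMemory\tmemory\tNOUN\tNN\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tfades\tfade\tVERB\tVBZ\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
xpos_tags = [token["xpos"] for token in sentences[0]]
```

To run this on the extracted corpus, the same generator can be given a file handle, for example `open("fpsyg_filtered_tagged.conllu", encoding="utf-8")`. Because it yields sentences lazily, the full file does not need to fit in memory.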
Provider:
RepOD
Created:
2026-03-19



