Frontiers in Psychology Corpus (2010–2021)

Name: Frontiers in Psychology Corpus (2010–2021)
Creator: RepOD
Published: 2026-03-20 08:44:39
License: No description available

Source: DataCite Commons; updated 2026-03-20; indexed 2026-05-04

Download link:

https://repod.icm.edu.pl/citation?persistentId=doi:10.18150/4LJ9WD

Description:

A comprehensive text corpus of 21,084 papers published in Frontiers in Psychology between 2010 and 2021, processed for computational linguistics and semantic analysis.

Dataset Overview

This corpus contains full-text articles from Frontiers in Psychology, converted from XML to plain text and preprocessed for natural language processing tasks.

Files and Structure

fpsyg_filtered.zip

Contains the filtered text corpus with the following preprocessing applied:
- XML to text conversion: Original XML documents converted to plain text format
- Sentence segmentation: Text segmented into individual sentences
- Boilerplate removal: Journal metadata, headers, footers, and other non-content elements removed using filter_text_corpus.py

fpsyg_tagged.zip.* (2 parts)

Contains the linguistically annotated corpus. This archive is split into multiple parts due to size constraints:
- fpsyg_tagged.zip.001
- fpsyg_tagged.zip.002

To extract: First combine the parts using 7-Zip:

    7z x fpsyg_tagged.zip.001

7-Zip will automatically detect and combine all parts, then extract the contents.

After extraction:
- File: fpsyg_filtered_tagged.conllu
- POS tags: Penn Treebank part-of-speech tags
- Dependency parsing: Universal Dependency (UD) tags
- Tagger: Processed using Stanza
- Format: CoNLL-U format

fpsyg_index.zip.* (6 parts)

Contains a compiled binary index for semantic mining. This archive is split into multiple parts:
- fpsyg_index.zip.001 through fpsyg_index.zip.006

To extract: First combine the parts using 7-Zip:

    7z x fpsyg_index.zip.001

7-Zip will automatically detect and combine all parts, then extract the contents.

After extraction:
- Compatible with ConceptSketch semantic mining software
- Usage: Point ConceptSketch to the extraction directory

Processing Pipeline

Original XML → Text Conversion → Sentence Segmentation → Boilerplate Removal → POS/UD Tagging (Stanza) → Final Corpus

License

This dataset is distributed under the Creative Commons Attribution License (CC BY), derived from the original papers' licensing terms.

Citation

If you use this dataset in your research, please cite:
Dataset compiled by: Marcin Miłkowski
Affiliation: Cognitive Metascience Lab & Center for AI in Society, Institute of Philosophy and Sociology, Polish Academy of Sciences

Requirements

- Stanza (for re-tagging or custom processing): https://github.com/stanfordnlp/stanza
- ConceptSketch (for semantic mining): https://github.com/cognitive-metascience/concept-sketch

Contact

For questions regarding this dataset, please contact the Cognitive Metascience Lab at the Institute of Philosophy and Sociology, Polish Academy of Sciences: https://cognitive-metascience.github.io
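Example: reading the tagged corpus

The tagged file fpsyg_filtered_tagged.conllu is too large to load comfortably in one go, so a streaming reader may be useful. The following is a minimal sketch in plain Python, not part of the dataset's own tooling. It relies only on the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with "#", blank lines between sentences). It assumes that the Penn Treebank tags sit in the XPOS column (column 5), which is where Stanza's English models usually write them; this should be checked against the actual file. The names Token, read_conllu, and xpos_counts are illustrative.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """A subset of the ten CoNLL-U columns for one syntactic word."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str   # assumed to hold the Penn Treebank tag
    head: str
    deprel: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence, reading line by line."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it rather than fail.
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append(
            Token(cols[0], cols[1], cols[2], cols[3], cols[4], cols[6], cols[7])
        )
    if sentence:
        # The file may end without a trailing blank line.
        yield sentence


def xpos_counts(path: str) -> Counter:
    """Count XPOS tags across the whole file without loading it into memory."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as handle:
        for sentence in read_conllu(handle):
            counts.update(token.xpos for token in sentence)
    return counts
```

Calling xpos_counts("fpsyg_filtered_tagged.conllu") after extraction would then give a tag frequency table. Because read_conllu accepts any iterable of lines, it can also be tried on a small hand-written sample before running it on the full corpus.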

Provider:

RepOD

Created:

2026-03-19
