Frontiers in Psychology Corpus (2010–2021)

DataCite Commons | Updated 2026-03-20 | Indexed 2026-05-04
Download link:
https://repod.icm.edu.pl/citation?persistentId=doi:10.18150/4LJ9WD
Description:
A comprehensive text corpus of 21,084 papers published in Frontiers in Psychology between 2010 and 2021, processed for computational linguistics and semantic analysis.

Dataset Overview

This corpus contains full-text articles from Frontiers in Psychology, converted from XML to plain text and preprocessed for natural language processing tasks.

Files and Structure

fpsyg_filtered.zip

Contains the filtered text corpus with the following preprocessing applied:
- XML to text conversion: Original XML documents converted to plain text format
- Sentence segmentation: Text segmented into individual sentences
- Boilerplate removal: Journal metadata, headers, footers, and other non-content elements removed using filter_text_corpus.py

fpsyg_tagged.zip.* (2 parts)

Contains the linguistically annotated corpus. This archive is split into multiple parts due to size constraints:
- fpsyg_tagged.zip.001
- fpsyg_tagged.zip.002

To extract, open the first part with 7-Zip:

    7z x fpsyg_tagged.zip.001

7-Zip automatically detects and combines all parts, then extracts the contents.

After extraction:
- File: fpsyg_filtered_tagged.conllu
- POS tags: Penn Treebank part-of-speech tags
- Dependency parsing: Universal Dependencies (UD) tags
- Tagger: Processed using Stanza
- Format: CoNLL-U format

fpsyg_index.zip.* (6 parts)

Contains a compiled binary index for semantic mining. This archive is split into multiple parts:
- fpsyg_index.zip.001 through fpsyg_index.zip.006

To extract, open the first part with 7-Zip:

    7z x fpsyg_index.zip.001

7-Zip automatically detects and combines all parts, then extracts the contents.

After extraction:
- Compatible with ConceptSketch semantic mining software
- Usage: Point ConceptSketch to the extraction directory

Processing Pipeline

Original XML → Text Conversion → Sentence Segmentation → Boilerplate Removal → POS/UD Tagging (Stanza) → Final Corpus

License

This dataset is distributed under the Creative Commons Attribution License (CC BY), derived from the original papers' licensing terms.

Citation

If you use this dataset in your research, please cite:
- Dataset compiled by: Marcin Miłkowski
- Affiliation: Cognitive Metascience Lab & Center for AI in Society, Institute of Philosophy and Sociology, Polish Academy of Sciences

Requirements

- Stanza (for re-tagging or custom processing): https://github.com/stanfordnlp/stanza
- ConceptSketch (for semantic mining): https://github.com/cognitive-metascience/concept-sketch

Contact

For questions regarding this dataset, please contact the Cognitive Metascience Lab at the Institute of Philosophy and Sociology, Polish Academy of Sciences: https://cognitive-metascience.github.io
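
Reading the tagged corpus

As a rough illustration of how the extracted fpsyg_filtered_tagged.conllu file might be consumed, the sketch below streams CoNLL-U sentences with the Python standard library and counts tags. It relies only on the standard ten-column CoNLL-U layout. The helper names (read_conllu, count_tags) and the sample sentence are made up for this example, and the assumption that the Penn Treebank tags sit in the XPOS column (with universal tags in UPOS) follows Stanza's usual English output but has not been checked against the dataset itself.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column order: ten tab-separated fields per token line.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

Token = Dict[str, str]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed lines rather than guess
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence  # file did not end with a blank line


def count_tags(sentences: Iterable[List[Token]], field: str = "xpos") -> Counter:
    """Count the values of one column (assumed: xpos holds Penn Treebank tags)."""
    counts: Counter = Counter()
    for sent in sentences:
        counts.update(tok[field] for tok in sent)
    return counts


# Invented sample in CoNLL-U format, not taken from the corpus.
SAMPLE = (
    "# text = Memory fades.\n"
    "1\tMemory\tmemory\tNOUN\tNN\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tfades\tfade\tVERB\tVBZ\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_tags(read_conllu(SAMPLE.splitlines())).most_common())
    # For the real file, pass an open handle so sentences are streamed:
    # with open("fpsyg_filtered_tagged.conllu", encoding="utf-8") as fh:
    #     print(count_tags(read_conllu(fh)).most_common(10))
```

Passing the open file handle rather than reading the file into memory keeps usage flat, which is likely to matter for a corpus of this size. A dedicated parser such as the one bundled with Stanza may be preferable when full CoNLL-U validation is needed.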
Provider:
RepOD
Created:
2026-03-19