
CO.PRE.PAN Full Corpus (Restricted)

Source: Zenodo. Updated 2026-02-23. Indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.18740954
Resource description:
This record contains the complete CO.PRE.PAN (Corpus de Prensa Panhispánico) press corpus, organized into country-specific ZIP archives with linguistically annotated JSON files. Due to copyright restrictions, all texts and annotations are distributed under restricted access and cannot be shared openly. Users may request access directly through Zenodo.

Contents of this record

Each {COUNTRYCODE}.zip archive contains:
- Press texts in plain text format (txt-files/)
- Annotated JSON files (json-annotated/)

All ZIP archives were generated using the internal script "zenodo_corpus_zip.py", which automatically tracks timestamps and file changes to ensure reproducible versioning.

Corpus description

CO.PRE.PAN is a cross-national corpus of written press Spanish from 18 Spanish-speaking countries, comprising over 14 million words. It is structurally aligned with the spoken broadcast corpus CO.RA.PAN (Corpus Radiofónico Panhispánico) and serves as a scripted register baseline for comparative analyses of national standard varieties of Spanish. All texts are drawn from comparable press genres and produced under broadly equivalent publication conditions across countries, ensuring cross-national comparability.

Versioning

Each version of this record represents a coherent snapshot of the full corpus at a specific point in time. Updates may include newly added texts, corrected or extended annotations, and improvements to preprocessing and linguistic annotation.

Annotation details

Each JSON file contains:
- tokenization and sentence segmentation
- POS tags, lemmas, and morphological features
- dependency relations
- automatic categorization of verbal tense and related features

All annotations are generated using spaCy (model: es_dep_news_trf), followed by project-specific quality control steps, using the same annotation pipeline applied to CO.RA.PAN.

Legal and access information

The restricted status of this record is due to copyright limitations. Only short text extracts may be displayed publicly under scientific quotation rules and text-and-data-mining provisions of EU Directive 2019/790 and the German UrhG (§51, §60d, §44b). Redistribution or reuse of the full texts and annotations is not permitted. Access requests can be submitted directly through Zenodo. For scientific inquiries or technical questions, please contact the CO.PRE.PAN project team.
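
Once access has been granted, the archive layout described above (txt-files/ and json-annotated/ inside each {COUNTRYCODE}.zip) can be inspected without unpacking anything. The following is a minimal sketch using only the Python standard library. The function name group_archive_members is hypothetical, and the code assumes only the two folder names stated in the description; it makes no assumption about file names or about the internal schema of the JSON files, which the record does not document here.

```python
import zipfile
from collections import defaultdict

# Folder names taken from the record description; everything else is illustrative.
DOCUMENTED_FOLDERS = ("txt-files", "json-annotated")


def group_archive_members(zip_source):
    """Group the file names in a country archive by documented folder.

    zip_source may be a path to a {COUNTRYCODE}.zip file or a file-like object.
    Files outside the two documented folders are collected under "other".
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            parent_parts = name.split("/")[:-1]
            # Use the first documented folder found anywhere in the path,
            # so an extra top-level folder in the archive does not matter.
            folder = next(
                (part for part in parent_parts if part in DOCUMENTED_FOLDERS),
                "other",
            )
            groups[folder].append(name)
    return dict(groups)
```

Comparing the number of entries under "txt-files" and "json-annotated" gives a quick check that every plain-text file in a downloaded archive has an annotated counterpart, assuming the two folders are meant to correspond one to one.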
Provider:
Zenodo
Created:
2026-02-23