
proxectonos/SciELO-GL

Source: Hugging Face. Updated 2025-12-19; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/proxectonos/SciELO-GL
Resource description:

The **SciELO Corpus**, hosted in the [OPUS](https://opus.nlpl.eu/SciELO/corpus/version/SciELO) repository, is a large-scale parallel resource composed of full scientific articles extracted from the *Scientific Electronic Library Online (SciELO)*. It provides high-quality sentence pairs between Spanish, Portuguese, and English across diverse academic domains. The corpus is particularly valuable for training machine translation models because its dense specialized terminology and complex grammatical structures reflect real scientific and technical language use in Latin America and Iberia. To address the lack of publicly available Galician data in this domain, the Portuguese segments were adapted into Galician using the transliteration and localization tools in our text [pipeline](https://github.com/proxectonos/pipeline) and [Apertium](https://github.com/apertium). The resulting texts were then normalized through our cleaning pipeline for consistency and readiness for model development. The final resource is a **parallel scientific corpus of ~300,000 aligned sentences** for the pairs **Spanish–Galician** and **English–Galician**.
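The description does not say what the cleaning pipeline actually does, so the following is only a minimal sketch of the kind of normalization commonly applied to parallel sentence pairs, not the authors' implementation. The function names `normalize` and `clean_pairs`, the `max_ratio` threshold, and the sample sentences are all hypothetical and chosen for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Yield normalized (source, target) pairs.

    Skips pairs with an empty side, pairs whose character-length ratio
    exceeds max_ratio (an assumed threshold), and exact duplicates.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty after normalization
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # lengths too different to be a plausible alignment
        if (src, tgt) in seen:
            continue  # duplicate of an earlier pair
        seen.add((src, tgt))
        yield src, tgt


# Made-up Spanish-Galician examples, not taken from the corpus.
raw = [
    ("Los resultados  fueron significativos.", "Os resultados foron significativos."),
    ("Los resultados fueron significativos.", "Os resultados foron  significativos."),
    ("", "Texto sen orixe."),
    ("Sí.", "Este segmento é demasiado longo para ser unha tradución plausible."),
]
cleaned = list(clean_pairs(raw))
```

Here only the first pair survives: the second becomes a duplicate once whitespace is collapsed, the third has an empty source side, and the fourth fails the length-ratio check. A real pipeline would likely add language identification and domain-specific filters on top of these steps.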
Provided by:
proxectonos