HiTZ/latxa-corpus-v2
Source: Hugging Face | Updated: 2026-02-13 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/HiTZ/latxa-corpus-v2
Description:
Latxa Corpus v2 is a large-scale monolingual Basque corpus curated by the HiTZ research center and the IXA research group (University of the Basque Country, UPV/EHU). It combines curated crawls, public datasets, institutional data, and newly collected resources, and substantially increases coverage, diversity, and volume compared to v1.1. The final corpus is deduplicated and filtered, making it suitable for language model pretraining. Data sources include Euscrawl v2, the Egunkaria daily newspaper, Booktegi EPUB books, a subset of the ZelaiHandi corpus, the official gazette of the Basque Government, the official gazette of the Provincial Council of Gipuzkoa, the official gazette of the Provincial Council of Álava, transcriptions of Basque Parliament sessions, Basque academic journals, Basque Wikipedia, the Basque portions of the CulturaX, Colossal OSCAR, FineWeb2, FinePDFs, and HPLT v1 and v2 corpora, and subtitles in Basque.
Provider:
HiTZ



