esCorpius
Source: arXiv | Updated: 2022-07-01 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2206.15147v2
Description:
esCorpius is a Spanish web-crawled corpus extracted from nearly 1 petabyte (PB) of Common Crawl data. Its authors present it as the largest and highest-quality Spanish corpus to date with respect to the extraction, cleaning, and deduplication of web text. The data curation process combines a novel, highly parallel cleaning pipeline with a series of deduplication mechanisms that together preserve the integrity of document and paragraph boundaries.
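As a rough illustration of what paragraph-level deduplication that preserves document and paragraph boundaries can look like, here is a minimal Python sketch. It is not the esCorpius pipeline: the function names `normalize` and `dedup_paragraphs`, the use of SHA-1 over a whitespace- and case-normalized paragraph, and the list-of-lists document representation are all assumptions made for this example.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def dedup_paragraphs(documents):
    """Drop paragraphs already seen earlier in the collection (exact match
    after normalization).

    `documents` is a list of documents, each a list of paragraph strings.
    The result keeps the same nesting, so surviving paragraphs stay inside
    their original document and in their original order.
    """
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc:
            text = normalize(para)
            if not text:
                continue  # skip empty paragraphs
            key = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if key in seen:
                continue  # duplicate of an earlier paragraph
            seen.add(key)
            kept.append(para)
        if kept:
            result.append(kept)  # drop documents left with no paragraphs
    return result
```

Hashing each normalized paragraph keeps memory use bounded by the number of distinct paragraphs rather than their total length. A pipeline operating at Common Crawl scale would also need sharding and near-duplicate detection, which this sketch leaves out.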
Created:
2022-06-30



