Spanish Corpus XIX
Source: arXiv (indexed 2025-09-30)
Download link:
https://huggingface.co/datasets/Flaglab/spanish-corpus-xix
Description:
Spanish Corpus XIX is a curated corpus of historical Spanish texts dated from 1800 to 1914, with special emphasis on Latin American sources. The corpus was filtered and cleaned specifically for semantic shift detection (SSD). It is distributed in three versions: original, cleaned, and chunked. In the chunked version, each text chunk contains at most 256 tokens so that it fits the input limits of different language models. The corpus totals 13 million tokens.
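The 256-token limit of the chunked version can be illustrated with a minimal sketch. It assumes whitespace-separated words as a stand-in for tokens, because the description does not say which tokenizer was used; the function name `chunk_tokens` is hypothetical and is not part of the dataset's tooling.

```python
from typing import List


def chunk_tokens(text: str, max_tokens: int = 256) -> List[str]:
    """Split text into consecutive chunks of at most max_tokens tokens.

    Assumption: a "token" here is a whitespace-separated word. The
    tokenizer actually used to build the corpus is not specified.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    # Step through the token list in strides of max_tokens.
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


if __name__ == "__main__":
    sample = "palabra " * 600  # 600 dummy tokens
    for chunk in chunk_tokens(sample):
        print(len(chunk.split()))
```

With a subword tokenizer the chunk boundaries would differ, since one word can map to several tokens; the same stride logic applies to the token ID sequence instead of the word list.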
Provider:
Flaglab



