
KaLM-Embedding/KaLM-embedding-finetuning-data-spanish

Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data-spanish
Description:

KaLM-embedding-finetuning-data-spanish is a Spanish finetuning dataset for embedding models, adapted from the upstream dataset. It keeps the same training-oriented triplet/list structure as the upstream release and is organized into parquet-backed subsets that can be loaded independently or combined for large-scale embedding training. The dataset comprises 85 subsets across 162 parquet shards and occupies about 30 GB of local disk space. Every sample follows the same embedding finetuning format: query (string, one query per sample), pos (list[string], usually containing one positive example), neg (list[string], usually containing seven negative examples). The dataset is suitable for retrieval-style embedding finetuning, query-document contrastive learning, sentence similarity / STS-style training, and multilingual or cross-lingual embedding adaptation with Spanish data.
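As an illustration of the query/pos/neg format, the sketch below expands one record into (query, positive, negative) triplets, a shape commonly used for contrastive training. The function names `expand_triplets` and `_as_string_list` and the sample record are hypothetical and written for this example; they are not part of the dataset or of any library, and the sample text is invented rather than taken from the data.

```python
from typing import Iterator, List, Mapping, Tuple


def _as_string_list(value: object, field: str) -> List[str]:
    """Coerce a list-like field to a list of strings, rejecting bare strings."""
    if isinstance(value, str):
        raise TypeError(f"'{field}' must be a list of strings, not a single string")
    items = list(value)  # also accepts tuples or array-like containers
    if not all(isinstance(item, str) for item in items):
        raise TypeError(f"every entry in '{field}' must be a string")
    return items


def expand_triplets(record: Mapping[str, object]) -> Iterator[Tuple[str, str, str]]:
    """Yield one (query, positive, negative) triplet per pos/neg combination."""
    query = record["query"]
    if not isinstance(query, str):
        raise TypeError("'query' must be a string")
    positives = _as_string_list(record["pos"], "pos")
    negatives = _as_string_list(record["neg"], "neg")
    for pos in positives:
        for neg in negatives:
            yield (query, pos, neg)


# Hypothetical record in the documented shape (invented text, fewer
# negatives than the usual seven to keep the example short).
sample = {
    "query": "¿Cuál es la capital de España?",
    "pos": ["Madrid es la capital de España."],
    "neg": [
        "Barcelona es una ciudad de Cataluña.",
        "Lisboa es la capital de Portugal.",
    ],
}

triplets = list(expand_triplets(sample))
for query, pos, neg in triplets:
    print(query, "|", pos, "|", neg)
```

A record with one positive and seven negatives would produce seven triplets under this scheme. Training setups that score all negatives for a query in a single batch may prefer to keep the lists intact instead of expanding them.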
Provider:
KaLM-Embedding