codefuse-ai/F2LLM
Source: Hugging Face. Updated: 2025-10-06. Indexed: 2025-10-18.
Download link:
https://hf-mirror.com/datasets/codefuse-ai/F2LLM
Overview:
The F2LLM dataset contains 6 million query-document-negative tuples curated solely from open-source, non-synthetic data, serving as a strong, budget-friendly baseline for training embedding models. The data falls into three task types: retrieval, classification, and clustering. Each retrieval and clustering sample comes with 24 hard negatives, and each classification sample comes with 1 hard negative.
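The per-type hard-negative counts described above can be illustrated with a minimal sketch that expands one sample into (query, positive, negative) triplets, a common input shape for contrastive training. The column names `query`, `passage`, and `negative_<i>`, the helper `expand_sample`, and the toy record are assumptions made for illustration and are not confirmed by this listing; check the dataset's actual schema before relying on them.

```python
from typing import Dict, List, Tuple

# Hard-negative counts per task type, as stated in the dataset description.
NEGATIVES_PER_TASK = {"retrieval": 24, "clustering": 24, "classification": 1}


def expand_sample(sample: Dict[str, str], task: str) -> List[Tuple[str, str, str]]:
    """Turn one sample into (query, positive, negative) triplets.

    The column names `query`, `passage`, and `negative_<i>` are assumptions
    made for this sketch, not a documented schema.
    """
    count = NEGATIVES_PER_TASK[task]
    triplets = []
    for i in range(1, count + 1):
        key = f"negative_{i}"
        if key not in sample:
            raise KeyError(f"expected {count} hard negatives for {task}, missing {key}")
        triplets.append((sample["query"], sample["passage"], sample[key]))
    return triplets


# Toy classification sample (made-up text, not taken from the dataset).
toy = {"query": "great movie", "passage": "positive", "negative_1": "negative"}
print(expand_sample(toy, "classification"))
```

Under these assumptions, a retrieval or clustering sample yields 24 triplets and a classification sample yields 1, so the number of training triplets is much larger than the number of samples.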
Provider:
codefuse-ai



