NorskHelsenett/eti-embedding-training-data-2048-triplets
Hugging Face · Updated 2026-04-29 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/NorskHelsenett/eti-embedding-training-data-2048-triplets
Description:
This dataset contains 330,120 (anchor, positive, negative) triplets for training and fine-tuning Norwegian-language embedding models, particularly for health-related retrieval and RAG applications. The triplets were mined from the source dataset [NorskHelsenett/eti-embedding-training-data-2048](https://huggingface.co/datasets/NorskHelsenett/eti-embedding-training-data-2048), which contains 78,888 anchor-positive pairs of Norwegian health content. Hard negatives were mined using the Sentence Transformers `mine_hard_negatives()` function with FAISS-based approximate nearest neighbor search. Each row contains three text fields: anchor (a question in Norwegian), positive (the correct, relevant passage), and negative (a hard negative: a passage that is related but incorrect). The dataset is intended for fine-tuning embedding models with triplet loss or similar contrastive objectives, for training cross-encoders used as re-rankers in Norwegian health RAG systems, and for improving retrieval quality by teaching models to distinguish between similar but distinct health topics.
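To illustrate why hard negatives matter for a triplet objective, here is a minimal sketch of a margin-based triplet loss on toy 2-D vectors. It is not taken from the dataset card: the function names, the Euclidean distance, the margin value, and the vectors are all assumptions chosen for illustration, and real training would use model embeddings of the anchor, positive, and negative texts.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Margin-based triplet loss: max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


# Toy "embeddings" (hypothetical values, not real model outputs).
anchor = [0, 0]
positive = [3, 4]          # distance 5 from the anchor
hard_negative = [6, 8]     # distance 10: close enough to violate the margin
easy_negative = [30, 40]   # distance 50: already far away

MARGIN = 10.0  # assumed value for this sketch

hard_loss = triplet_loss(anchor, positive, hard_negative, MARGIN)
easy_loss = triplet_loss(anchor, positive, easy_negative, MARGIN)

print("hard negative loss:", hard_loss)
print("easy negative loss:", easy_loss)
```

In this sketch the easy negative yields zero loss and therefore no training signal, while the hard negative yields a positive loss. That is the motivation for mining negatives that are related to the anchor but incorrect.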
Provider:
NorskHelsenett



