TUKE-KEMT/slovak-triplets

Name: TUKE-KEMT/slovak-triplets
Creator: TUKE-KEMT
Published: 2025-12-22 18:43:09
License: No description available

Source: Hugging Face (updated 2025-12-22, indexed 2026-01-03)

Download link:

https://hf-mirror.com/datasets/TUKE-KEMT/slovak-triplets

Description:

This repository contains the Slovak Triplets Dataset, a collection of Slovak-language sentence triplets designed for training and evaluating document embedding models. Each triplet consists of an anchor, a positive (similar to the anchor), and a negative (dissimilar to the anchor). The data is extracted from the Slovak portions of WebFAQ and MQA, two publicly available collections of question-answer pairs. Triplets are built as follows: Q&A pairs are extracted per domain; each question's own answer is taken as the positive; an answer to a different question in the same domain is selected as the negative; and the result is stored as an (anchor question, positive answer, negative answer) triplet. The dataset contains 967,317 triplets. Known issues: the data is not cleaned, it may contain duplicates, and some answers include Markdown formatting.
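The construction steps described above can be illustrated with a minimal Python sketch. It is not the dataset authors' code: the `(domain, question, answer)` input layout, the function names `build_triplets` and `drop_duplicates`, the example domains, and the uniformly random choice of negative are all assumptions made for illustration, since the description does not say how a negative is picked among the candidates.

```python
import random
from collections import defaultdict


def build_triplets(qa_pairs, seed=0):
    """Build (anchor question, positive answer, negative answer) triplets.

    qa_pairs: iterable of (domain, question, answer) tuples.
    This input layout is hypothetical, chosen only for the sketch.
    """
    rng = random.Random(seed)

    # Step 1: group the Q&A pairs by domain.
    by_domain = defaultdict(list)
    for domain, question, answer in qa_pairs:
        by_domain[domain].append((question, answer))

    triplets = []
    for pairs in by_domain.values():
        for i, (question, positive) in enumerate(pairs):
            # Step 2: the question's own answer is the positive.
            # Step 3: candidates for the negative are answers to other
            # questions in the same domain (random choice is an assumption).
            candidates = [
                answer
                for j, (_, answer) in enumerate(pairs)
                if j != i and answer != positive
            ]
            if not candidates:
                continue  # a domain with a single Q&A pair yields no triplet
            negative = rng.choice(candidates)
            # Step 4: store the triplet.
            triplets.append((question, positive, negative))
    return triplets


def drop_duplicates(triplets):
    """Remove exact duplicate triplets while keeping the original order."""
    seen = set()
    unique = []
    for triplet in triplets:
        if triplet not in seen:
            seen.add(triplet)
            unique.append(triplet)
    return unique


# Made-up sample data; the domains and sentences are not from the dataset.
sample = [
    ("shop.example", "Ako môžem vrátiť tovar?", "Tovar môžete vrátiť do 14 dní."),
    ("shop.example", "Aké sú možnosti platby?", "Platiť môžete kartou alebo prevodom."),
    ("solo.example", "Kde sídlite?", "Sídlime v Košiciach."),
]
triplets = drop_duplicates(build_triplets(sample))
for anchor, positive, negative in triplets:
    print(anchor, "|", positive, "|", negative)
```

Because the published data is described as uncleaned and possibly containing duplicates, a pass such as `drop_duplicates` is the kind of step a user might apply before training; stripping Markdown from the answers would need a separate step that is not shown here.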

Provider:

TUKE-KEMT
