Arailym-tleubayeva/KazakhTextDuplicatesv2.0
Source: Hugging Face | Updated: 2025-12-12 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/Arailym-tleubayeva/KazakhTextDuplicatesv2.0
Description:
KazakhTextDuplicates v2.0 is a large-scale dataset for duplicate detection, near-duplicate retrieval, semantic textual similarity (STS), and plagiarism detection in the Kazakh language. Version 2.0 significantly extends the dataset with a large augmented training corpus (200K+ pairs), a continuous semantic similarity score (similarity_score), noisy duplicates at multiple difficulty levels, and a clean train/validation/test split with no identifier overlap. It is designed for training and evaluating modern sentence embedding and retrieval models for low-resource languages. The dataset contains 207,376 text pairs, divided into train (146,072 pairs), validation (16,231 pairs), and test (45,073 pairs) sets. Each pair carries a continuous similarity score (0.40–1.00) and a duplicate type (e.g., exact, noisy_soft, paraphrase, noisy_medium, contextual, noisy_hard, and partial). Each record includes the original text, the modified text, the duplicate type, the similarity score, the text domain/source, the language code, the text length, and the split.
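As a rough illustration of how the splits and the similarity score described above might be used, here is a minimal Python sketch. It makes several assumptions: the only column name taken from the description is `similarity_score`, while `duplicate_type` and the split names passed to `load_split` are hypothetical and should be checked against the actual dataset card. The rows in `toy_rows` are placeholders, not real dataset content, and loading from the Hub requires the `datasets` package and network access.

```python
from collections import Counter

# Split sizes exactly as stated in the description.
SPLIT_SIZES = {"train": 146_072, "validation": 16_231, "test": 45_073}


def split_fractions(sizes):
    """Return the total pair count and each split's share of it."""
    total = sum(sizes.values())
    return total, {name: count / total for name, count in sizes.items()}


def load_split(split="test"):
    """Fetch one split from the Hugging Face Hub.

    Needs `pip install datasets` and network access. The split name is an
    assumption based on the train/validation/test wording in the description.
    """
    from datasets import load_dataset  # lazy import: the helpers below run offline

    return load_dataset("Arailym-tleubayeva/KazakhTextDuplicatesv2.0", split=split)


def filter_by_score(rows, min_score, score_field="similarity_score"):
    """Keep pairs whose similarity score is at least `min_score`."""
    return [row for row in rows if row[score_field] >= min_score]


def count_by_type(rows, type_field="duplicate_type"):
    """Count pairs per duplicate type. `duplicate_type` is an assumed column name."""
    return Counter(row[type_field] for row in rows)


# Placeholder rows (not real dataset content) so the helpers can be tried offline.
toy_rows = [
    {"duplicate_type": "exact", "similarity_score": 1.00},
    {"duplicate_type": "paraphrase", "similarity_score": 0.85},
    {"duplicate_type": "noisy_hard", "similarity_score": 0.55},
    {"duplicate_type": "partial", "similarity_score": 0.40},
]

total, fractions = split_fractions(SPLIT_SIZES)
high_similarity = filter_by_score(toy_rows, min_score=0.8)
type_counts = count_by_type(toy_rows)
```

The same helpers should work on the rows returned by `load_split`, since iterating a loaded split yields one dictionary per record; only the field names may need adjusting.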
Provided by:
Arailym-tleubayeva



