silma-ai/silma-arabic-triplets-dataset-v1.0
Source: Hugging Face. Updated 2024-10-17; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0
Description:
The SILMA Arabic Triplets Dataset - v1.0 is a high-quality, diverse dataset curated for training and evaluating embedding models for semantic search in Arabic. It contains over 2.25 million records, each a triplet of an anchor, a positive sample (semantically similar to the anchor), and a negative sample (semantically dissimilar), so that a model learns both semantic similarity and dissimilarity. The data comes in five subsets drawn from different domains, each with its own sample count and domain background: Akhooli, ArabicQuoraDuplicates, WikiMatrix, TedTalks, and QnA. The columns are anchor, positive, negative, source, anchor_len, positive_len, and negative_len: the anchor, positive, and negative sentences, the data source, and the length of each sentence. The dataset is suited to training embedding models for semantic search systems, fine-tuning language models for Arabic text similarity, and evaluating embedding-based retrieval models.
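To illustrate how an (anchor, positive, negative) triplet can drive training, here is a minimal sketch of a triplet margin loss over cosine distance. The listing does not say which loss the dataset was built for, so the loss choice, the `margin` value, the function names, and the two-dimensional toy vectors are all assumptions made for illustration; real embeddings would come from a model encoding the three sentences.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss; margin=0.5 is an assumed, illustrative value.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical, not taken from the dataset).
anchor_vec = [1.0, 0.0]
positive_vec = [1.0, 0.0]   # same direction as the anchor
negative_vec = [0.0, 1.0]   # orthogonal to the anchor

# Well-separated triplet: no penalty.
loss_separated = triplet_loss(anchor_vec, positive_vec, negative_vec)

# Swapping positive and negative simulates a model that ranks them wrongly.
loss_swapped = triplet_loss(anchor_vec, negative_vec, positive_vec)
```

A correctly ordered triplet contributes no loss, while the swapped one is penalized, which is the signal that pushes similar sentences together and dissimilar ones apart.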
Provider:
silma-ai



