silma-ai/silma-arabic-triplets-dataset-v1.0

Name: silma-ai/silma-arabic-triplets-dataset-v1.0
Creator: silma-ai
Published: 2024-10-17 11:54:28
License: no description available

Source: Hugging Face. Updated 2024-10-17; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0

Overview:

The SILMA Arabic Triplets Dataset - v1.0 is a high-quality, diverse dataset curated for training and evaluating embedding models for Arabic semantic search. It contains over 2.25 million records, each a triplet of an anchor, a positive sample (semantically similar to the anchor), and a negative sample (semantically dissimilar), so that a model can learn both similarity and difference. The records come from five subsets drawn from different domains, each with its own sample count and background: Akhooli, ArabicQuoraDuplicates, WikiMatrix, TedTalks, and QnA. The columns are anchor, positive, negative, source, anchor_len, positive_len, and negative_len: the anchor, positive, and negative sentences, the data source, and the length of each sentence. The dataset is suited to training embedding models for semantic search systems, fine-tuning language models for Arabic text similarity, and evaluating embedding-based retrieval models.
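As an illustration of how anchor/positive/negative records are commonly used, the sketch below builds a record with the seven listed columns and scores toy embeddings with a margin-based triplet loss over cosine distance. This is a generic technique, not a method confirmed by the dataset's creators. The helper names (`make_record`, `is_valid_record`, `triplet_loss`), the margin value, and the choice to store lengths as character counts are all assumptions made for the example; the listing does not say how the length columns are measured.

```python
import math
from typing import Dict, Sequence

# Column names as listed in the dataset description.
TRIPLET_COLUMNS = (
    "anchor", "positive", "negative", "source",
    "anchor_len", "positive_len", "negative_len",
)


def make_record(anchor: str, positive: str, negative: str, source: str) -> Dict[str, object]:
    """Build a toy record with the documented columns (hypothetical helper).

    Assumption: the *_len columns are filled with character counts here;
    the real dataset may measure length differently.
    """
    return {
        "anchor": anchor,
        "positive": positive,
        "negative": negative,
        "source": source,
        "anchor_len": len(anchor),
        "positive_len": len(positive),
        "negative_len": len(negative),
    }


def is_valid_record(record: Dict[str, object]) -> bool:
    """Return True if the record contains every documented column."""
    return all(column in record for column in TRIPLET_COLUMNS)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,  # assumed value, chosen only for illustration
) -> float:
    """Margin-based triplet loss: zero once the positive is closer to the
    anchor than the negative by at least `margin`."""
    return max(
        0.0,
        cosine_distance(anchor, positive) - cosine_distance(anchor, negative) + margin,
    )


# Toy 2-D embeddings standing in for a real encoder's output.
well_separated = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_separated = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

With these toy vectors the first triplet already satisfies the margin, while the second, where the positive and negative are swapped, is penalized. In practice the vectors would come from an embedding model applied to the `anchor`, `positive`, and `negative` text columns.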

Provided by:

silma-ai
