akhooli/arabic-triplets-1m-curated-sims-len

Name: akhooli/arabic-triplets-1m-curated-sims-len
Creator: akhooli
Published: 2024-07-27 12:40:42
License: No description available

Source: Hugging Face | Updated: 2024-07-27 | Indexed: 2024-12-14

Download link:

https://hf-mirror.com/datasets/akhooli/arabic-triplets-1m-curated-sims-len

Description:

This is a curated dataset of 1 million samples for training Arabic ColBERT and SBERT models. Each sample is a triplet consisting of an anchor, a positive, and a negative. Two additional columns, sim_pos and sim_neg, give the cosine similarity between the anchor and the positive and between the anchor and the negative, respectively. The dataset also records the length of the anchor, positive, and negative, measured as the number of space-separated words. It is derived from the mMARCO and NLI datasets, which were filtered and merged; samples containing Latin characters were removed, and the similarity and length information was added. The extra columns let researchers and users filter the triplets by various criteria, for example to select hard negatives.
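As an illustration of the kind of filtering the description mentions, the following is a minimal sketch, not the dataset author's method. It assumes each row is a dict with the keys anchor, positive, negative, sim_pos, and sim_neg. The function name, the thresholds, and the sample rows are hypothetical, and word lengths are recomputed from the text because the listing does not give the names of the length columns.

```python
from typing import Dict, Iterable, Iterator


def word_count(text: str) -> int:
    """Length as the number of whitespace-separated words."""
    return len(text.split())


def select_hard_triplets(
    rows: Iterable[Dict],
    min_neg_sim: float = 0.3,   # hypothetical threshold: negative must be fairly close to the anchor
    min_margin: float = 0.05,   # hypothetical threshold: positive must still beat the negative
    max_words: int = 64,        # hypothetical cap on the length of each text
) -> Iterator[Dict]:
    """Yield triplets whose negative is 'hard' but still less similar than the positive."""
    for row in rows:
        margin = row["sim_pos"] - row["sim_neg"]
        too_long = any(
            word_count(row[key]) > max_words
            for key in ("anchor", "positive", "negative")
        )
        if row["sim_neg"] >= min_neg_sim and margin >= min_margin and not too_long:
            yield row


# Made-up rows for illustration only; these are not taken from the dataset.
sample_rows = [
    {"anchor": "ما هي عاصمة مصر", "positive": "القاهرة هي عاصمة مصر",
     "negative": "الإسكندرية مدينة في مصر", "sim_pos": 0.82, "sim_neg": 0.55},
    {"anchor": "ما هي عاصمة مصر", "positive": "القاهرة هي عاصمة مصر",
     "negative": "الطقس جميل اليوم", "sim_pos": 0.80, "sim_neg": 0.10},
    {"anchor": "كم عدد أيام الأسبوع", "positive": "الأسبوع سبعة أيام",
     "negative": "عدد أيام الأسبوع سبعة", "sim_pos": 0.50, "sim_neg": 0.60},
]

hard = list(select_hard_triplets(sample_rows))
print(len(hard))
```

On these sample rows only the first triplet is kept: the second has an easy negative (low sim_neg), and the third has a negative that scores higher than the positive, which may indicate a mislabeled pair. With the real data, the rows could instead come from iterating over the dataset loaded through the Hugging Face datasets library.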

Provided by:

akhooli
