
akhooli/arabic-triplets-1m-curated-sims-len

Source: Hugging Face. Updated: 2024-07-27. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/akhooli/arabic-triplets-1m-curated-sims-len
Description:
This is a curated dataset of 1 million samples for Arabic ColBERT and SBERT models. Each sample consists of an anchor, a positive, and a negative. Two additional columns, sim_pos and sim_neg, give the cosine similarity between the anchor and the positive and between the anchor and the negative, respectively. The dataset also records the length of the anchor, positive, and negative, measured as the number of space-separated words. It is derived from the mMARCO and NLI datasets, which were filtered and merged; samples containing Latin characters were removed, and the similarity and length columns were added. The aim is to let researchers and users filter the data by a range of criteria, including selecting hard negatives.
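As an illustration of the kind of filtering the description mentions, here is a minimal Python sketch that keeps only triplets with a hard negative. It assumes each row is a dict with the columns anchor, positive, negative, sim_pos, and sim_neg named in the description. The thresholds (neg_floor, margin, max_words), the helper names, and the toy rows are hypothetical and are not taken from the dataset. Because the listing does not give the names of the length columns, the sketch recomputes lengths as space-separated word counts instead of reading them.

```python
from typing import Dict, Iterable, Iterator, Union

# One record; column names follow the description, values here are invented.
Row = Dict[str, Union[str, float]]

TEXT_COLUMNS = ("anchor", "positive", "negative")


def word_count(text: str) -> int:
    """Count whitespace-separated words, mirroring the length definition above."""
    return len(text.split())


def is_hard_negative(row: Row, neg_floor: float = 0.5, margin: float = 0.1) -> bool:
    """A negative is treated as 'hard' when it is similar to the anchor
    (sim_neg >= neg_floor) yet still clearly less similar than the positive
    (sim_pos - sim_neg >= margin). Both thresholds are illustrative assumptions."""
    return row["sim_neg"] >= neg_floor and row["sim_pos"] - row["sim_neg"] >= margin


def filter_triplets(
    rows: Iterable[Row],
    neg_floor: float = 0.5,
    margin: float = 0.1,
    max_words: int = 64,
) -> Iterator[Row]:
    """Yield rows whose texts are short enough and whose negative is hard."""
    for row in rows:
        too_long = any(word_count(row[col]) > max_words for col in TEXT_COLUMNS)
        if not too_long and is_hard_negative(row, neg_floor, margin):
            yield row


# Toy rows for demonstration only; they are NOT samples from the dataset.
SAMPLE_ROWS = [
    {"anchor": "ما هي عاصمة مصر", "positive": "القاهرة هي عاصمة مصر",
     "negative": "الإسكندرية مدينة في مصر", "sim_pos": 0.9, "sim_neg": 0.7},
    {"anchor": "ما هي عاصمة مصر", "positive": "القاهرة هي عاصمة مصر",
     "negative": "الطقس جميل اليوم", "sim_pos": 0.8, "sim_neg": 0.2},
    {"anchor": "ما هي عاصمة مصر", "positive": "القاهرة مدينة كبيرة",
     "negative": "الإسكندرية مدينة كبيرة", "sim_pos": 0.6, "sim_neg": 0.58},
]

if __name__ == "__main__":
    for kept in filter_triplets(SAMPLE_ROWS):
        print(kept["anchor"], "|", kept["negative"], "|", kept["sim_neg"])
```

Rows loaded with the Hugging Face `datasets` library could be passed to `filter_triplets` in the same way, provided the columns carry the names used in the description; suitable thresholds would need to be chosen by inspecting the actual similarity distributions.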
Provider:
akhooli