theojiang/CIVETv2_key_idea_retrieval_dataset_v3.1_gtebase_msmarco

Name: theojiang/CIVETv2_key_idea_retrieval_dataset_v3.1_gtebase_msmarco
Creator: theojiang
Published: 2024-11-21 21:43:58
License: No description available

Source: Hugging Face. Updated: 2024-11-21. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/theojiang/CIVETv2_key_idea_retrieval_dataset_v3.1_gtebase_msmarco

Description:

This dataset contains text data for natural language processing tasks. It has three main features: passage_input_ids, passage_attention_mask, and question_embeddings. passage_input_ids and passage_attention_mask are sequences, stored as int64 and float32 respectively, while question_embeddings is a nested sequence stored as float32. The data is split into a training set of 507,990 samples and a validation set of 500 samples. The dataset card also lists file sizes and download sizes.
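As a rough illustration of the schema described above, the following Python sketch checks one record against the three listed features. Several things here are assumptions rather than facts from the listing: the split name "validation", the expectation that passage_input_ids and passage_attention_mask have equal length (the usual tokenizer convention), and the helper names load_split and summarize_record, which are hypothetical. The toy record uses made-up values and a small embedding size purely for demonstration.

```python
from typing import Any, Dict, Tuple

REPO_ID = "theojiang/CIVETv2_key_idea_retrieval_dataset_v3.1_gtebase_msmarco"


def load_split(split: str = "validation", streaming: bool = True):
    """Load one split from the Hugging Face Hub.

    Needs the `datasets` package and network access. The split name is an
    assumption; the listing only says "training" and "validation" sets exist.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset(REPO_ID, split=split, streaming=streaming)


def summarize_record(record: Dict[str, Any]) -> Tuple[int, int, int]:
    """Check one record against the described schema.

    Returns (passage_length, num_question_embeddings, embedding_dim).
    """
    ids = record["passage_input_ids"]
    mask = record["passage_attention_mask"]
    embs = record["question_embeddings"]

    # Assumption: ids and mask are aligned token by token.
    if len(ids) != len(mask):
        raise ValueError("input ids and attention mask differ in length")
    if not all(isinstance(t, int) for t in ids):
        raise ValueError("passage_input_ids must be integers")

    # question_embeddings is a nested sequence: one float vector per question.
    dims = {len(vec) for vec in embs}
    if len(dims) > 1:
        raise ValueError("question embeddings have inconsistent dimensions")
    dim = dims.pop() if dims else 0
    return len(ids), len(embs), dim


# Toy record with made-up values, shaped like the described features.
toy = {
    "passage_input_ids": [101, 2023, 2003, 102, 0, 0],
    "passage_attention_mask": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "question_embeddings": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
}
print(summarize_record(toy))  # (6, 2, 4)
```

With the datasets package installed, something like next(iter(load_split())) should yield a real record that can be passed to summarize_record, though the actual sequence lengths and embedding dimension are not stated in this listing.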

Provider:

theojiang
