protonx-models/common_mix_law_400k_vit5_tokenized_huyen_pretrain

Name: protonx-models/common_mix_law_400k_vit5_tokenized_huyen_pretrain
Creator: protonx-models
Published: 2025-11-07 08:37:17
License: Not specified

Source: Hugging Face. Updated: 2025-11-07. Indexed: 2025-11-15.

Download link:

https://hf-mirror.com/datasets/protonx-models/common_mix_law_400k_vit5_tokenized_huyen_pretrain

Description:

The dataset has four feature fields: text, input_ids, attention_mask, and labels. The text field is a string, input_ids and attention_mask are lists of integers, and labels is of integer type. The data is split into a training set of 639,984 examples (2,226,484,955 bytes) and a validation set of 147,966 examples (531,647,539 bytes). The total download size is 1,014,701,576 bytes, and the total size after decompression is 2,758,132,494 bytes.
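The figures in the description can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it uses the numbers quoted above and derives the total example count, each split's share, the average bytes per example, and the ratio of decompressed size to download size. The split keys "train" and "validation" and the helper name summarize are assumptions made for this example, not names confirmed by the listing.

```python
# Figures copied from the dataset description; the split keys
# "train" and "validation" are assumed names for illustration.
SPLITS = {
    "train": {"examples": 639_984, "bytes": 2_226_484_955},
    "validation": {"examples": 147_966, "bytes": 531_647_539},
}
DOWNLOAD_BYTES = 1_014_701_576      # total download size
DECOMPRESSED_BYTES = 2_758_132_494  # total size after decompression


def summarize(splits, download_bytes, decompressed_bytes):
    """Derive simple statistics from the published split sizes."""
    total_examples = sum(s["examples"] for s in splits.values())
    total_bytes = sum(s["bytes"] for s in splits.values())
    per_split = {
        name: {
            # Fraction of all examples that fall in this split.
            "share": s["examples"] / total_examples,
            # Average on-disk size of one example in this split.
            "avg_bytes_per_example": s["bytes"] / s["examples"],
        }
        for name, s in splits.items()
    }
    return {
        "total_examples": total_examples,
        # Do the per-split sizes add up to the stated decompressed total?
        "split_bytes_match_total": total_bytes == decompressed_bytes,
        # Decompressed size divided by download size.
        "size_ratio": decompressed_bytes / download_bytes,
        "per_split": per_split,
    }


summary = summarize(SPLITS, DOWNLOAD_BYTES, DECOMPRESSED_BYTES)
print(f"total examples: {summary['total_examples']:,}")
print(f"split sizes sum to stated total: {summary['split_bytes_match_total']}")
print(f"decompressed/download ratio: {summary['size_ratio']:.2f}")
for name, stats in summary["per_split"].items():
    print(
        f"{name}: {stats['share']:.1%} of examples, "
        f"{stats['avg_bytes_per_example']:.0f} bytes/example on average"
    )
```

The two split sizes add up exactly to the stated decompressed total, which suggests the published numbers are internally consistent.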

Provided by:

protonx-models
