Raneechu/tokenized

Name: Raneechu/tokenized
Creator: Raneechu
Published: 2024-06-01 06:29:56
License: No description available

Hugging Face | Updated 2024-06-01 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/Raneechu/tokenized

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  - name: special_tokens_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 7470870
    num_examples: 304
  - name: valid
    num_bytes: 1833594
    num_examples: 75
  download_size: 2706937
  dataset_size: 9304464
---
# Dataset Card for "tokenized"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

The dataset has a single configuration, default, with train and valid splits stored under data/train-* and data/valid-*. Each example has three sequence features: input_ids (int32), attention_mask (int8), and special_tokens_mask (int8). The train split contains 304 examples and the valid split contains 75. The download size is 2706937 bytes, and the total dataset size is 9304464 bytes.
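As a minimal sketch of how the splits described above might be loaded, the snippet below wraps the `load_dataset` function from the Hugging Face `datasets` library. It assumes that library is installed and that the repository is reachable on the Hub; the helper name `load_split` is hypothetical and not part of the dataset card.

```python
REPO_ID = "Raneechu/tokenized"   # repository name taken from the listing
SPLITS = ("train", "valid")      # split names taken from the card metadata


def load_split(split: str = "train"):
    """Load one split of the dataset (hypothetical helper).

    Assumes the `datasets` package is installed and network access is available.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)
```

If the card metadata is still current, `load_split("valid").num_rows` would be expected to report 75, and the returned object's `features` attribute should list the three columns named above.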

Provider:

Raneechu

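Summary of original information

Dataset overview

Configuration

Default configuration (config_name: default)
- Training data (split: train): path data/train-*
- Validation data (split: valid): path data/valid-*

Dataset features

Input IDs (name: input_ids): sequence of int32
Attention mask (name: attention_mask): sequence of int8
Special tokens mask (name: special_tokens_mask): sequence of int8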
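To make the three-column schema concrete, here is a small sketch of a record check in plain Python. The function name `validate_record` and the sample token IDs are made up for illustration. The rule that both masks contain only 0 or 1, and that all three sequences share one length, is an assumption based on typical tokenizer output; the card itself does not state it.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not match the assumed layout."""
    ids = record["input_ids"]
    attn = record["attention_mask"]
    special = record["special_tokens_mask"]

    # Assumption: the three sequences are aligned token by token.
    if not (len(ids) == len(attn) == len(special)):
        raise ValueError("all three sequences must have the same length")

    # input_ids is declared as a sequence of int32.
    if any(not (INT32_MIN <= t <= INT32_MAX) for t in ids):
        raise ValueError("input_ids must fit in int32")

    # Assumption: masks are binary, although int8 could hold other values.
    for name, mask in (("attention_mask", attn), ("special_tokens_mask", special)):
        if any(m not in (0, 1) for m in mask):
            raise ValueError(f"{name} must contain only 0 or 1")


# Illustrative record with invented token IDs, not taken from the dataset.
example = {
    "input_ids": [101, 7592, 2088, 102],
    "attention_mask": [1, 1, 1, 1],
    "special_tokens_mask": [1, 0, 0, 1],
}
validate_record(example)  # passes silently
```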

Dataset splits

Training set (name: train)
- Size: 7470870 bytes
- Examples: 304
Validation set (name: valid)
- Size: 1833594 bytes
- Examples: 75

Dataset size

Download size: 2706937 bytes
Total dataset size: 9304464 bytes
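The size figures can be cross-checked with a little arithmetic. The sketch below confirms that the two split sizes add up to the reported total, then makes a rough estimate of tokens per example. That estimate assumes each token costs 6 bytes (4 for the int32 ID plus 1 for each int8 mask) and ignores any storage overhead, so it is only an approximation. The variable and function names are illustrative.

```python
TRAIN_BYTES, TRAIN_EXAMPLES = 7_470_870, 304
VALID_BYTES, VALID_EXAMPLES = 1_833_594, 75
DOWNLOAD_SIZE = 2_706_937
DATASET_SIZE = 9_304_464

# Assumption: one int32 ID and two int8 mask values per token, no overhead.
BYTES_PER_TOKEN = 4 + 1 + 1


def approx_tokens_per_example(num_bytes: int, num_examples: int) -> float:
    """Rough average token count per example under the 6-bytes-per-token assumption."""
    return num_bytes / num_examples / BYTES_PER_TOKEN


# The split sizes sum exactly to the reported dataset size.
assert TRAIN_BYTES + VALID_BYTES == DATASET_SIZE

train_tokens = approx_tokens_per_example(TRAIN_BYTES, TRAIN_EXAMPLES)
valid_tokens = approx_tokens_per_example(VALID_BYTES, VALID_EXAMPLES)
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"train: about {train_tokens:.0f} tokens per example")
print(f"valid: about {valid_tokens:.0f} tokens per example")
print(f"download size is about {download_ratio:.1%} of the dataset size")
```

Under these assumptions the averages come out close to 4,096 tokens for train and about 4,075 for valid, which would be consistent with sequences packed or truncated to a 4,096-token window, though the card does not confirm this.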
