Raneechu/tokenized
Hugging Face · Updated 2024-06-01 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/Raneechu/tokenized
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  - name: special_tokens_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 7470870
    num_examples: 304
  - name: valid
    num_bytes: 1833594
    num_examples: 75
  download_size: 2706937
  dataset_size: 9304464
---
# Dataset Card for "tokenized"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The dataset has a single configuration, default, with train and valid data files. Its features are input_ids (a sequence of int32) and attention_mask and special_tokens_mask (both sequences of int8). The train and valid splits contain 304 and 75 examples respectively. The download size is 2706937 bytes, and the total dataset size is 9304464 bytes.
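The split and feature names above can be checked against a downloaded copy. The following is a minimal sketch, not part of the original card: `find_mismatches` and `load_and_check` are hypothetical helper names, the repository id is assumed to match the page title, and it assumes the Hugging Face `datasets` library is installed and the repository is still reachable.

```python
# Expected values copied from the dataset card's metadata.
EXPECTED_COLUMNS = {"input_ids", "attention_mask", "special_tokens_mask"}
EXPECTED_ROWS = {"train": 304, "valid": 75}


def find_mismatches(splits):
    """Return human-readable differences between `splits` and the card.

    `splits` is any mapping from split name to an object exposing
    `num_rows` and `column_names`, as `datasets.Dataset` does.
    """
    problems = []
    for name, expected_rows in EXPECTED_ROWS.items():
        if name not in splits:
            problems.append(f"missing split: {name}")
            continue
        split = splits[name]
        if split.num_rows != expected_rows:
            problems.append(
                f"{name}: {split.num_rows} rows, expected {expected_rows}"
            )
        if set(split.column_names) != EXPECTED_COLUMNS:
            problems.append(
                f"{name}: unexpected columns {sorted(split.column_names)}"
            )
    return problems


def load_and_check(repo_id="Raneechu/tokenized"):
    """Download the dataset from the Hub and compare it with the card."""
    # Imported here so the checker above works without `datasets` installed.
    from datasets import load_dataset

    # Without a `split` argument, load_dataset returns a mapping of all splits.
    return find_mismatches(load_dataset(repo_id))
```

Calling `load_and_check()` returns an empty list when the downloaded splits agree with the card. Keeping the comparison in a separate function makes it usable on any already-loaded splits without touching the network.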
Provided by:
Raneechu
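Summary of original information
Dataset overview
Configuration
- Default configuration (config_name: default)
  - Training data (split: train): path data/train-*
  - Validation data (split: valid): path data/valid-*
Dataset features
- Input IDs (name: input_ids): sequence of int32
- Attention mask (name: attention_mask): sequence of int8
- Special tokens mask (name: special_tokens_mask): sequence of int8
Dataset splits
- Training set (name: train)
  - Size: 7470870 bytes
  - Examples: 304
- Validation set (name: valid)
  - Size: 1833594 bytes
  - Examples: 75
Dataset size
- Download size: 2706937 bytes
- Total dataset size: 9304464 bytes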
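
The size figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers from the card. The per-token byte cost is an assumption (4 bytes for an int32 id plus 1 byte for each int8 mask, ignoring any per-row storage overhead), so the token counts it yields are rough estimates rather than properties stated by the card.

```python
# Figures copied from the dataset card's metadata.
SPLITS = {
    "train": {"num_bytes": 7470870, "num_examples": 304},
    "valid": {"num_bytes": 1833594, "num_examples": 75},
}
DOWNLOAD_SIZE = 2706937
DATASET_SIZE = 9304464

# Assumption: one token position costs 4 bytes (int32 input_ids) plus
# 1 byte for each of the two int8 masks; per-row overhead is ignored.
BYTES_PER_TOKEN = 4 + 1 + 1

# The split sizes should add up to the reported dataset size.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Fraction of the dataset size that actually has to be downloaded.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE


def approx_tokens_per_example(split):
    """Estimate the mean sequence length of a split from its byte count."""
    info = SPLITS[split]
    return info["num_bytes"] / info["num_examples"] / BYTES_PER_TOKEN


if __name__ == "__main__":
    print("sizes consistent:", total_bytes == DATASET_SIZE)
    print("total examples:", total_examples)
    print(f"download / dataset size: {download_ratio:.3f}")
    for name in SPLITS:
        print(name, "approx. tokens per example:",
              round(approx_tokens_per_example(name)))
```

The two split sizes sum exactly to the reported total, and the download is a bit under a third of that total. Under the stated byte-cost assumption, both splits average roughly 4,100 token positions per example.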



