
Raneechu/tokenized

Source: Hugging Face | Updated: 2024-06-01 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/Raneechu/tokenized
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  - name: special_tokens_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 7470870
    num_examples: 304
  - name: valid
    num_bytes: 1833594
    num_examples: 75
  download_size: 2706937
  dataset_size: 9304464
---

# Dataset Card for "tokenized"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

The dataset has a single configuration, default, with train and valid data files (data/train-* and data/valid-*). Each example has three sequence features: input_ids (int32), attention_mask (int8), and special_tokens_mask (int8). The train split contains 304 examples (7,470,870 bytes) and the valid split contains 75 examples (1,833,594 bytes). The download size is 2,706,937 bytes, and the total dataset size is 9,304,464 bytes.
Provider:
Raneechu
Summary of original information

Dataset overview

Configuration

  • Default configuration (config_name: default)
    • Training data (split: train): path data/train-*
    • Validation data (split: valid): path data/valid-*
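
The configuration above maps onto the usual Hugging Face `datasets` loading call. The following is a minimal sketch, assuming the third-party `datasets` package is installed and the repository is publicly reachable. The helper name `load_split` is hypothetical and does not come from the dataset card.

```python
# Sketch: load one split of the dataset by name.
# Assumes the third-party `datasets` package is installed and the Hub
# (or a mirror) is reachable; `load_split` is a hypothetical helper.

REPO_ID = "Raneechu/tokenized"   # repository id from the page title
SPLITS = ("train", "valid")      # split names declared in the default config


def load_split(split: str):
    """Return the requested split as a `datasets.Dataset`."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the constants can be used without the dependency.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)


# Example (requires network access):
#   train = load_split("train")
#   print(train.num_rows)
```

If the card metadata is accurate, the train split should report 304 rows and the valid split 75 rows.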

Dataset features

  • Input IDs (name: input_ids): sequence of int32
  • Attention mask (name: attention_mask): sequence of int8
  • Special tokens mask (name: special_tokens_mask): sequence of int8
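
To illustrate how the three features relate, here is a sketch on a synthetic row. The token ids are made up and are not taken from the dataset. It assumes the common Hugging Face tokenizer convention, which the card does not confirm: `attention_mask` is 1 for real tokens and 0 for padding, and `special_tokens_mask` is 1 for special tokens. The function name `content_token_ids` is hypothetical.

```python
from array import array

# Synthetic example row; all values are illustrative only.
row = {
    "input_ids": array("i", [101, 7592, 2088, 102, 0, 0]),   # 32-bit ints on most platforms
    "attention_mask": array("b", [1, 1, 1, 1, 0, 0]),        # 8-bit ints
    "special_tokens_mask": array("b", [1, 0, 0, 1, 1, 1]),   # 8-bit ints
}


def content_token_ids(row):
    """Return ids that are attended to and are not marked as special tokens."""
    return [
        tok
        for tok, attend, special in zip(
            row["input_ids"], row["attention_mask"], row["special_tokens_mask"]
        )
        if attend == 1 and special == 0
    ]


print(content_token_ids(row))  # [7592, 2088]
```

The three sequences are aligned position by position, so a single `zip` is enough to filter one by the other two.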

Dataset splits

  • Training set (name: train)
    • Size: 7470870 bytes
    • Examples: 304
  • Validation set (name: valid)
    • Size: 1833594 bytes
    • Examples: 75

Dataset size

  • Download size: 2706937 bytes
  • Total dataset size: 9304464 bytes
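
The reported sizes can be cross-checked with simple arithmetic. The sketch below uses only the figures listed above. The per-token byte count (4 + 1 + 1) follows from the declared dtypes and ignores any storage overhead, so the tokens-per-example figure is a rough estimate, not a documented property of the dataset.

```python
# Figures copied from the dataset card metadata.
SPLITS = {
    "train": {"num_bytes": 7_470_870, "num_examples": 304},
    "valid": {"num_bytes": 1_833_594, "num_examples": 75},
}
DATASET_SIZE = 9_304_464
DOWNLOAD_SIZE = 2_706_937

# int32 input_ids + int8 attention_mask + int8 special_tokens_mask
BYTES_PER_TOKEN = 4 + 1 + 1

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print("split sizes add up:", total_bytes == DATASET_SIZE)
print("total examples:", total_examples)
print("download / dataset size:", round(download_ratio, 3))

# Rough average tokens per example (assumption: no per-row overhead).
approx_tokens = {
    name: round(s["num_bytes"] / s["num_examples"] / BYTES_PER_TOKEN)
    for name, s in SPLITS.items()
}
print("approx. tokens per example:", approx_tokens)
```

The two split sizes sum exactly to the stated dataset size, and the download is roughly 29% of it. The estimated averages of about 4,096 and 4,075 tokens per example hint at sequences chunked to a fixed length near 4,096, though the card does not state this.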