mariemerenc/carolina_map

Name: mariemerenc/carolina_map
Creator: mariemerenc
Published: 2024-05-21 13:59:52
License: No description available

Source: Hugging Face. Updated: 2024-05-21. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/mariemerenc/carolina_map

Resource overview:

The dataset has five features: meta (string), text (string), input_ids (sequence of 32-bit integers), token_type_ids (sequence of 8-bit integers), and attention_mask (sequence of 8-bit integers). It has a single split named corpus, which holds 2,107,045 examples and 19,902,943,368 bytes. The download size is 5,809,034,651 bytes, and the total dataset size is 19,902,943,368 bytes.

Provider:

mariemerenc

Summary of original information

Dataset overview

Dataset features

meta: string.
text: string.
input_ids: sequence of int32.
token_type_ids: sequence of int8.
attention_mask: sequence of int8.
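
The feature list above can be turned into a simple per-record check. The following is a minimal sketch, not part of the dataset's own tooling: the function name `validate_record` and the sample record are hypothetical, and the requirement that the three token-level sequences have equal length is an assumption based on common tokenizer output, not something the listing states.

```python
from typing import Any, List, Mapping

# Integer ranges for the two sequence dtypes named in the feature list.
INT8_RANGE = (-128, 127)
INT32_RANGE = (-2**31, 2**31 - 1)

# Feature name -> allowed (min, max) for each element of the sequence.
SEQUENCE_FEATURES = {
    "input_ids": INT32_RANGE,
    "token_type_ids": INT8_RANGE,
    "attention_mask": INT8_RANGE,
}
STRING_FEATURES = ("meta", "text")


def validate_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name in STRING_FEATURES:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected a string")
    lengths = set()
    for name, (low, high) in SEQUENCE_FEATURES.items():
        values = record.get(name)
        if not isinstance(values, (list, tuple)):
            problems.append(f"{name}: expected a sequence of integers")
            continue
        if not all(isinstance(v, int) and low <= v <= high for v in values):
            problems.append(f"{name}: value outside [{low}, {high}]")
        lengths.add(len(values))
    # Assumption: the three token-level sequences are aligned one-to-one.
    if len(lengths) > 1:
        problems.append("token sequences differ in length")
    return problems


# Hypothetical record, invented for illustration only.
example = {
    "meta": "{}",
    "text": "example text",
    "input_ids": [101, 2023, 102],
    "token_type_ids": [0, 0, 0],
    "attention_mask": [1, 1, 1],
}
print(validate_record(example))
```

A check like this is mainly useful when re-exporting or filtering the data, where a value outside the int8 range would otherwise be silently truncated.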

Dataset splits

corpus:
- Size: 19,902,943,368 bytes
- Number of examples: 2,107,045

Dataset size

Download size: 5,809,034,651 bytes
Total dataset size: 19,902,943,368 bytes
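
The figures above imply that the download is roughly 29% of the total dataset size, and that an average example takes about 9.4 KB. A short sketch of that arithmetic follows; the variable names are illustrative, and only the three input numbers come from the listing.

```python
# Figures copied from the listing above.
DATASET_BYTES = 19_902_943_368
DOWNLOAD_BYTES = 5_809_034_651
NUM_EXAMPLES = 2_107_045

GIB = 1024**3  # bytes per gibibyte

# Fraction of the total dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

# Average size of one example, counting all five features.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

print(f"download / total size: {download_ratio:.3f}")
print(f"average bytes per example: {avg_bytes_per_example:,.0f}")
print(f"download size: {DOWNLOAD_BYTES / GIB:.2f} GiB")
print(f"total size: {DATASET_BYTES / GIB:.2f} GiB")
```

The per-example average includes the tokenized columns as well as the raw text, so it overstates the size of the text alone.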

Configuration

config_name: default
data_files:
- split: corpus
  path: data/corpus-*
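
Given the configuration above (a default config with a single corpus split), one plausible way to read a few records is the Hugging Face `datasets` library in streaming mode. This is a sketch under assumptions: it presumes `datasets` is installed, that the repository is still reachable under the same ID, and the helper name `stream_corpus` is hypothetical.

```python
REPO_ID = "mariemerenc/carolina_map"
SPLIT = "corpus"


def stream_corpus(limit: int = 3):
    """Yield up to `limit` records without downloading the full split."""
    # Imported lazily so this module loads even if `datasets` is missing.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of first
    # fetching the whole download (about 5.8 GB according to the listing).
    dataset = load_dataset(REPO_ID, split=SPLIT, streaming=True)
    for index, record in enumerate(dataset):
        if index >= limit:
            break
        yield record
```

With network access, `for record in stream_corpus(2): print(record["meta"])` would print the meta field of the first two records. Streaming is chosen here because the full split is large; dropping `streaming=True` would download and cache everything first.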
