amuvarma/contentonly-proc-train-1m-1dups

Name: amuvarma/contentonly-proc-train-1m-1dups
Creator: amuvarma
Published: 2024-11-26 23:18:23
License: No description available

Source: Hugging Face. Updated: 2024-11-26. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/amuvarma/contentonly-proc-train-1m-1dups

Description:

The dataset has four features: transcript (string), facodec_1 (sequence of int64), tokenised_text (sequence of int64), and input_ids (sequence of int32). It consists of a single training split (train) with 1,000,000 examples. The total dataset size is 11,657,305,372 bytes, and the download size is 3,034,608,223 bytes.
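The figures above can be turned into a quick sanity check before downloading. The following is a minimal sketch, not an official loader: the constant and function names are illustrative, the dtype strings are shorthand for the types listed in the description, and `first_examples` assumes the Hugging Face `datasets` package is installed and the repository is reachable. It also assumes the larger size refers to the decoded data and the smaller one to the files actually downloaded, which the listing does not state.

```python
REPO_ID = "amuvarma/contentonly-proc-train-1m-1dups"

# Feature name -> dtype, as listed in the description (shorthand strings).
FEATURES = {
    "transcript": "string",
    "facodec_1": "sequence[int64]",
    "tokenised_text": "sequence[int64]",
    "input_ids": "sequence[int32]",
}

NUM_EXAMPLES = 1_000_000
DATASET_SIZE_BYTES = 11_657_305_372
DOWNLOAD_SIZE_BYTES = 3_034_608_223


def bytes_per_example() -> float:
    """Average stored size of one example, in bytes."""
    return DATASET_SIZE_BYTES / NUM_EXAMPLES


def size_ratio() -> float:
    """Ratio of the total dataset size to the download size."""
    return DATASET_SIZE_BYTES / DOWNLOAD_SIZE_BYTES


def first_examples(n: int = 3) -> list:
    """Stream the first n training rows without fetching the whole dataset.

    Requires the third-party `datasets` package and network access,
    so the import is done lazily inside the function.
    """
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(stream.take(n))


if __name__ == "__main__":
    print(f"average bytes per example: {bytes_per_example():.1f}")
    print(f"dataset size / download size: {size_ratio():.2f}")
```

With these numbers, each example averages roughly 11.7 kB and the full dataset is a little under four times the download size, so streaming a few rows with `first_examples()` is a cheaper way to inspect the schema than a full download.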

Provided by:

amuvarma
