toramaru-u/cc100-ja-750

Name: toramaru-u/cc100-ja-750
Creator: toramaru-u
Published: 2024-07-12 13:30:26
License: no description available

Source: Hugging Face. Updated 2024-07-12; indexed 2024-07-13.

Download link:

https://hf-mirror.com/datasets/toramaru-u/cc100-ja-750

Summary:

The dataset provides three configurations: default, nsp, and nsp-with-punctuation. The default configuration has a single string feature named text. The nsp and nsp-with-punctuation configurations each have four features (idx, next_sentence_label, sentence_a, and sentence_b), which suggests they are intended for next sentence prediction in natural language processing. Every configuration has only a train split. The default configuration holds 458,387,942 examples and each NSP configuration holds roughly 127 million, so the data is sized for training large-scale machine learning models.
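Given the download sizes involved, reading a configuration in streaming mode may be more practical than fetching it whole. The sketch below assumes the Hugging Face `datasets` package is installed and that the repository is reachable on the Hub under the ID shown in this listing; `stream_examples` and its `limit` parameter are illustrative names, not part of the dataset.

```python
# Sketch: stream one configuration instead of downloading tens of gigabytes.
# Assumption: the third-party `datasets` package is installed and the
# repository ID below resolves on the Hugging Face Hub.
from itertools import islice

REPO_ID = "toramaru-u/cc100-ja-750"
CONFIGS = ("default", "nsp", "nsp-with-punctuation")


def stream_examples(config="default", limit=3):
    """Yield up to `limit` records from the train split of `config`."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    from datasets import load_dataset  # imported lazily; third-party package

    # streaming=True returns an iterable dataset that fetches data on demand.
    dataset = load_dataset(REPO_ID, config, split="train", streaming=True)
    yield from islice(dataset, limit)
```

For example, `for record in stream_examples("nsp"): print(record["sentence_a"])` would print the first sentence of a few NSP pairs, provided the network and the package are available.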

Provider:

toramaru-u

Original information summary

Dataset overview

Dataset configurations

Configuration name: default

Features:
- text: string
Splits:
- train: 458,387,942 examples, 75,695,613,009 bytes
Download size: 44,914,752,651 bytes
Dataset size: 75,695,613,009 bytes
Data file path: data/train-*

Configuration name: nsp

Features:
- idx: int64
- next_sentence_label: int64
- sentence_a: string
- sentence_b: string
Splits:
- train: 127,086,714 examples, 31,149,226,287 bytes
Download size: 19,812,891,928 bytes
Dataset size: 31,149,226,287 bytes
Data file path: nsp/train-*
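To make the four NSP features concrete, here is a minimal sketch of one record and of how a sentence pair might be joined for a BERT-style model. The `[CLS]`/`[SEP]` layout, the `NspRecord` and `to_pair_text` names, and the example sentences are assumptions for illustration. The listing does not say which value of next_sentence_label marks a true continuation, so the sketch does not interpret it.

```python
from dataclasses import dataclass


@dataclass
class NspRecord:
    """Mirrors the four features listed for the nsp configurations."""
    idx: int
    next_sentence_label: int  # meaning of 0 vs 1 is not stated in the listing
    sentence_a: str
    sentence_b: str


def to_pair_text(record, cls="[CLS]", sep="[SEP]"):
    """Join both sentences in a BERT-style pair layout (an assumed format)."""
    return f"{cls} {record.sentence_a} {sep} {record.sentence_b} {sep}"


# Made-up example values; not taken from the dataset.
example = NspRecord(
    idx=0,
    next_sentence_label=0,
    sentence_a="今日は晴れです。",
    sentence_b="散歩に行きます。",
)
print(to_pair_text(example))
```

In practice a tokenizer would normally insert the special tokens itself; the string form here only shows how sentence_a and sentence_b relate to each other.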

Configuration name: nsp-with-punctuation

Features:
- idx: int64
- next_sentence_label: int64
- sentence_a: string
- sentence_b: string
Splits:
- train: 127,758,778 examples, 31,875,939,342 bytes
Download size: 20,041,081,317 bytes
Dataset size: 31,875,939,342 bytes
Data file path: nsp-with-punctuation/train-*
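The figures listed for the three configurations allow a quick sanity check of record size and compression. The sketch below derives the mean bytes per example and the ratio of download size to dataset size. It assumes that "dataset size" is the size of the prepared data and "download size" is the size of the stored files, which is the usual Hugging Face convention but is not stated in this listing. `STATS` and `summarize` are illustrative names.

```python
# Figures copied from the configuration listing:
# (examples, dataset size in bytes, download size in bytes)
STATS = {
    "default": (458_387_942, 75_695_613_009, 44_914_752_651),
    "nsp": (127_086_714, 31_149_226_287, 19_812_891_928),
    "nsp-with-punctuation": (127_758_778, 31_875_939_342, 20_041_081_317),
}


def summarize(stats):
    """Return {config: (mean bytes per example, download/dataset ratio)}."""
    return {
        name: (size / n, download / size)
        for name, (n, size, download) in stats.items()
    }


for name, (mean_bytes, ratio) in summarize(STATS).items():
    print(f"{name}: {mean_bytes:.1f} bytes/example, download ratio {ratio:.2f}")
```

By this arithmetic a default record averages roughly 165 bytes and an NSP pair roughly 245 to 250 bytes, and in each configuration the download is about 60 percent of the dataset size.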
