toramaru-u/cc100-ja-750
Source: Hugging Face. Updated 2024-07-12; indexed 2024-07-13.
Download link:
https://hf-mirror.com/datasets/toramaru-u/cc100-ja-750
Summary:
The dataset provides three configurations: default, nsp, and nsp-with-punctuation. The default configuration has a single string feature named text, which holds raw text. The nsp and nsp-with-punctuation configurations each have four features (idx, next_sentence_label, sentence_a, and sentence_b), which are likely intended for next sentence prediction tasks in natural language processing. Every configuration contains only a train split. The splits are large (about 458 million samples for default and about 127 million for each of the two nsp configurations), which makes the dataset suitable for training large-scale machine learning models.
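As a rough illustration of how the three configurations might be opened, the sketch below uses the Hugging Face `datasets` library in streaming mode, so that nothing is downloaded in full up front. The helper name `load_train_split` and the `CONFIG_COLUMNS` table are hypothetical and introduced only for this example; the repository ID, configuration names, and column names are taken from this listing. It assumes the `datasets` package is installed and the hub is reachable.

```python
from typing import Dict, Tuple

REPO_ID = "toramaru-u/cc100-ja-750"

# Column names per configuration, copied from this listing.
CONFIG_COLUMNS: Dict[str, Tuple[str, ...]] = {
    "default": ("text",),
    "nsp": ("idx", "next_sentence_label", "sentence_a", "sentence_b"),
    "nsp-with-punctuation": ("idx", "next_sentence_label", "sentence_a", "sentence_b"),
}


def load_train_split(config: str = "default", streaming: bool = True):
    """Open the train split of one configuration (hypothetical helper)."""
    if config not in CONFIG_COLUMNS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Only a train split is listed for every configuration.
    return load_dataset(REPO_ID, config, split="train", streaming=streaming)


if __name__ == "__main__":
    # Peek at the first record of the nsp configuration (requires network access).
    stream = load_train_split("nsp")
    first = next(iter(stream))
    print({name: first[name] for name in CONFIG_COLUMNS["nsp"]})
```

Streaming is chosen here because the listed download sizes run to tens of gigabytes per configuration; passing `streaming=False` would instead download and cache the whole split.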
Provider:
toramaru-u
Original information summary
Dataset overview
Dataset configurations
Configuration name: default
- Features:
text: string
- Splits:
train: 458,387,942 samples, 75,695,613,009 bytes
- Download size: 44,914,752,651 bytes
- Dataset size: 75,695,613,009 bytes
- Data file path:
data/train-*
Configuration name: nsp
- Features:
idx: int64
next_sentence_label: int64
sentence_a: string
sentence_b: string
- Splits:
train: 127,086,714 samples, 31,149,226,287 bytes
- Download size: 19,812,891,928 bytes
- Dataset size: 31,149,226,287 bytes
- Data file path:
nsp/train-*
Configuration name: nsp-with-punctuation
- Features:
idx: int64
next_sentence_label: int64
sentence_a: string
sentence_b: string
- Splits:
train: 127,758,778 samples, 31,875,939,342 bytes
- Download size: 20,041,081,317 bytes
- Dataset size: 31,875,939,342 bytes
- Data file path:
nsp-with-punctuation/train-*
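The size figures listed for the three configurations can be turned into quantities that are easier to plan around, such as average bytes per sample and total download volume. The following is a small sketch of that arithmetic; the `STATS` table simply restates the numbers from this listing, and the `summarize` helper is a hypothetical name used only here.

```python
# (samples, dataset bytes, download bytes) per configuration, from this listing.
STATS = {
    "default": (458_387_942, 75_695_613_009, 44_914_752_651),
    "nsp": (127_086_714, 31_149_226_287, 19_812_891_928),
    "nsp-with-punctuation": (127_758_778, 31_875_939_342, 20_041_081_317),
}

GIB = 1024 ** 3  # bytes in one gibibyte


def summarize(stats):
    """Derive per-configuration averages and sizes in GiB (hypothetical helper)."""
    summary = {}
    for name, (samples, dataset_bytes, download_bytes) in stats.items():
        summary[name] = {
            "bytes_per_sample": dataset_bytes / samples,
            "dataset_gib": dataset_bytes / GIB,
            "download_gib": download_bytes / GIB,
            # Fraction of the dataset size that actually has to be downloaded.
            "download_ratio": download_bytes / dataset_bytes,
        }
    return summary


# Combined download volume if all three configurations are fetched.
total_download_bytes = sum(download for _, _, download in STATS.values())

if __name__ == "__main__":
    for name, row in summarize(STATS).items():
        print(
            f"{name}: {row['bytes_per_sample']:.1f} B/sample, "
            f"{row['download_gib']:.1f} GiB download, "
            f"{row['dataset_gib']:.1f} GiB dataset"
        )
    print(f"total download: {total_download_bytes / GIB:.1f} GiB")
```

On these figures the default configuration averages roughly 165 bytes per sample, and fetching all three configurations would transfer 84,768,725,896 bytes, which is just under 79 GiB.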



