toramaru-u/cc100-ja-1024

Name: toramaru-u/cc100-ja-1024
Creator: toramaru-u
Published: 2024-07-17 00:25:44
License: not specified

Hugging Face. Updated: 2024-07-17. Indexed: 2024-07-13.

Download link:

https://hf-mirror.com/datasets/toramaru-u/cc100-ja-1024

Summary:

This dataset provides multiple configurations, each aimed at a different natural language processing task. The default configuration contains plain text in a single text field. The nsp configuration and its variants (such as nsp-20240716 and nsp-with-punctuation) contain data for next sentence prediction: sentence pairs with a label. Each configuration has a single train split, and the download sizes range from roughly 20 GB to 45 GB, so the data is suited to large-scale language model training.

Provider:

toramaru-u

Original information summary

Dataset overview

Configuration details

Default configuration (`default`)

Features:
- text: string
Splits:
- train: 458,387,942 examples, 75,695,613,009 bytes
Download size: 44,915,133,864 bytes
Dataset size: 75,695,613,009 bytes

NSP configuration (`nsp`)

Features:
- idx: int64
- next_sentence_label: int64
- sentence_a: string
- sentence_b: string
Splits:
- train: 127,086,714 examples, 31,149,226,287 bytes
Download size: 19,813,155,017 bytes
Dataset size: 31,149,226,287 bytes
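
The four NSP fields listed above can be modeled as a small typed record. The following is a minimal sketch, assuming rows arrive as plain dictionaries; the names `NspRecord` and `parse_nsp_record` are hypothetical, and the sample row is made up for illustration. The listing does not say which label value marks a true continuation, so the sketch checks only the types.

```python
from dataclasses import dataclass

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


@dataclass(frozen=True)
class NspRecord:
    """One row of an NSP configuration, using the listed field names."""
    idx: int
    next_sentence_label: int
    sentence_a: str
    sentence_b: str


def parse_nsp_record(row: dict) -> NspRecord:
    """Validate a raw row (a plain dict) and convert it to an NspRecord."""
    for name in ("idx", "next_sentence_label"):
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {type(value).__name__}")
        if not INT64_MIN <= value <= INT64_MAX:
            raise ValueError(f"{name} does not fit in int64: {value}")
    for name in ("sentence_a", "sentence_b"):
        if not isinstance(row[name], str):
            raise TypeError(f"{name} must be a string")
    return NspRecord(
        row["idx"], row["next_sentence_label"], row["sentence_a"], row["sentence_b"]
    )


# Made-up row for illustration only; it is not taken from the dataset.
example = parse_nsp_record({
    "idx": 0,
    "next_sentence_label": 0,
    "sentence_a": "今日は天気がいい。",
    "sentence_b": "散歩に出かけよう。",
})
```

The same record shape applies to the `nsp-20240716` and `nsp-with-punctuation` configurations, which list identical fields.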

NSP 20240716 configuration (`nsp-20240716`)

Features:
- idx: int64
- next_sentence_label: int64
- sentence_a: string
- sentence_b: string
Splits:
- train: 127,225,260 examples, 31,853,006,444 bytes
Download size: 19,999,759,727 bytes
Dataset size: 31,853,006,444 bytes

NSP with punctuation configuration (`nsp-with-punctuation`)

Features:
- idx: int64
- next_sentence_label: int64
- sentence_a: string
- sentence_b: string
Splits:
- train: 127,758,778 examples, 31,875,939,342 bytes
Download size: 20,041,130,298 bytes
Dataset size: 31,875,939,342 bytes
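
The byte counts above are easier to compare once converted to GiB and to per-example averages. The following is a rough sketch that only does arithmetic on the figures copied from this listing; the names `CONFIG_STATS` and `summarize` are hypothetical, and the download ratio is simply download size divided by dataset size.

```python
# (examples, dataset bytes, download bytes) copied from the listing.
CONFIG_STATS = {
    "default": (458_387_942, 75_695_613_009, 44_915_133_864),
    "nsp": (127_086_714, 31_149_226_287, 19_813_155_017),
    "nsp-20240716": (127_225_260, 31_853_006_444, 19_999_759_727),
    "nsp-with-punctuation": (127_758_778, 31_875_939_342, 20_041_130_298),
}

GIB = 1024**3  # bytes per gibibyte


def summarize(name):
    """Derive sizes in GiB, average bytes per example, and download ratio."""
    examples, dataset_bytes, download_bytes = CONFIG_STATS[name]
    return {
        "dataset_gib": dataset_bytes / GIB,
        "download_gib": download_bytes / GIB,
        "bytes_per_example": dataset_bytes / examples,
        "download_ratio": download_bytes / dataset_bytes,
    }


for name in CONFIG_STATS:
    s = summarize(name)
    print(
        f"{name}: {s['dataset_gib']:.1f} GiB dataset, "
        f"{s['download_gib']:.1f} GiB download, "
        f"{s['bytes_per_example']:.0f} bytes/example, "
        f"ratio {s['download_ratio']:.2f}"
    )
```

On these figures the default configuration averages about 165 bytes per example, and every configuration downloads at roughly 60% of its stated dataset size.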

Data file paths

Default configuration (default):
- train: data/train-*
NSP configuration (nsp):
- train: nsp/train-*
NSP 20240716 configuration (nsp-20240716):
- train: nsp-20240716/train-*
NSP with punctuation configuration (nsp-with-punctuation):
- train: nsp-with-punctuation/train-*
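
The path patterns above map each configuration to its train files. The following is a minimal sketch of how they might be used, assuming the Hugging Face `datasets` package is installed and the hub is reachable for the loading step. The helper names `config_for_path` and `stream_train` are hypothetical. Streaming is used because each configuration is tens of gigabytes.

```python
from fnmatch import fnmatchcase

REPO_ID = "toramaru-u/cc100-ja-1024"

# Configuration name -> glob pattern for its train split, as listed.
DATA_FILES = {
    "default": "data/train-*",
    "nsp": "nsp/train-*",
    "nsp-20240716": "nsp-20240716/train-*",
    "nsp-with-punctuation": "nsp-with-punctuation/train-*",
}


def config_for_path(path):
    """Return the configuration whose train pattern matches a file path, or None."""
    for name, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return name
    return None


def stream_train(config="default"):
    """Open the train split of one configuration without downloading it in full.

    Assumption: the `datasets` package is available; it is imported lazily
    so the pattern helper above works without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=True)
```

For example, `config_for_path("nsp/train-00000.parquet")` returns `"nsp"`; that file name is invented for illustration, since the listing gives only the glob patterns. Iterating over `stream_train("nsp")` should yield one dictionary per row with the four NSP fields.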
