Norod78/cc100_heb

Name: Norod78/cc100_heb
Creator: Norod78
Published: 2024-07-14 12:50:08
License: No description available

Hugging Face | Updated 2024-07-14 | Indexed 2024-07-22

Download link:

https://hf-mirror.com/datasets/Norod78/cc100_heb

Resource overview:

CC-100 Hebrew is the Hebrew subset of the CC-100 dataset, provided in Parquet format. It targets text generation and fill-mask tasks in natural language processing. Each record has two string features, id and text. The dataset has a single train split containing 207,542,919 examples, with a total size of 37,346,904,255 bytes.

Provider:

Norod78

Original information summary

Dataset overview

Basic information

Dataset name: CC-100 Hebrew
Language: Hebrew (he)
Task categories:
- Text generation
- Fill-mask

Data structure

Features:
- id: string
- text: string
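
The two-field structure above can be sketched as a small type plus a validity check. This is only an illustration: the names `Record` and `is_valid_record` are hypothetical, and the sample value is made up rather than taken from the dataset.

```python
from typing import TypedDict


class Record(TypedDict):
    """One row as described in the feature list: two string fields."""
    id: str
    text: str


def is_valid_record(obj: object) -> bool:
    """Return True if obj is a dict with exactly the string fields id and text."""
    return (
        isinstance(obj, dict)
        and set(obj) == {"id", "text"}
        and all(isinstance(obj[key], str) for key in ("id", "text"))
    )


# Made-up sample row, for illustration only.
sample: Record = {"id": "0", "text": "שלום עולם"}
```

A check like this could be run over a handful of rows after loading to confirm the data matches the listed features.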

Data splits

Training set (train):
- Number of examples: 207,542,919
- Size: 37,346,904,255 bytes

Configuration

Default configuration (default):
- Data file path: data/train-*
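
Given a single default configuration with one train split, a minimal way to inspect a few rows might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID `Norod78/cc100_heb`; the helper name `peek` is hypothetical. Streaming mode is used so the full download is not required just to look at a few records.

```python
from itertools import islice

# Assumption: the dataset is published on the Hub under this repository ID.
REPO_ID = "Norod78/cc100_heb"


def peek(n=3):
    """Return the first n rows of the train split as a list of dicts."""
    # Imported lazily so this module loads even without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote Parquet files
    # instead of downloading the whole split first.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `peek(3)` should return three dicts, each with the `id` and `text` keys listed under the features.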

Data size

Download size: 19,281,508,755 bytes
Total dataset size: 37,346,904,255 bytes
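
To put the reported sizes in perspective, the short calculation below derives a few ratios from the figures in this listing. The constant and function names are illustrative, and the results are simple arithmetic on the published numbers, not measurements.

```python
# Figures copied from the listing.
NUM_EXAMPLES = 207_542_919
DATASET_BYTES = 37_346_904_255
DOWNLOAD_BYTES = 19_281_508_755

GIB = 1024 ** 3  # bytes per gibibyte


def summarize():
    """Derive average example size, download ratio, and sizes in GiB."""
    return {
        "avg_bytes_per_example": DATASET_BYTES / NUM_EXAMPLES,
        "download_to_dataset_ratio": DOWNLOAD_BYTES / DATASET_BYTES,
        "dataset_gib": DATASET_BYTES / GIB,
        "download_gib": DOWNLOAD_BYTES / GIB,
    }


stats = summarize()
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The average works out to roughly 180 bytes per example, and the download is about 52% of the reported total dataset size.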
