Norod78/hewiki-20220901-articles-dataset

Name: Norod78/hewiki-20220901-articles-dataset
Creator: Norod78
Published: 2022-11-22 10:57:40
License: not specified

Source: Hugging Face. Updated: 2022-11-22. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Norod78/hewiki-20220901-articles-dataset

Description:

The hewiki-20220901-articles-dataset is derived from the article text of Hebrew Wikipedia (hewiki) as of September 1, 2022. It has a single training split with 4,325,836 samples, totaling 1,458,031,124 bytes. Each sample has one string field named "text". The dataset is tagged for the text generation and fill-mask task categories, which correspond to language modeling and masked language modeling. It is monolingual (Hebrew).

Provider:

Norod78

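Summary of original information

Dataset overview

Basic information

Dataset name: hewiki-20220901-articles-dataset
Dataset size: 1,458,031,124 bytes
Download size: 745,537,027 bytes
Training split size: 1,458,031,124 bytes, 4,325,836 samples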
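The figures above support a quick sanity check on average sample length and on how much smaller the download is than the loaded data. The following is a minimal sketch; the constant and function names are illustrative, and only the three numbers come from the listing.

```python
# Figures copied from the basic information above.
DATASET_SIZE_BYTES = 1_458_031_124
DOWNLOAD_SIZE_BYTES = 745_537_027
NUM_SAMPLES = 4_325_836


def average_sample_bytes(total_bytes: int, num_samples: int) -> float:
    """Mean size of one sample in bytes."""
    return total_bytes / num_samples


def download_ratio(download_bytes: int, total_bytes: int) -> float:
    """Download size as a fraction of the dataset size."""
    return download_bytes / total_bytes


avg_bytes = average_sample_bytes(DATASET_SIZE_BYTES, NUM_SAMPLES)
ratio = download_ratio(DOWNLOAD_SIZE_BYTES, DATASET_SIZE_BYTES)

print(f"average sample size: {avg_bytes:.1f} bytes")
print(f"download / dataset size: {ratio:.1%}")
```

By this arithmetic a sample averages roughly 337 bytes, and the download is about 51% of the dataset size.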

Features

Feature name: text
Data type: string
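Since each sample exposes a single string field, reading the data is straightforward. The sketch below assumes the Hugging Face `datasets` library is installed and that the dataset is still hosted under the ID shown in this listing; `preview_texts` is a hypothetical helper name, and streaming mode is chosen here only to avoid downloading the full archive.

```python
from itertools import islice

# Dataset ID as given in this listing.
DATASET_ID = "Norod78/hewiki-20220901-articles-dataset"


def preview_texts(n: int = 3) -> list[str]:
    """Return the `text` field of the first n samples of the train split."""
    # Imported lazily so the module loads even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of downloading them all.
    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    return [row["text"] for row in islice(stream, n)]
```

Calling `preview_texts(3)` requires network access and should return three Hebrew strings, assuming the hosted files match the schema described above.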

Language and multilinguality

Language: Hebrew (he)
Multilinguality: monolingual

Dataset category

Size category: 100M<n<1B

Source and extension

Source datasets: extended from Wikipedia

Tasks and applications

Task categories:
- text generation
- fill-mask
Task IDs:
- language modeling
- masked language modeling

Dataset alias

Alias: hewiki Corpus from hewiki-20220901-pages-articles-multistream.xml.bz2
