TemryL/tokenized_wikipedia_20220301.en_train_128

Name: TemryL/tokenized_wikipedia_20220301.en_train_128
Creator: TemryL
Published: 2024-08-03 20:32:49
License: No description available

Source: Hugging Face. Updated: 2024-08-03. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/TemryL/tokenized_wikipedia_20220301.en_train_128

Description:

This dataset contains tokenized chunks of text from the English Wikipedia dump of March 1, 2022. It was built by loading the raw Wikipedia dataset, tokenizing each article with the BERT base uncased tokenizer, and splitting the tokenized articles into chunks of exactly 128 tokens while preserving the document index and chunk index of every chunk. Each row corresponds to one chunk of a Wikipedia document and contains fields such as token_ids, attention_mask, doc_idx, and chunk_idx, so the source document and the position within that document can be recovered for any chunk.
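The chunking step described above can be illustrated with a minimal sketch. It is not the creator's actual script: the function name `chunk_documents` is hypothetical, the input is assumed to be already tokenized (lists of token IDs), and the handling of a trailing partial chunk is an assumption, since the description does not say whether short remainders are padded or dropped. Here they are padded with ID 0, which is the `[PAD]` ID in the BERT base uncased vocabulary, and masked out in attention_mask.

```python
from typing import Dict, Iterable, Iterator, List, Union

CHUNK_SIZE = 128  # chunk length stated in the dataset description
PAD_ID = 0        # [PAD] id in the bert-base-uncased vocabulary

Row = Dict[str, Union[int, List[int]]]


def chunk_documents(
    tokenized_docs: Iterable[List[int]],
    chunk_size: int = CHUNK_SIZE,
    pad_id: int = PAD_ID,
) -> Iterator[Row]:
    """Yield one row per fixed-size chunk of each tokenized document.

    Assumption (not confirmed by the source): a trailing partial chunk
    is padded to chunk_size and its padding is masked with zeros.
    """
    for doc_idx, ids in enumerate(tokenized_docs):
        starts = range(0, len(ids), chunk_size)
        for chunk_idx, start in enumerate(starts):
            piece = list(ids[start:start + chunk_size])
            n_real = len(piece)
            n_pad = chunk_size - n_real
            yield {
                "token_ids": piece + [pad_id] * n_pad,
                "attention_mask": [1] * n_real + [0] * n_pad,
                "doc_idx": doc_idx,      # which document the chunk comes from
                "chunk_idx": chunk_idx,  # position of the chunk in that document
            }


if __name__ == "__main__":
    # Two toy "documents" standing in for tokenized Wikipedia articles.
    docs = [list(range(1, 301)), [7, 8, 9]]
    for row in chunk_documents(docs):
        print(row["doc_idx"], row["chunk_idx"], sum(row["attention_mask"]))
```

In a real pipeline the token ID lists would come from the BERT base uncased tokenizer applied to each article; that step is omitted here so the sketch runs without downloading a model or the Wikipedia dump.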

Provider:

TemryL
