TemryL/tokenized_wikipedia_20220301.en_train_512

Name: TemryL/tokenized_wikipedia_20220301.en_train_512
Creator: TemryL
Published: 2024-08-03 19:40:20
License: No description provided

Hugging Face · updated 2024-08-03 · indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/TemryL/tokenized_wikipedia_20220301.en_train_512

Description:

This dataset contains tokenized chunks of text from the English Wikipedia dump of March 1, 2022. Each entry is one chunk of a Wikipedia document, stored together with the index of the document it comes from and its position within that document. The text was tokenized with the BERT base uncased tokenizer, and each chunk contains exactly 512 tokens (including special tokens). Processing consisted of loading the raw dataset, tokenizing, creating chunks, generating attention masks, and preserving document and chunk index information. Each entry has four fields: token_ids, attention_mask, doc_idx, and chunk_idx.
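
The chunking step described above can be illustrated with a minimal sketch. It is not the creator's actual processing script: the function name `chunk_document`, the choice to wrap every chunk in [CLS] and [SEP], and the padding of the final chunk with [PAD] are assumptions made for illustration. The special-token ids (0, 101, 102) are those of the bert-base-uncased vocabulary, and the input is assumed to be a document already converted to token ids without special tokens.

```python
from typing import Dict, Iterator, List

# Special-token ids in the bert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
CHUNK_LEN = 512  # chunk size stated in the dataset description


def chunk_document(token_ids: List[int], doc_idx: int,
                   chunk_len: int = CHUNK_LEN) -> Iterator[Dict[str, object]]:
    """Split one document's token ids into fixed-length chunks.

    Hypothetical helper: assumes each chunk is [CLS] + body + [SEP],
    and that a short final chunk is right-padded with [PAD].
    """
    body_len = chunk_len - 2  # leave room for [CLS] and [SEP]
    for chunk_idx, start in enumerate(range(0, len(token_ids), body_len)):
        body = token_ids[start:start + body_len]
        ids = [CLS_ID] + body + [SEP_ID]
        n_real = len(ids)          # tokens the model should attend to
        n_pad = chunk_len - n_real  # padding needed to reach chunk_len
        yield {
            "token_ids": ids + [PAD_ID] * n_pad,
            "attention_mask": [1] * n_real + [0] * n_pad,
            "doc_idx": doc_idx,      # which document the chunk came from
            "chunk_idx": chunk_idx,  # position of the chunk in that document
        }


# Example: a fake 700-token document split into two 512-token chunks.
chunks = list(chunk_document(list(range(1000, 1700)), doc_idx=7))
```

Under these assumptions a 700-token document yields two entries: the first is fully populated, while the second holds 190 content tokens plus the two special tokens, with the remaining positions masked out by zeros in attention_mask.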

Provider:

TemryL
