timpal0l/trafilatura-extracted-full-txt
Source: Hugging Face. Updated 2024-08-10; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/timpal0l/trafilatura-extracted-full-txt
Description:
The dataset has two splits, train and test. The train split contains 6,928,615 examples (184,550,589,140 bytes); the test split contains 769,847 examples (20,505,644,692 bytes). Each example has three features: input_ids (sequence of int32), attention_mask (sequence of int8), and labels (sequence of int64). The dataset totals 205,056,233,832 bytes, with a download size of 62,121,422,076 bytes.
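The figures in the description can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers quoted above; the constant names and the `summarize` helper are hypothetical and are not part of the dataset or of any library.

```python
# Figures copied verbatim from the dataset description.
TRAIN_EXAMPLES = 6_928_615
TRAIN_BYTES = 184_550_589_140
TEST_EXAMPLES = 769_847
TEST_BYTES = 20_505_644_692
TOTAL_BYTES = 205_056_233_832
DOWNLOAD_BYTES = 62_121_422_076


def summarize():
    """Derive a few sanity-check statistics from the quoted figures."""
    total_examples = TRAIN_EXAMPLES + TEST_EXAMPLES
    return {
        # Do the two split sizes add up to the stated total?
        "sizes_add_up": TRAIN_BYTES + TEST_BYTES == TOTAL_BYTES,
        "total_examples": total_examples,
        # Share of all examples that fall in the test split.
        "test_fraction": TEST_EXAMPLES / total_examples,
        # Average stored size of one example in each split.
        "train_bytes_per_example": TRAIN_BYTES / TRAIN_EXAMPLES,
        "test_bytes_per_example": TEST_BYTES / TEST_EXAMPLES,
        # Sizes in GiB (2**30 bytes).
        "total_gib": TOTAL_BYTES / 2**30,
        "download_gib": DOWNLOAD_BYTES / 2**30,
        # Ratio of stored size to download size.
        "size_to_download_ratio": TOTAL_BYTES / DOWNLOAD_BYTES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Run as a script, this shows that the split sizes sum to the stated total, that the test split holds almost exactly 10% of the 7,698,462 examples, and that both splits average exactly 26,636 bytes per example. That constant per-example size is consistent with fixed-length sequences (for instance, 2,048 tokens at 13 bytes per token across the three features, plus 12 bytes of per-row overhead), but this is an inference from the numbers, not something the description states.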
Provider:
timpal0l



