botp/TigerResearch-pretrain_zh

Name: botp/TigerResearch-pretrain_zh
Creator: botp
Published: 2023-08-21 07:39:09
License: No description available

Source: Hugging Face. Updated: 2023-08-21. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/botp/TigerResearch-pretrain_zh

Resource overview:

This dataset is the Chinese portion of Tigerbot's pre-training data. Measured before compression, it comprises 12 GB of Chinese books, 25 GB of Chinese internet text, and 19 GB of Chinese encyclopedia text. Each record has the fields dataType, title, content, uniqueKey, titleUkey, and id. The train split contains 16,905,023 examples totaling 58,043,923,125 bytes.
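
The figures in the description can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and it assumes the quoted GB values are decimal gigabytes, which the listing does not state.

```python
# Figures quoted in the dataset description.
NUM_EXAMPLES = 16_905_023
DATASET_BYTES = 58_043_923_125

# Per-source sizes before compression, in GB (assumed decimal: 1 GB = 1e9 bytes).
COMPONENT_GB = {"books": 12, "internet text": 25, "encyclopedia": 19}


def summarize() -> dict:
    """Derive a few sanity-check numbers from the quoted figures."""
    return {
        "component_total_gb": sum(COMPONENT_GB.values()),
        "dataset_gb": DATASET_BYTES / 1e9,        # decimal gigabytes
        "dataset_gib": DATASET_BYTES / 2**30,     # binary gibibytes
        "avg_bytes_per_example": DATASET_BYTES / NUM_EXAMPLES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the three components sum to 56 GB, close to the roughly 58.0 GB (about 54.1 GiB) reported for the train split, and the average example is about 3.4 KB.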

Provider:

botp

Summary of original information

Dataset overview

Dataset information

Features:
- dataType: string
- title: string
- content: string
- uniqueKey: string
- titleUkey: string
- id: int64
Splits:
- train: 16,905,023 examples, 58,043,923,125 bytes
Download size: 25,662,051,889 bytes
Dataset size: 58,043,923,125 bytes
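
As a rough illustration of how the feature list might be used, the sketch below checks a record against the six listed fields and shows one way to stream a few records without fetching the full download. The names `EXPECTED_FIELDS`, `validate_record`, and `stream_records` are hypothetical, the repository id is taken from the download link above, and the code assumes that no field is null, which the listing does not confirm. Streaming relies on the third-party Hugging Face `datasets` package and network access.

```python
from typing import Any, Dict, Iterator

# Field names and types as given in the feature list (five strings and an int64 id).
EXPECTED_FIELDS = {
    "dataType": str,
    "title": str,
    "content": str,
    "uniqueKey": str,
    "titleUkey": str,
    "id": int,
}


def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    for name, expected_type in EXPECTED_FIELDS.items():
        value = record[name]
        if expected_type is int:
            # bool is a subclass of int in Python, so reject it explicitly,
            # then make sure the value fits in a signed 64-bit integer.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not -2**63 <= value < 2**63:
                return False
        elif not isinstance(value, expected_type):
            return False
    return True


def stream_records(
    repo_id: str = "botp/TigerResearch-pretrain_zh", limit: int = 3
) -> Iterator[Dict[str, Any]]:
    """Yield the first `limit` train records in streaming mode (needs network)."""
    from datasets import load_dataset  # third-party: pip install datasets

    dataset = load_dataset(repo_id, split="train", streaming=True)
    for index, record in enumerate(dataset):
        if index >= limit:
            break
        yield record
```

Calling `validate_record` on each item from `stream_records()` is a cheap way to confirm that the schema matches the listing before committing to the full download of about 25.7 GB.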

Data source

The dataset is derived from TigerResearch/pretrain_zh.
