Amirjalaly/books_fegh
Hugging Face | Updated 2024-02-26 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Amirjalaly/books_fegh
Resource description:
---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: language
    dtype: string
  - name: original_nlines
    dtype: string
  - name: part
    dtype: string
  - name: page
    dtype: string
  - name: nlines
    dtype: string
  - name: length
    dtype: string
  - name: title
    dtype: string
  - name: raw_content
    dtype: string
  - name: date_download
    dtype: string
  - name: language_score
    dtype: string
  - name: type
    dtype: string
  - name: perplexity
    dtype: string
  - name: original_length
    dtype: string
  - name: source_domain
    dtype: string
  splits:
  - name: book_part1_fegh
    num_bytes: 49004718
    num_examples: 20000
  - name: book_part2_fegh
    num_bytes: 62933514
    num_examples: 20000
  - name: book_part3_fegh
    num_bytes: 58078049
    num_examples: 20000
  - name: book_part4_fegh
    num_bytes: 58591383
    num_examples: 20000
  - name: book_part5_fegh
    num_bytes: 42504116
    num_examples: 20000
  - name: book_part6_fegh
    num_bytes: 50998384
    num_examples: 20000
  - name: book_part7_fegh
    num_bytes: 52735009
    num_examples: 20000
  - name: book_part8_fegh
    num_bytes: 54972205
    num_examples: 20000
  - name: book_part9_fegh
    num_bytes: 65020286
    num_examples: 20000
  - name: book_part10_fegh
    num_bytes: 54380664
    num_examples: 20000
  - name: book_part11_fegh
    num_bytes: 47427339
    num_examples: 20000
  - name: book_part12_fegh
    num_bytes: 48398860
    num_examples: 20000
  - name: book_part13_fegh
    num_bytes: 45573841
    num_examples: 20000
  - name: book_part14_fegh
    num_bytes: 48445623
    num_examples: 20000
  - name: book_part15_fegh
    num_bytes: 50559997
    num_examples: 20000
  - name: book_part16_fegh
    num_bytes: 51662992
    num_examples: 20000
  - name: book_part17_fegh
    num_bytes: 50755938
    num_examples: 20000
  - name: book_part18_fegh
    num_bytes: 57893738
    num_examples: 20000
  - name: book_part19_fegh
    num_bytes: 57818764
    num_examples: 20000
  - name: book_part20_fegh
    num_bytes: 65119365
    num_examples: 20000
  - name: book_part21_fegh
    num_bytes: 173500719
    num_examples: 20000
  - name: book_part22_fegh
    num_bytes: 53707115
    num_examples: 20000
  - name: book_part23_fegh
    num_bytes: 50702659
    num_examples: 20000
  - name: book_part24_fegh
    num_bytes: 55158664
    num_examples: 20000
  - name: book_part25_fegh
    num_bytes: 50015458
    num_examples: 20000
  - name: book_part26_fegh
    num_bytes: 38386325
    num_examples: 13982
  download_size: 647726723
  dataset_size: 1494345725
configs:
- config_name: default
  data_files:
  - split: book_part1_fegh
    path: data/book_part1_fegh-*
  - split: book_part2_fegh
    path: data/book_part2_fegh-*
  - split: book_part3_fegh
    path: data/book_part3_fegh-*
  - split: book_part4_fegh
    path: data/book_part4_fegh-*
  - split: book_part5_fegh
    path: data/book_part5_fegh-*
  - split: book_part6_fegh
    path: data/book_part6_fegh-*
  - split: book_part7_fegh
    path: data/book_part7_fegh-*
  - split: book_part8_fegh
    path: data/book_part8_fegh-*
  - split: book_part9_fegh
    path: data/book_part9_fegh-*
  - split: book_part10_fegh
    path: data/book_part10_fegh-*
  - split: book_part11_fegh
    path: data/book_part11_fegh-*
  - split: book_part12_fegh
    path: data/book_part12_fegh-*
  - split: book_part13_fegh
    path: data/book_part13_fegh-*
  - split: book_part14_fegh
    path: data/book_part14_fegh-*
  - split: book_part15_fegh
    path: data/book_part15_fegh-*
  - split: book_part16_fegh
    path: data/book_part16_fegh-*
  - split: book_part17_fegh
    path: data/book_part17_fegh-*
  - split: book_part18_fegh
    path: data/book_part18_fegh-*
  - split: book_part19_fegh
    path: data/book_part19_fegh-*
  - split: book_part20_fegh
    path: data/book_part20_fegh-*
  - split: book_part21_fegh
    path: data/book_part21_fegh-*
  - split: book_part22_fegh
    path: data/book_part22_fegh-*
  - split: book_part23_fegh
    path: data/book_part23_fegh-*
  - split: book_part24_fegh
    path: data/book_part24_fegh-*
  - split: book_part25_fegh
    path: data/book_part25_fegh-*
  - split: book_part26_fegh
    path: data/book_part26_fegh-*
---
Provider:
Amirjalaly
Summary of original information
Dataset overview
Features
The dataset contains the following features:
- url: string
- language: string
- original_nlines: string
- part: string
- page: string
- nlines: string
- length: string
- title: string
- raw_content: string
- date_download: string
- language_score: string
- type: string
- perplexity: string
- original_length: string
- source_domain: string
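Every feature is typed as a string, so count-like and score-like fields would need casting before numeric use. The following is a minimal sketch, assuming that fields such as nlines, length, perplexity, and language_score hold numeric text (the source does not confirm their contents). The helper names `cast_record` and `_parse` and the sample record are hypothetical.

```python
# Assumption: these fields contain integer-like or float-like text.
INT_FIELDS = ("original_nlines", "nlines", "length", "original_length")
FLOAT_FIELDS = ("language_score", "perplexity")


def _parse(value, caster):
    """Return caster(value), or None when the text is missing or malformed."""
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None


def cast_record(record):
    """Return a copy of `record` with numeric-looking string fields cast."""
    out = dict(record)
    for name in INT_FIELDS:
        if name in out:
            out[name] = _parse(out[name], int)
    for name in FLOAT_FIELDS:
        if name in out:
            out[name] = _parse(out[name], float)
    return out


# Hypothetical record for illustration only; not taken from the dataset.
sample = {"title": "example", "nlines": "12", "perplexity": "341.5", "length": ""}
casted = cast_record(sample)
```

Malformed or empty values become None rather than raising, which keeps a bulk pass over the records from stopping at the first bad row.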
Data splits
The dataset is divided into 26 splits. Each contains 20,000 examples, except the last, which contains 13,982. The size of each split is:
- book_part1_fegh: 49,004,718 bytes
- book_part2_fegh: 62,933,514 bytes
- book_part3_fegh: 58,078,049 bytes
- book_part4_fegh: 58,591,383 bytes
- book_part5_fegh: 42,504,116 bytes
- book_part6_fegh: 50,998,384 bytes
- book_part7_fegh: 52,735,009 bytes
- book_part8_fegh: 54,972,205 bytes
- book_part9_fegh: 65,020,286 bytes
- book_part10_fegh: 54,380,664 bytes
- book_part11_fegh: 47,427,339 bytes
- book_part12_fegh: 48,398,860 bytes
- book_part13_fegh: 45,573,841 bytes
- book_part14_fegh: 48,445,623 bytes
- book_part15_fegh: 50,559,997 bytes
- book_part16_fegh: 51,662,992 bytes
- book_part17_fegh: 50,755,938 bytes
- book_part18_fegh: 57,893,738 bytes
- book_part19_fegh: 57,818,764 bytes
- book_part20_fegh: 65,119,365 bytes
- book_part21_fegh: 173,500,719 bytes
- book_part22_fegh: 53,707,115 bytes
- book_part23_fegh: 50,702,659 bytes
- book_part24_fegh: 55,158,664 bytes
- book_part25_fegh: 50,015,458 bytes
- book_part26_fegh: 38,386,325 bytes
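As a quick consistency check on these figures, the short sketch below totals the examples and bytes across the 26 splits and finds the largest one. The byte values are copied from the list above; the variable names are illustrative.

```python
# Byte sizes per split, in order from part 1 to part 26.
SPLIT_BYTES = [
    49_004_718, 62_933_514, 58_078_049, 58_591_383, 42_504_116,
    50_998_384, 52_735_009, 54_972_205, 65_020_286, 54_380_664,
    47_427_339, 48_398_860, 45_573_841, 48_445_623, 50_559_997,
    51_662_992, 50_755_938, 57_893_738, 57_818_764, 65_119_365,
    173_500_719, 53_707_115, 50_702_659, 55_158_664, 50_015_458,
    38_386_325,
]
FULL_SPLIT_EXAMPLES = 20_000  # every split except the last
LAST_SPLIT_EXAMPLES = 13_982  # the final split

total_examples = FULL_SPLIT_EXAMPLES * (len(SPLIT_BYTES) - 1) + LAST_SPLIT_EXAMPLES
total_bytes = sum(SPLIT_BYTES)
# Part numbers are 1-based, so add 1 to the index of the largest entry.
largest_part = max(range(len(SPLIT_BYTES)), key=SPLIT_BYTES.__getitem__) + 1

print(total_examples)  # 513982
print(total_bytes)     # 1494345725
print(largest_part)    # 21
```

The byte total equals the dataset size reported for the dataset. Part 21 is roughly three times the size of a typical split despite holding the same 20,000 examples.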
Dataset size
- Download size: 647,726,723 bytes
- Dataset size: 1,494,345,725 bytes
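The two figures differ by more than a factor of two. The sketch below expresses the download as a fraction of the dataset size. Reading the gap as the effect of compressed storage is an assumption; the source does not state the file format.

```python
DOWNLOAD_SIZE = 647_726_723   # bytes, as listed above
DATASET_SIZE = 1_494_345_725  # bytes, as listed above

# Fraction of the dataset size that must actually be downloaded.
ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download is {ratio:.1%} of the dataset size")
print(f"{DOWNLOAD_SIZE / 1024**2:.1f} MiB vs {DATASET_SIZE / 1024**2:.1f} MiB")
```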
Configuration
- config_name: default
- data_files: each split's path follows the pattern data/book_partX_fegh-*, where X is the split number.
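The naming pattern can be sketched in a few lines. The helpers `split_name`, `data_file_pattern`, and `load_part` are hypothetical names introduced for illustration. `load_part` assumes the Hugging Face `datasets` package is installed and that the repository is still reachable under the id shown on this page.

```python
NUM_PARTS = 26  # part numbers run from 1 to 26 in the split list


def split_name(part):
    """Return the split name for a 1-based part number."""
    return f"book_part{part}_fegh"


def data_file_pattern(part):
    """Return the data_files glob for a 1-based part number."""
    return f"data/{split_name(part)}-*"


def load_part(part):
    """Load one split from the Hub; needs the `datasets` package and network access."""
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset("Amirjalaly/books_fegh", split=split_name(part))


# Mapping of every split name to its file pattern.
patterns = {split_name(p): data_file_pattern(p) for p in range(1, NUM_PARTS + 1)}
```

Calling `load_part(26)` would request only the final, smaller split, which may be a convenient way to inspect the records without fetching all 26 parts.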