
Amirjalaly/books_fegh

Source: Hugging Face. Updated 2024-02-26. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Amirjalaly/books_fegh
Resource description:
---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: language
    dtype: string
  - name: original_nlines
    dtype: string
  - name: part
    dtype: string
  - name: page
    dtype: string
  - name: nlines
    dtype: string
  - name: length
    dtype: string
  - name: title
    dtype: string
  - name: raw_content
    dtype: string
  - name: date_download
    dtype: string
  - name: language_score
    dtype: string
  - name: type
    dtype: string
  - name: perplexity
    dtype: string
  - name: original_length
    dtype: string
  - name: source_domain
    dtype: string
  splits:
  - name: book_part1_fegh
    num_bytes: 49004718
    num_examples: 20000
  - name: book_part2_fegh
    num_bytes: 62933514
    num_examples: 20000
  - name: book_part3_fegh
    num_bytes: 58078049
    num_examples: 20000
  - name: book_part4_fegh
    num_bytes: 58591383
    num_examples: 20000
  - name: book_part5_fegh
    num_bytes: 42504116
    num_examples: 20000
  - name: book_part6_fegh
    num_bytes: 50998384
    num_examples: 20000
  - name: book_part7_fegh
    num_bytes: 52735009
    num_examples: 20000
  - name: book_part8_fegh
    num_bytes: 54972205
    num_examples: 20000
  - name: book_part9_fegh
    num_bytes: 65020286
    num_examples: 20000
  - name: book_part10_fegh
    num_bytes: 54380664
    num_examples: 20000
  - name: book_part11_fegh
    num_bytes: 47427339
    num_examples: 20000
  - name: book_part12_fegh
    num_bytes: 48398860
    num_examples: 20000
  - name: book_part13_fegh
    num_bytes: 45573841
    num_examples: 20000
  - name: book_part14_fegh
    num_bytes: 48445623
    num_examples: 20000
  - name: book_part15_fegh
    num_bytes: 50559997
    num_examples: 20000
  - name: book_part16_fegh
    num_bytes: 51662992
    num_examples: 20000
  - name: book_part17_fegh
    num_bytes: 50755938
    num_examples: 20000
  - name: book_part18_fegh
    num_bytes: 57893738
    num_examples: 20000
  - name: book_part19_fegh
    num_bytes: 57818764
    num_examples: 20000
  - name: book_part20_fegh
    num_bytes: 65119365
    num_examples: 20000
  - name: book_part21_fegh
    num_bytes: 173500719
    num_examples: 20000
  - name: book_part22_fegh
    num_bytes: 53707115
    num_examples: 20000
  - name: book_part23_fegh
    num_bytes: 50702659
    num_examples: 20000
  - name: book_part24_fegh
    num_bytes: 55158664
    num_examples: 20000
  - name: book_part25_fegh
    num_bytes: 50015458
    num_examples: 20000
  - name: book_part26_fegh
    num_bytes: 38386325
    num_examples: 13982
  download_size: 647726723
  dataset_size: 1494345725
configs:
- config_name: default
  data_files:
  - split: book_part1_fegh
    path: data/book_part1_fegh-*
  - split: book_part2_fegh
    path: data/book_part2_fegh-*
  - split: book_part3_fegh
    path: data/book_part3_fegh-*
  - split: book_part4_fegh
    path: data/book_part4_fegh-*
  - split: book_part5_fegh
    path: data/book_part5_fegh-*
  - split: book_part6_fegh
    path: data/book_part6_fegh-*
  - split: book_part7_fegh
    path: data/book_part7_fegh-*
  - split: book_part8_fegh
    path: data/book_part8_fegh-*
  - split: book_part9_fegh
    path: data/book_part9_fegh-*
  - split: book_part10_fegh
    path: data/book_part10_fegh-*
  - split: book_part11_fegh
    path: data/book_part11_fegh-*
  - split: book_part12_fegh
    path: data/book_part12_fegh-*
  - split: book_part13_fegh
    path: data/book_part13_fegh-*
  - split: book_part14_fegh
    path: data/book_part14_fegh-*
  - split: book_part15_fegh
    path: data/book_part15_fegh-*
  - split: book_part16_fegh
    path: data/book_part16_fegh-*
  - split: book_part17_fegh
    path: data/book_part17_fegh-*
  - split: book_part18_fegh
    path: data/book_part18_fegh-*
  - split: book_part19_fegh
    path: data/book_part19_fegh-*
  - split: book_part20_fegh
    path: data/book_part20_fegh-*
  - split: book_part21_fegh
    path: data/book_part21_fegh-*
  - split: book_part22_fegh
    path: data/book_part22_fegh-*
  - split: book_part23_fegh
    path: data/book_part23_fegh-*
  - split: book_part24_fegh
    path: data/book_part24_fegh-*
  - split: book_part25_fegh
    path: data/book_part25_fegh-*
  - split: book_part26_fegh
    path: data/book_part26_fegh-*
---
Provider:
Amirjalaly
Summary of the original information

Dataset overview

Features

The dataset contains the following features, all stored as strings:

  • url: string
  • language: string
  • original_nlines: string
  • part: string
  • page: string
  • nlines: string
  • length: string
  • title: string
  • raw_content: string
  • date_download: string
  • language_score: string
  • type: string
  • perplexity: string
  • original_length: string
  • source_domain: string
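
Because every feature is declared as a string, fields with numeric-sounding names need to be parsed before they can be used in filters or statistics. The sketch below is one way to do that. It assumes, without confirmation from the dataset card, that original_nlines, nlines, length and original_length hold integer text and that language_score and perplexity hold decimal text. The helper name cast_record is hypothetical, and any value that fails to parse is left as the original string.

```python
from typing import Any, Dict

# Assumption: these string-typed columns contain numeric text.
# The dataset card only states that their dtype is "string".
INT_FIELDS = ("original_nlines", "nlines", "length", "original_length")
FLOAT_FIELDS = ("language_score", "perplexity")


def cast_record(record: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of `record` with numeric-looking fields parsed.

    Fields that are missing or fail to parse keep their original value.
    """
    out: Dict[str, Any] = dict(record)
    casters = [(name, int) for name in INT_FIELDS]
    casters += [(name, float) for name in FLOAT_FIELDS]
    for name, caster in casters:
        value = out.get(name)
        if value is None:
            continue
        try:
            out[name] = caster(value)
        except (TypeError, ValueError):
            pass  # keep the raw string when it is not numeric
    return out
```

Leaving unparseable values untouched is a deliberate choice here: it avoids silently dropping rows whose metadata is formatted differently from what this sketch expects.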

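Data splits

The dataset is divided into 26 splits. Each split contains 20,000 examples, except the last, which contains 13,982. The size of each split is as follows: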

  • book_part1_fegh: 49,004,718 bytes
  • book_part2_fegh: 62,933,514 bytes
  • book_part3_fegh: 58,078,049 bytes
  • book_part4_fegh: 58,591,383 bytes
  • book_part5_fegh: 42,504,116 bytes
  • book_part6_fegh: 50,998,384 bytes
  • book_part7_fegh: 52,735,009 bytes
  • book_part8_fegh: 54,972,205 bytes
  • book_part9_fegh: 65,020,286 bytes
  • book_part10_fegh: 54,380,664 bytes
  • book_part11_fegh: 47,427,339 bytes
  • book_part12_fegh: 48,398,860 bytes
  • book_part13_fegh: 45,573,841 bytes
  • book_part14_fegh: 48,445,623 bytes
  • book_part15_fegh: 50,559,997 bytes
  • book_part16_fegh: 51,662,992 bytes
  • book_part17_fegh: 50,755,938 bytes
  • book_part18_fegh: 57,893,738 bytes
  • book_part19_fegh: 57,818,764 bytes
  • book_part20_fegh: 65,119,365 bytes
  • book_part21_fegh: 173,500,719 bytes
  • book_part22_fegh: 53,707,115 bytes
  • book_part23_fegh: 50,702,659 bytes
  • book_part24_fegh: 55,158,664 bytes
  • book_part25_fegh: 50,015,458 bytes
  • book_part26_fegh: 38,386,325 bytes
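
The fixed split size makes it straightforward to treat the 26 splits as one sequence. The sketch below computes the total example count from the figures above and maps a position in that sequence to a split name and an offset within the split. The "global index" is a convention introduced here for illustration (splits concatenated in numeric order); it is not something the dataset itself defines, and the function name locate is hypothetical.

```python
from typing import Tuple

SPLIT_SIZE = 20_000        # examples in each of the first 25 splits
NUM_SPLITS = 26
LAST_SPLIT_SIZE = 13_982   # examples in book_part26_fegh

# 25 full splits plus the shorter final one.
TOTAL_EXAMPLES = SPLIT_SIZE * (NUM_SPLITS - 1) + LAST_SPLIT_SIZE


def locate(global_index: int) -> Tuple[str, int]:
    """Map a 0-based index over all splits to (split name, offset in split).

    Assumes the splits are concatenated in numeric order, part 1 first.
    """
    if not 0 <= global_index < TOTAL_EXAMPLES:
        raise IndexError(f"index {global_index} out of range")
    part, offset = divmod(global_index, SPLIT_SIZE)
    return f"book_part{part + 1}_fegh", offset
```

For example, locate(20_000) returns ("book_part2_fegh", 0), the first example of the second split.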

Dataset size

  • Download size: 647,726,723 bytes
  • Dataset size: 1,494,345,725 bytes
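
The per-split sizes and the two totals can be cross-checked with a few lines of arithmetic. The sketch below sums the listed split sizes, compares the result with the stated dataset size, and computes the ratio of download size to dataset size. All figures are copied from this page; the variable names are illustrative.

```python
# Per-split sizes in bytes, in order from book_part1_fegh to book_part26_fegh.
SPLIT_BYTES = [
    49_004_718, 62_933_514, 58_078_049, 58_591_383, 42_504_116,
    50_998_384, 52_735_009, 54_972_205, 65_020_286, 54_380_664,
    47_427_339, 48_398_860, 45_573_841, 48_445_623, 50_559_997,
    51_662_992, 50_755_938, 57_893_738, 57_818_764, 65_119_365,
    173_500_719, 53_707_115, 50_702_659, 55_158_664, 50_015_458,
    38_386_325,
]
DOWNLOAD_SIZE = 647_726_723
DATASET_SIZE = 1_494_345_725

# The split sizes should add up to the stated dataset size.
total_bytes = sum(SPLIT_BYTES)
sizes_consistent = total_bytes == DATASET_SIZE

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

# 1-based part number of the largest split.
largest_part = SPLIT_BYTES.index(max(SPLIT_BYTES)) + 1

print(f"sum of splits: {total_bytes:,} bytes (consistent: {sizes_consistent})")
print(f"download / dataset size: {download_ratio:.3f}")
print(f"largest split: book_part{largest_part}_fegh")
```

The sum matches the stated dataset size exactly, and the download is a little under half of that size. book_part21_fegh stands out at roughly three times the size of a typical split despite holding the same 20,000 examples.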

Configuration

  • config_name: default
    • data_files: each split's path follows the pattern data/book_partX_fegh-*, where X is the part number.
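
The naming pattern above can be generated rather than typed out. The sketch below builds the split names and their file globs for parts 1 through 26, and includes an optional loader. The loader assumes the Hugging Face datasets package is installed and that the repository Amirjalaly/books_fegh is reachable on the Hub; the function names split_name, data_files and load_split are hypothetical helpers, not part of the dataset.

```python
from typing import Dict

REPO_ID = "Amirjalaly/books_fegh"
NUM_SPLITS = 26


def split_name(part: int) -> str:
    """Return the split name for a 1-based part number."""
    if not 1 <= part <= NUM_SPLITS:
        raise ValueError(f"part must be in 1..{NUM_SPLITS}, got {part}")
    return f"book_part{part}_fegh"


def data_files() -> Dict[str, str]:
    """Map every split name to its file glob, as in the default config."""
    return {
        split_name(part): f"data/{split_name(part)}-*"
        for part in range(1, NUM_SPLITS + 1)
    }


def load_split(part: int, streaming: bool = True):
    """Load one split from the Hub.

    Assumption: the `datasets` package is installed and the Hub is
    reachable. The import is local so the helpers above work without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_name(part), streaming=streaming)
```

Streaming is the default in this sketch so that a single split can be inspected without first downloading the full archive.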