
P1ayer-1/college_texts_metadata

Hugging Face, updated 2024-01-30, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/P1ayer-1/college_texts_metadata
Resource description:
---
dataset_info:
  features:
  - name: authors
    dtype: string
  - name: color
    sequence: float64
  - name: depth
    dtype: int64
  - name: field
    dtype: string
  - name: id
    dtype: int64
  - name: match_count
    dtype: int64
  - name: position
    sequence: float64
  - name: title
    dtype: string
  - name: hits
    list:
    - name: _id
      dtype: string
    - name: _index
      dtype: string
    - name: _score
      dtype: float64
    - name: _source
      struct:
      - name: aa_lgli_comics_2022_08_file
        dtype: string
      - name: aac_zlib3_book
        dtype: string
      - name: file_unified_data
        struct:
        - name: author_additional
          sequence: 'null'
        - name: author_best
          dtype: string
        - name: classifications_unified
          struct:
          - name: ddc
            sequence: string
          - name: lcc
            sequence: string
          - name: library_and_archives_canada_cataloguing_in_publication
            sequence: string
          - name: nur
            sequence: string
          - name: udc
            sequence: string
        - name: comments_additional
          sequence: 'null'
        - name: comments_best
          dtype: string
        - name: content_type
          dtype: string
        - name: cover_url_additional
          sequence: 'null'
        - name: cover_url_best
          dtype: string
        - name: edition_varia_additional
          sequence: 'null'
        - name: edition_varia_best
          dtype: string
        - name: extension_additional
          sequence: 'null'
        - name: extension_best
          dtype: string
        - name: filesize_additional
          sequence: 'null'
        - name: filesize_best
          dtype: int64
        - name: has_aa_downloads
          dtype: int64
        - name: has_aa_exclusive_downloads
          dtype: int64
        - name: identifiers_unified
          struct:
          - name: abaa
            sequence: string
          - name: abebooks.de
            sequence: string
          - name: abwa_bibliographic_number
            sequence: string
          - name: alibris
            sequence: string
          - name: alibris_id
            sequence: string
          - name: asin
            sequence: string
          - name: bayerische_staatsbibliothek
            sequence: string
          - name: bcid
            sequence: string
          - name: better_world_books
            sequence: string
          - name: bhl
            sequence: string
          - name: bibliothèque_nationale_de_france
            sequence: string
          - name: bibsys
            sequence: string
          - name: bl
            sequence: string
          - name: bnb
            sequence: string
          - name: bodleian,_oxford_university
            sequence: string
          - name: booklocker.com
            sequence: string
          - name: bookmooch
            sequence: string
          - name: booksforyou
            sequence: string
          - name: bookwire
            sequence: string
          - name: boston_public_library
            sequence: string
          - name: canadian_national_library_archive
            sequence: string
          - name: choosebooks
            sequence: string
          - name: cornell_university_library
            sequence: string
          - name: cornell_university_online_library
            sequence: string
          - name: dc_books
            sequence: string
          - name: depósito_legal
            sequence: string
          - name: digital_library_pomerania
            sequence: string
          - name: discovereads
            sequence: string
          - name: dnb
            sequence: string
          - name: dominican_institute_for_oriental_studies_library
            sequence: string
          - name: etsc
            sequence: string
          - name: fennica
            sequence: string
          - name: finnish_public_libraries_classification_system
            sequence: string
          - name: folio
            sequence: string
          - name: freebase
            sequence: string
          - name: gbook
            sequence: string
          - name: goethe_university_library,_frankfurt
            sequence: string
          - name: goodreads
            sequence: string
          - name: grand_comics_database
            sequence: string
          - name: harvard
            sequence: string
          - name: hathi_trust
            sequence: string
          - name: identificativo_sbn
            sequence: string
          - name: ilmiolibro
            sequence: string
          - name: inducks
            sequence: string
          - name: isbn10
            sequence: string
          - name: isbn13
            sequence: string
          - name: isfdbpubideditions
            sequence: string
          - name: issn
            sequence: string
          - name: istc
            sequence: string
          - name: lccn
            sequence: string
          - name: learnawesome
            sequence: string
          - name: library_and_archives_canada_cataloguing_in_publication
            sequence: string
          - name: librarything
            sequence: string
          - name: libris
            sequence: string
          - name: librivox
            sequence: string
          - name: lulu
            sequence: string
          - name: magcloud
            sequence: string
          - name: nbuv
            sequence: string
          - name: ndl
            sequence: string
          - name: nla
            sequence: string
          - name: nur
            sequence: string
          - name: ocaid
            sequence: string
          - name: oclc
            sequence: string
          - name: ol
            sequence: string
          - name: openstax
            sequence: string
          - name: overdrive
            sequence: string
          - name: paperback_swap
            sequence: string
          - name: project_gutenberg
            sequence: string
          - name: publishamerica
            sequence: string
          - name: rvk
            sequence: string
          - name: scribd
            sequence: string
          - name: shelfari
            sequence: string
          - name: siso
            sequence: string
          - name: smashwords_book_download
            sequence: string
          - name: standard_ebooks
            sequence: string
          - name: storygraph
            sequence: string
          - name: ulrls
            sequence: string
          - name: ulrls_classmark
            sequence: string
          - name: w._w._norton
            sequence: string
          - name: wikidata
            sequence: string
          - name: wikisource
            sequence: string
          - name: yakaboo
            sequence: string
          - name: zdb-id
            sequence: string
        - name: language_codes
          sequence: string
        - name: most_likely_language_code
          dtype: string
        - name: original_filename_additional
          sequence: 'null'
        - name: original_filename_best
          dtype: string
        - name: original_filename_best_name_only
          dtype: string
        - name: problems
          sequence: 'null'
        - name: publisher_additional
          sequence: 'null'
        - name: publisher_best
          dtype: string
        - name: stripped_description_additional
          sequence: 'null'
        - name: stripped_description_best
          dtype: string
        - name: title_additional
          sequence: 'null'
        - name: title_best
          dtype: string
        - name: year_additional
          sequence: 'null'
        - name: year_best
          dtype: string
      - name: ia_record
        dtype: string
      - name: id
        dtype: string
      - name: indexes
        sequence: string
      - name: ipfs_infos
        sequence: 'null'
      - name: isbndb
        sequence: 'null'
      - name: lgli_file
        dtype: string
      - name: lgrsfic_book
        dtype: string
      - name: lgrsnf_book
        dtype: string
      - name: ol
        list:
        - name: ol_edition
          dtype: string
      - name: scihub_doi
        sequence: 'null'
      - name: search_only_fields
        struct:
        - name: search_access_types
          sequence: string
        - name: search_content_type
          dtype: string
        - name: search_doi
          sequence: 'null'
        - name: search_extension
          dtype: string
        - name: search_filesize
          dtype: int64
        - name: search_isbn13
          sequence: string
        - name: search_most_likely_language_code
          dtype: string
        - name: search_record_sources
          sequence: string
        - name: search_score_base
          dtype: float64
        - name: search_score_base_rank
          dtype: float64
        - name: search_text
          dtype: string
        - name: search_year
          dtype: string
      - name: zlib_book
        dtype: string
  splits:
  - name: train
    num_bytes: 2050799295
    num_examples: 565533
  download_size: 354984240
  dataset_size: 2050799295
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
Provider:
P1ayer-1
Summary of the original information

Dataset information

Features

  • authors: string
  • color: sequence of float64
  • depth: int64
  • field: string
  • id: int64
  • match_count: int64
  • position: sequence of float64
  • title: string
  • hits: list, with the following fields:
    • _id: string
    • _index: string
    • _score: float64
    • _source: struct, with the following fields:
      • aa_lgli_comics_2022_08_file: string
      • aac_zlib3_book: string
      • file_unified_data: struct, with the following fields:
        • author_additional: sequence of null
        • author_best: string
        • classifications_unified: struct, with the following fields:
          • ddc: sequence of string
          • lcc: sequence of string
          • library_and_archives_canada_cataloguing_in_publication: sequence of string
          • nur: sequence of string
          • udc: sequence of string
        • comments_additional: sequence of null
        • comments_best: string
        • content_type: string
        • cover_url_additional: sequence of null
        • cover_url_best: string
        • edition_varia_additional: sequence of null
        • edition_varia_best: string
        • extension_additional: sequence of null
        • extension_best: string
        • filesize_additional: sequence of null
        • filesize_best: int64
        • has_aa_downloads: int64
        • has_aa_exclusive_downloads: int64
        • identifiers_unified: struct, with the following fields:
          • abaa: sequence of string
          • abebooks.de: sequence of string
          • abwa_bibliographic_number: sequence of string
          • alibris: sequence of string
          • alibris_id: sequence of string
          • asin: sequence of string
          • bayerische_staatsbibliothek: sequence of string
          • bcid: sequence of string
          • better_world_books: sequence of string
          • bhl: sequence of string
          • bibliothèque_nationale_de_france: sequence of string
          • bibsys: sequence of string
          • bl: sequence of string
          • bnb: sequence of string
          • bodleian,_oxford_university: sequence of string
          • booklocker.com: sequence of string
          • bookmooch: sequence of string
          • booksforyou: sequence of string
          • bookwire: sequence of string
          • boston_public_library: sequence of string
          • canadian_national_library_archive: sequence of string
          • choosebooks: sequence of string
          • cornell_university_library: sequence of string
          • cornell_university_online_library: sequence of string
          • dc_books: sequence of string
          • depósito_legal: sequence of string
          • digital_library_pomerania: sequence of string
          • discovereads: sequence of string
          • dnb: sequence of string
          • dominican_institute_for_oriental_studies_library: sequence of string
          • etsc: sequence of string
          • fennica: sequence of string
          • finnish_public_libraries_classification_system: sequence of string
          • folio: sequence of string
          • freebase: sequence of string
          • gbook: sequence of string
          • goethe_university_library,_frankfurt: sequence of string
          • goodreads: sequence of string
          • grand_comics_database: sequence of string
          • harvard: sequence of string
          • hathi_trust: sequence of string
          • identificativo_sbn: sequence of string
          • ilmiolibro: sequence of string
          • inducks: sequence of string
          • isbn10: sequence of string
          • isbn13: sequence of string
          • isfdbpubideditions: sequence of string
          • issn: sequence of string
          • istc: sequence of string
          • lccn: sequence of string
          • learnawesome: sequence of string
          • library_and_archives_canada_cataloguing_in_publication: sequence of string
          • librarything: sequence of string
          • libris: sequence of string
          • librivox: sequence of string
          • lulu: sequence of string
          • magcloud: sequence of string
          • nbuv: sequence of string
          • ndl: sequence of string
          • nla: sequence of string
          • nur: sequence of string
          • ocaid: sequence of string
          • oclc: sequence of string
          • ol: sequence of string
          • openstax: sequence of string
          • overdrive: sequence of string
          • paperback_swap: sequence of string
          • project_gutenberg: sequence of string
          • publishamerica: sequence of string
          • rvk: sequence of string
          • scribd: sequence of string
          • shelfari: sequence of string
          • siso: sequence of string
          • smashwords_book_download: sequence of string
          • standard_ebooks: sequence of string
          • storygraph: sequence of string
          • ulrls: sequence of string
          • ulrls_classmark: sequence of string
          • w._w._norton: sequence of string
          • wikidata: sequence of string
          • wikisource: sequence of string
          • yakaboo: sequence of string
          • zdb-id: sequence of string
        • language_codes: sequence of string
        • most_likely_language_code: string
        • original_filename_additional: sequence of null
        • original_filename_best: string
        • original_filename_best_name_only: string
        • problems: sequence of null
        • publisher_additional: sequence of null
        • publisher_best: string
        • stripped_description_additional: sequence of null
        • stripped_description_best: string
        • title_additional: sequence of null
        • title_best: string
        • year_additional: sequence of null
        • year_best: string
      • ia_record: string
      • id: string
      • indexes: sequence of string
      • ipfs_infos: sequence of null
      • isbndb: sequence of null
      • lgli_file: string
      • lgrsfic_book: string
      • lgrsnf_book: string
      • ol: list, with the following fields:
        • ol_edition: string
      • scihub_doi: sequence of null
      • search_only_fields: struct, with the following fields:
        • search_access_types: sequence of string
        • search_content_type: string
        • search_doi: sequence of null
        • search_extension: string
        • search_filesize: int64
        • search_isbn13: sequence of string
        • search_most_likely_language_code: string
        • search_record_sources: sequence of string
        • search_score_base: float64
        • search_score_base_rank: float64
        • search_text: string
        • search_year: string
      • zlib_book: string
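
In this schema each top-level row carries a list of hits, and the bibliographic fields sit two structs deep, under _source and then file_unified_data. The sketch below walks that nesting on a hand-written record. Only the field names come from the schema; every value in sample_row is invented for illustration, the helper names iter_hit_summaries and best_hit are hypothetical, and it assumes a row is exposed as plain nested dicts and lists in which any field may be missing or None.

```python
# Hypothetical record: field names follow the schema, values are made up.
sample_row = {
    "id": 1,
    "title": "Linear Algebra",
    "field": "Mathematics",
    "match_count": 2,
    "hits": [
        {
            "_id": "md5:aaaa",
            "_score": 12.5,
            "_source": {
                "file_unified_data": {
                    "title_best": "Introduction to Linear Algebra",
                    "extension_best": "pdf",
                    "identifiers_unified": {"isbn13": ["9780000000002"]},
                },
            },
        },
        {
            "_id": "md5:bbbb",
            "_score": 7.0,
            "_source": {
                "file_unified_data": {
                    "title_best": "Linear Algebra Done Quickly",
                    "extension_best": "epub",
                    "identifiers_unified": {"isbn13": None},
                },
            },
        },
    ],
}


def iter_hit_summaries(row):
    """Yield a flat summary dict for every hit in a row, tolerating missing fields."""
    for hit in row.get("hits") or []:
        unified = (hit.get("_source") or {}).get("file_unified_data") or {}
        identifiers = unified.get("identifiers_unified") or {}
        yield {
            "score": hit.get("_score"),
            "title": unified.get("title_best"),
            "extension": unified.get("extension_best"),
            "isbn13": list(identifiers.get("isbn13") or []),
        }


def best_hit(row):
    """Return the summary with the highest _score, or None when there are no hits."""
    def score_key(summary):
        score = summary["score"]
        return score if score is not None else float("-inf")

    return max(iter_hit_summaries(row), key=score_key, default=None)
```

Calling best_hit(sample_row) returns the summary of the first hit, since its _score is the larger of the two. Flattening hits this way keeps the long identifiers_unified struct out of downstream code, which only needs to name the identifiers it cares about.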

Data splits

  • train: 565533 examples, 2050799295 bytes in total

Dataset size

  • Download size: 354984240 bytes
  • Dataset size: 2050799295 bytes
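
The three figures above can be turned into two quick planning numbers: the average size of one example and the ratio between the dataset size and the download size. The sketch below is plain arithmetic on the listed values. Reading the ratio as a compression factor assumes that the download size counts the stored files and the dataset size counts the prepared data, which this page does not state.

```python
# Figures copied from the split and size listing.
NUM_EXAMPLES = 565_533
DATASET_SIZE_BYTES = 2_050_799_295
DOWNLOAD_SIZE_BYTES = 354_984_240

# Average footprint of one example in the prepared dataset.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# How many times larger the prepared dataset is than the download.
size_ratio = DATASET_SIZE_BYTES / DOWNLOAD_SIZE_BYTES

print(f"dataset size:  {DATASET_SIZE_BYTES / 1e9:.2f} GB")
print(f"download size: {DOWNLOAD_SIZE_BYTES / 1e9:.2f} GB")
print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"dataset/download ratio:    {size_ratio:.2f}")
```

This works out to roughly 3.6 KB per example and a prepared dataset a little under six times the size of the download, so about 2 GB of free space is a reasonable lower bound before materializing the full train split.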

Configuration

  • config_name: default
    • data_files:
      • split: train
      • path: data/train-*
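
With a single default config and a single train split, loading comes down to one call. The sketch below uses load_dataset from the Hugging Face datasets library in streaming mode so that the full download is not needed for a first look. The helper names dataset_page_url, stream_train_split and preview are hypothetical, and routing traffic through the mirror by setting the HF_ENDPOINT environment variable is an assumption about how the hub client is configured, not something this page documents.

```python
import os
from itertools import islice

REPO_ID = "P1ayer-1/college_texts_metadata"
MIRROR_ENDPOINT = "https://hf-mirror.com"  # host taken from the download link


def dataset_page_url(endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build the dataset page URL for a given hub endpoint."""
    return f"{endpoint.rstrip('/')}/datasets/{REPO_ID}"


def stream_train_split(use_mirror: bool = False):
    """Return a streaming iterable over the train split (needs the datasets package)."""
    if use_mirror:
        # Assumption: the hub client reads HF_ENDPOINT, so it is set before the import.
        os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    from datasets import load_dataset  # imported lazily, after the endpoint is set

    return load_dataset(REPO_ID, name="default", split="train", streaming=True)


def preview(n: int = 3, use_mirror: bool = False) -> list:
    """Fetch the first n rows and keep a few top-level fields (requires network access)."""
    return [
        {"title": row["title"], "field": row["field"], "match_count": row["match_count"]}
        for row in islice(stream_train_split(use_mirror), n)
    ]
```

Calling preview(3) fetches three rows over the network. Streaming avoids materializing all 565533 examples, which is convenient given how deeply nested each hits entry is.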