
davanstrien/blbooks-parquet-embedded

Source: Hugging Face. Updated: 2023-07-14. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/davanstrien/blbooks-parquet-embedded
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- machine-generated
language:
- de
- en
- es
- fr
- it
- nl
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets: davanstrien/blbooks-parquet
task_categories:
- text-generation
- fill-mask
- other
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: British Library Books
tags:
- embeddings
dataset_info:
- config_name: all
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: int32
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 30394267732
    num_examples: 14011953
  download_size: 10486035662
  dataset_size: 30394267732
- config_name: 1800s
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: int32
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 30020434670
    num_examples: 13781747
  download_size: 10348577602
  dataset_size: 30020434670
- config_name: 1700s
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: int32
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 266382657
    num_examples: 178224
  download_size: 95137895
  dataset_size: 266382657
- config_name: '1510_1699'
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: timestamp[s]
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 107667469
    num_examples: 51982
  download_size: 42320165
  dataset_size: 107667469
- config_name: '1500_1899'
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: timestamp[s]
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 30452067039
    num_examples: 14011953
  download_size: 10486035662
  dataset_size: 30452067039
- config_name: '1800_1899'
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: timestamp[s]
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 30077284377
    num_examples: 13781747
  download_size: 10348577602
  dataset_size: 30077284377
- config_name: '1700_1799'
  features:
  - name: record_id
    dtype: string
  - name: date
    dtype: timestamp[s]
  - name: raw_date
    dtype: string
  - name: title
    dtype: string
  - name: place
    dtype: string
  - name: empty_pg
    dtype: bool
  - name: text
    dtype: string
  - name: pg
    dtype: int32
  - name: mean_wc_ocr
    dtype: float32
  - name: std_wc_ocr
    dtype: float64
  - name: name
    dtype: string
  - name: all_names
    dtype: string
  - name: Publisher
    dtype: string
  - name: Country of publication 1
    dtype: string
  - name: all Countries of publication
    dtype: string
  - name: Physical description
    dtype: string
  - name: Language_1
    dtype: string
  - name: Language_2
    dtype: string
  - name: Language_3
    dtype: string
  - name: Language_4
    dtype: string
  - name: multi_language
    dtype: bool
  splits:
  - name: train
    num_bytes: 267117831
    num_examples: 178224
  download_size: 95137895
  dataset_size: 267117831
---

# Dataset Card for "blbooks-parquet-embedded"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
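The card itself gives no usage instructions, so the following is only a minimal sketch of how one configuration might be opened. It assumes the Hugging Face `datasets` library is installed and that the repository is still reachable under the ID shown in the card. The helper names `check_config` and `load_blbooks` are hypothetical, and streaming is chosen here only because the larger configurations report download sizes above 10 GB.

```python
# Configuration names copied from the dataset card's YAML header.
CONFIGS = (
    "all", "1800s", "1700s",
    "1510_1699", "1500_1899", "1800_1899", "1700_1799",
)
REPO_ID = "davanstrien/blbooks-parquet-embedded"


def check_config(name: str) -> str:
    """Return the name unchanged if the card lists it, else raise ValueError."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    return name


def load_blbooks(config: str = "1700s", streaming: bool = True):
    """Hypothetical helper: open the train split of one configuration."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        check_config(config),
        split="train",  # the card lists a single split per configuration
        streaming=streaming,
    )
```

With streaming enabled, `load_blbooks("1700s")` should return an iterable whose items are dictionaries keyed by the feature names in the card, so `next(iter(ds))["title"]` would read one page record without downloading the whole configuration first.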
Provided by:
davanstrien
Summary of original information

Dataset overview

Basic information

  • Name: British Library Books
  • Languages: multilingual (de, en, es, fr, it, nl)
  • License: CC0-1.0
  • Multilinguality: multilingual
  • Size: 100K<n<1M
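The size field is a bucket label rather than an exact row count. As a rough illustration, the sketch below maps a row count to such a label, assuming the power-of-ten bucket scheme that labels like `100K<n<1M` suggest. The function name `size_category` and the bucket table are assumptions made for this example, not something the card defines.

```python
# Upper bounds (exclusive) and labels, assuming power-of-ten buckets.
_BUCKETS = [
    (10**3, "n<1K"),
    (10**4, "1K<n<10K"),
    (10**5, "10K<n<100K"),
    (10**6, "100K<n<1M"),
    (10**7, "1M<n<10M"),
    (10**8, "10M<n<100M"),
    (10**9, "100M<n<1B"),
]


def size_category(num_examples: int) -> str:
    """Return the bucket label for a row count (hypothetical helper)."""
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    for upper, label in _BUCKETS:
        if num_examples < upper:
            return label
    raise ValueError("row count is beyond the buckets sketched here")
```

Applied to the row counts the card reports, the 1700s configuration (178,224 rows) falls in `100K<n<1M`, while the `all` configuration (14,011,953 rows) falls in `10M<n<100M`.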

Source and tasks

  • Source dataset: davanstrien/blbooks-parquet
  • Task categories: text-generation, fill-mask, other
  • Task IDs: language-modeling, masked-language-modeling

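Dataset configurations and features

  • Configuration names: all, 1800s, 1700s, 1510_1699, 1500_1899, 1800_1899, 1700_1799
  • Features:
    • record_id: string
    • date: int32 (configurations all, 1800s, 1700s) or timestamp[s] (configurations 1510_1699, 1500_1899, 1800_1899, 1700_1799)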
    • raw_date: string
    • title: string
    • place: string
    • empty_pg: bool
    • text: string
    • pg: int32
    • mean_wc_ocr: float32
    • std_wc_ocr: float64
    • name: string
    • all_names: string
    • Publisher: string
    • Country of publication 1: string
    • all Countries of publication: string
    • Physical description: string
    • Language_1, Language_2, Language_3, Language_4: string
    • multi_language: bool
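Because `date` is an integer in some configurations and a timestamp in others, code that mixes configurations may want to normalise it. The sketch below writes the feature list as a plain mapping and adds two hypothetical helpers, `year_of` and `missing_columns`. It assumes the `int32` values are calendar years and that timestamp values arrive as Python `datetime` objects; the card confirms neither.

```python
from datetime import date, datetime
from typing import Any, Dict, List, Optional

# Column name -> dtype string, as listed in the card (date varies by config).
FEATURES: Dict[str, str] = {
    "record_id": "string",
    "date": "int32 | timestamp[s]",
    "raw_date": "string",
    "title": "string",
    "place": "string",
    "empty_pg": "bool",
    "text": "string",
    "pg": "int32",
    "mean_wc_ocr": "float32",
    "std_wc_ocr": "float64",
    "name": "string",
    "all_names": "string",
    "Publisher": "string",
    "Country of publication 1": "string",
    "all Countries of publication": "string",
    "Physical description": "string",
    "Language_1": "string",
    "Language_2": "string",
    "Language_3": "string",
    "Language_4": "string",
    "multi_language": "bool",
}


def year_of(value: Any) -> Optional[int]:
    """Return a year from either date representation (assumption: ints are years)."""
    if value is None:
        return None
    if isinstance(value, (datetime, date)):
        return value.year
    return int(value)


def missing_columns(record: Dict[str, Any]) -> List[str]:
    """List the card's columns that are absent from one record."""
    return [col for col in FEATURES if col not in record]
```

For example, `year_of(1776)` and `year_of(datetime(1776, 1, 1))` both return `1776`, so downstream filters can treat the two families of configurations uniformly.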

Dataset splits

  • Train split:
    • all:
      • num_bytes: 30394267732
      • num_examples: 14011953
      • download_size: 10486035662
      • dataset_size: 30394267732
    • 1800s:
      • num_bytes: 30020434670
      • num_examples: 13781747
      • download_size: 10348577602
      • dataset_size: 30020434670
    • 1700s:
      • num_bytes: 266382657
      • num_examples: 178224
      • download_size: 95137895
      • dataset_size: 266382657
    • 1510_1699:
      • num_bytes: 107667469
      • num_examples: 51982
      • download_size: 42320165
      • dataset_size: 107667469
    • 1500_1899:
      • num_bytes: 30452067039
      • num_examples: 14011953
      • download_size: 10486035662
      • dataset_size: 30452067039
    • 1800_1899:
      • num_bytes: 30077284377
      • num_examples: 13781747
      • download_size: 10348577602
      • dataset_size: 30077284377
    • 1700_1799:
      • num_bytes: 267117831
      • num_examples: 178224
      • download_size: 95137895
      • dataset_size: 267117831
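The figures above lend themselves to a few quick sanity checks. The sketch below copies them into a table and derives the download-to-dataset size ratio and the average bytes per row. The names `SPLITS`, `compression_ratio`, and `bytes_per_example` are chosen for this example only.

```python
# config -> (num_examples, download_size, dataset_size), copied from the list above.
SPLITS = {
    "all": (14011953, 10486035662, 30394267732),
    "1800s": (13781747, 10348577602, 30020434670),
    "1700s": (178224, 95137895, 266382657),
    "1510_1699": (51982, 42320165, 107667469),
    "1500_1899": (14011953, 10486035662, 30452067039),
    "1800_1899": (13781747, 10348577602, 30077284377),
    "1700_1799": (178224, 95137895, 267117831),
}


def compression_ratio(config: str) -> float:
    """Download size divided by in-memory dataset size."""
    _, download, dataset = SPLITS[config]
    return download / dataset


def bytes_per_example(config: str) -> float:
    """Average dataset bytes per row."""
    n, _, dataset = SPLITS[config]
    return dataset / n


# Row counts of the three period configurations that use an int32 date.
parts_total = sum(SPLITS[c][0] for c in ("1800s", "1700s", "1510_1699"))
```

The three period configurations add up to the row count of `all`: 13,781,747 + 178,224 + 51,982 = 14,011,953. The byte totals do not add up in the same way, so only the row counts should be treated as a consistency check.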

Tags

  • Tags: embeddings