davanstrien/blbooks-parquet-embedded
收藏Hugging Face2023-07-14 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/davanstrien/blbooks-parquet-embedded
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- no-annotation
language_creators:
- machine-generated
language:
- de
- en
- es
- fr
- it
- nl
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets: davanstrien/blbooks-parquet
task_categories:
- text-generation
- fill-mask
- other
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: British Library Books
tags:
- embeddings
dataset_info:
- config_name: all
features:
- name: record_id
dtype: string
- name: date
dtype: int32
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 30394267732
num_examples: 14011953
download_size: 10486035662
dataset_size: 30394267732
- config_name: 1800s
features:
- name: record_id
dtype: string
- name: date
dtype: int32
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 30020434670
num_examples: 13781747
download_size: 10348577602
dataset_size: 30020434670
- config_name: 1700s
features:
- name: record_id
dtype: string
- name: date
dtype: int32
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 266382657
num_examples: 178224
download_size: 95137895
dataset_size: 266382657
- config_name: '1510_1699'
features:
- name: record_id
dtype: string
- name: date
dtype: timestamp[s]
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 107667469
num_examples: 51982
download_size: 42320165
dataset_size: 107667469
- config_name: '1500_1899'
features:
- name: record_id
dtype: string
- name: date
dtype: timestamp[s]
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 30452067039
num_examples: 14011953
download_size: 10486035662
dataset_size: 30452067039
- config_name: '1800_1899'
features:
- name: record_id
dtype: string
- name: date
dtype: timestamp[s]
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 30077284377
num_examples: 13781747
download_size: 10348577602
dataset_size: 30077284377
- config_name: '1700_1799'
features:
- name: record_id
dtype: string
- name: date
dtype: timestamp[s]
- name: raw_date
dtype: string
- name: title
dtype: string
- name: place
dtype: string
- name: empty_pg
dtype: bool
- name: text
dtype: string
- name: pg
dtype: int32
- name: mean_wc_ocr
dtype: float32
- name: std_wc_ocr
dtype: float64
- name: name
dtype: string
- name: all_names
dtype: string
- name: Publisher
dtype: string
- name: Country of publication 1
dtype: string
- name: all Countries of publication
dtype: string
- name: Physical description
dtype: string
- name: Language_1
dtype: string
- name: Language_2
dtype: string
- name: Language_3
dtype: string
- name: Language_4
dtype: string
- name: multi_language
dtype: bool
splits:
- name: train
num_bytes: 267117831
num_examples: 178224
download_size: 95137895
dataset_size: 267117831
---
# Dataset Card for "blbooks-parquet-embedded"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
提供机构:
davanstrien
原始信息汇总
数据集概述
基本信息
- 名称: British Library Books
- 语言: 多语言(de, en, es, fr, it, nl)
- 许可证: CC0-1.0
- 多语言性: 多语言
- 大小: 100K<n<1M
来源与任务
- 来源数据集: davanstrien/blbooks-parquet
- 任务类别: text-generation, fill-mask, other
- 任务ID: language-modeling, masked-language-modeling
数据集配置与特征
- 配置名称: all, 1800s, 1700s, 1510_1699, 1500_1899, 1800_1899, 1700_1799
- 特征:
- record_id: string
- date: int32 或 timestamp[s]
- raw_date: string
- title: string
- place: string
- empty_pg: bool
- text: string
- pg: int32
- mean_wc_ocr: float32
- std_wc_ocr: float64
- name: string
- all_names: string
- Publisher: string
- Country of publication 1: string
- all Countries of publication: string
- Physical description: string
- Language_1, Language_2, Language_3, Language_4: string
- multi_language: bool
数据集拆分
- 训练集:
- all:
- num_bytes: 30394267732
- num_examples: 14011953
- download_size: 10486035662
- dataset_size: 30394267732
- 1800s:
- num_bytes: 30020434670
- num_examples: 13781747
- download_size: 10348577602
- dataset_size: 30020434670
- 1700s:
- num_bytes: 266382657
- num_examples: 178224
- download_size: 95137895
- dataset_size: 266382657
- 1510_1699:
- num_bytes: 107667469
- num_examples: 51982
- download_size: 42320165
- dataset_size: 107667469
- 1500_1899:
- num_bytes: 30452067039
- num_examples: 14011953
- download_size: 10486035662
- dataset_size: 30452067039
- 1800_1899:
- num_bytes: 30077284377
- num_examples: 13781747
- download_size: 10348577602
- dataset_size: 30077284377
- 1700_1799:
- num_bytes: 267117831
- num_examples: 178224
- download_size: 95137895
- dataset_size: 267117831
- all:
标签
- 标签: embeddings



