storytracer/openlibrary_dump_2024-04-30
Source: Hugging Face. Updated 2024-05-26; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/storytracer/openlibrary_dump_2024-04-30
Resource overview:
---
license: unknown
license_link: https://openlibrary.org/developers/licensing
pretty_name: OpenLibrary Dump (2024-04-30)
size_categories:
- 10M<n<100M
configs:
- config_name: dumps
  default: true
  data_files:
  - split: authors
    path: "data/parquet/ol_dump_authors_2024-04-30.parquet"
  - split: works
    path: "data/parquet/ol_dump_works_2024-04-30.parquet"
  - split: editions
    path: "data/parquet/ol_dump_editions_2024-04-30.parquet"
- config_name: summaries
  data_files:
  - split: authors
    path: "summaries/parquet/ol_dump_authors_2024-04-30_summary.parquet"
  - split: works
    path: "summaries/parquet/ol_dump_works_2024-04-30_summary.parquet"
  - split: editions
    path: "summaries/parquet/ol_dump_editions_2024-04-30_summary.parquet"
---
# OpenLibrary Dump (2024-04-30)
This dataset contains the [OpenLibrary dump](https://openlibrary.org/developers/dumps) of April 2024 converted to Parquet and DuckDB for easier querying.
## Formats
### Original GZIP dumps
The original GZIP dumps are available at [data/dumps](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/data/dumps). The dumps are gzipped TSV files with the original OL JSON record contained in the fifth column of the TSV.
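For readers who want to work with the gzipped TSV files directly, the following is a minimal sketch of reading the JSON record from the fifth column using only the Python standard library. The function name `iter_ol_records` is hypothetical, and the sketch assumes one record per line with tab-separated columns; it is not taken from the dataset's own scripts.

```python
import gzip
import json
from typing import Iterator


def iter_ol_records(path: str) -> Iterator[dict]:
    """Yield the JSON record stored in the fifth column of each TSV row.

    Assumption: every row is one line with tab-separated columns.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            # Split at most four times so the fifth column keeps the rest of the line.
            columns = line.rstrip("\n").split("\t", 4)
            if len(columns) < 5:
                continue  # skip rows without a fifth column instead of failing
            yield json.loads(columns[4])
```

Because the function is a generator, records are decoded one at a time, so a large dump does not have to fit in memory.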
### DuckDB
The authors, works and editions dumps were imported as tables into [data/duckdb/ol_dump_2024-04-30.duckdb](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/blob/main/data/duckdb/ol_dump_2024-04-30.duckdb) using the script [ol_duckdb.sh](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/blob/main/ol_duckdb.sh). The script extracts the JSON record from the fifth column and pipes it to DuckDB, which imports it with the flags ```union_by_name=true``` and ```ignore_errors=true``` to account for the inconsistent JSON structure of the dumps.
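The import step can be approximated in Python as follows. This is a sketch under stated assumptions, not a reproduction of `ol_duckdb.sh`: it assumes the fifth-column JSON has already been written to a newline-delimited JSON file, and the helper names `build_import_sql` and `import_table` are hypothetical. The `read_json` options `format`, `union_by_name` and `ignore_errors` are DuckDB's own; the `duckdb` Python package is a third-party dependency and is imported only when the import is actually run.

```python
import re


def build_import_sql(table: str, ndjson_path: str) -> str:
    """Build a CREATE TABLE statement that reads newline-delimited JSON."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    escaped_path = ndjson_path.replace("'", "''")  # escape quotes for the SQL literal
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_json('{escaped_path}', "
        "format = 'newline_delimited', "
        "union_by_name = true, "  # the flag named in the text above
        "ignore_errors = true)"   # the flag named in the text above
    )


def import_table(db_path: str, table: str, ndjson_path: str) -> None:
    """Run the import against a DuckDB database file (hypothetical helper)."""
    import duckdb  # third-party; imported lazily so the SQL builder works without it

    con = duckdb.connect(db_path)
    try:
        con.execute(build_import_sql(table, ndjson_path))
    finally:
        con.close()
```

Separating the SQL builder from the execution step makes it easy to inspect the statement before running it on a multi-gigabyte dump.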
### Parquet
The authors, works and editions tables were exported as Parquet from DuckDB to [data/parquet](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/data/parquet). These Parquet files are contained in the default ```dumps``` config for this dataset.
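As an illustration of how the Parquet files might be reached, the sketch below builds direct download URLs from the paths listed in the dataset's config and optionally streams a split with the third-party `datasets` library. The helper names `parquet_url` and `stream_split` are hypothetical, and the URL pattern assumes the files live on the repository's `main` branch.

```python
REPO_ID = "storytracer/openlibrary_dump_2024-04-30"

# Paths copied from the `dumps` config in the dataset card's front matter.
PARQUET_PATHS = {
    "authors": "data/parquet/ol_dump_authors_2024-04-30.parquet",
    "works": "data/parquet/ol_dump_works_2024-04-30.parquet",
    "editions": "data/parquet/ol_dump_editions_2024-04-30.parquet",
}


def parquet_url(split: str) -> str:
    """Return a direct download URL for one split (assumes the `main` branch)."""
    path = PARQUET_PATHS[split]  # raises KeyError for an unknown split
    return f"https://huggingface.co/datasets/{REPO_ID}/resolve/main/{path}"


def stream_split(split: str):
    """Stream one split without downloading the whole file up front."""
    from datasets import load_dataset  # third-party; imported lazily

    return load_dataset(REPO_ID, "dumps", split=split, streaming=True)
```

Streaming is a reasonable default here because the dataset falls in the 10M to 100M row range, so loading a full split into memory may not be practical on a small machine.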
### Table Summaries
To give users an easy overview of the fields contained in the dump, the DuckDB tables have been summarized with the ```SUMMARIZE``` statement. The results are available as Markdown and Parquet files at [summaries](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/summaries), with the summary columns ```min```, ```max``` and ```avg``` excluded for easier viewing. You can also explore the table summaries in the dataset viewer by selecting the ```summaries``` config.
The dump fields are supposed to be consistent with the schema referenced [in the documentation](https://openlibrary.org/developers/dumps#:~:text=Format%20of%20JSON%20records). However, the summaries show that they are not: the dumps contain some undocumented fields, and some fields contain almost exclusively ```null``` values.
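One way to spot the mostly-null fields is to filter the summary rows by their null percentage. The sketch below assumes each summary row is a mapping with `column_name` and `null_percentage` keys, which are the names DuckDB's `SUMMARIZE` output uses; whether the published summary files keep exactly these names has not been verified here. The function name and the 99% threshold are illustrative choices.

```python
from typing import Iterable, List, Mapping


def mostly_null_columns(
    summary_rows: Iterable[Mapping], threshold: float = 99.0
) -> List[str]:
    """Return the names of columns whose null percentage meets the threshold.

    Assumption: rows carry `column_name` and `null_percentage` keys.
    """
    names = []
    for row in summary_rows:
        pct = row.get("null_percentage")
        # float() accepts int, float and Decimal values alike.
        if pct is not None and float(pct) >= threshold:
            names.append(row["column_name"])
    return names
```

The rows could come from any reader that yields dictionaries, for example by iterating over a split of the `summaries` config.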
Provider:
storytracer
Summary of original information
OpenLibrary Dump (2024-04-30) dataset overview
Basic information
- License: unknown
- Dataset name: OpenLibrary Dump (2024-04-30)
- Size: 10M<n<100M
Data configuration
- Default config: dumps
  - Data files:
    - authors: "data/parquet/ol_dump_authors_2024-04-30.parquet"
    - works: "data/parquet/ol_dump_works_2024-04-30.parquet"
    - editions: "data/parquet/ol_dump_editions_2024-04-30.parquet"
- Config: summaries
  - Data files:
    - authors: "summaries/parquet/ol_dump_authors_2024-04-30_summary.parquet"
    - works: "summaries/parquet/ol_dump_works_2024-04-30_summary.parquet"
    - editions: "summaries/parquet/ol_dump_editions_2024-04-30_summary.parquet"
Data formats
- Original format: gzipped TSV files containing the original OL JSON records.
- Converted formats:
  - DuckDB: imported into data/duckdb/ol_dump_2024-04-30.duckdb using the script ol_duckdb.sh.
  - Parquet: exported from DuckDB to the data/parquet directory.
  - Table summaries: Markdown and Parquet files generated with SUMMARIZE, located in the summaries directory.
Data consistency
- Issue: the fields in the dataset do not fully match the documentation. The dumps contain some undocumented fields, and some fields are almost entirely null.

The content above was collected and summarized by 遇见数据集.



