storytracer/openlibrary_dump_2024-04-30

Name: storytracer/openlibrary_dump_2024-04-30
Creator: storytracer
Published: 2024-05-26 14:28:32
License: No description available

Source: Hugging Face. Updated: 2024-05-26. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/storytracer/openlibrary_dump_2024-04-30

Resource description:

---
license: unknown
license_link: https://openlibrary.org/developers/licensing
pretty_name: OpenLibrary Dump (2024-04-30)
size_categories:
- 10M<n<100M
configs:
- config_name: dumps
  default: true
  data_files:
  - split: authors
    path: "data/parquet/ol_dump_authors_2024-04-30.parquet"
  - split: works
    path: "data/parquet/ol_dump_works_2024-04-30.parquet"
  - split: editions
    path: "data/parquet/ol_dump_editions_2024-04-30.parquet"
- config_name: summaries
  data_files:
  - split: authors
    path: "summaries/parquet/ol_dump_authors_2024-04-30_summary.parquet"
  - split: works
    path: "summaries/parquet/ol_dump_works_2024-04-30_summary.parquet"
  - split: editions
    path: "summaries/parquet/ol_dump_editions_2024-04-30_summary.parquet"
---

# OpenLibrary Dump (2024-04-30)

This dataset contains the [OpenLibrary dump](https://openlibrary.org/developers/dumps) of April 2024, converted to Parquet and DuckDB for easier querying.

## Formats

### Original GZIP dumps

The original GZIP dumps are available at [data/dumps](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/data/dumps). The dumps are gzipped TSV files, with the original OL JSON record contained in the fifth column of the TSV.

### DuckDB

The authors, works and editions dumps were imported as tables into [data/duckdb/ol_dump_2024-04-30.duckdb](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/blob/main/data/duckdb/ol_dump_2024-04-30.duckdb) using the script [ol_duckdb.sh](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/blob/main/ol_duckdb.sh). The script extracts the JSON record from the fifth column and pipes it to DuckDB, which imports it with the flags ```union_by_name=true``` and ```ignore_errors=true``` to account for the inconsistent JSON structure of the dumps.

### Parquet

The authors, works and editions tables were exported as Parquet from DuckDB to [data/parquet](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/data/parquet). These Parquet files are contained in the default ```dumps``` config for this dataset.

### Table Summaries

To give users an easy overview of the fields contained in the dump, the DuckDB tables have been summarized using the ```SUMMARIZE``` function. The summaries are available as Markdown and Parquet files at [summaries](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/summaries), with the summary columns ```min```, ```max``` and ```avg``` excluded for easier viewing. You can also explore the table summaries in the dataset viewer by selecting the ```summaries``` config.

The dump fields are supposed to be consistent with the schema referenced [in the documentation](https://openlibrary.org/developers/dumps#:~:text=Format%20of%20JSON%20records). However, the summaries show that they are not: the dumps contain some undocumented fields, and some of the fields contain almost exclusively ```null``` values.
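The kind of inconsistency described above can be checked on any sample of records. The following is a minimal sketch, not the method used to produce the published summaries: it computes, for every field seen across a list of JSON records, the fraction of records in which that field is missing or `null`. The function name `null_fraction` and the sample records are hypothetical and only illustrate the idea.

```python
from collections import Counter


def null_fraction(records):
    """Return {field: fraction of records where the field is absent or None}."""
    total = len(records)
    if total == 0:
        return {}
    # Take the union of all keys, since the records do not share one schema.
    fields = set()
    for record in records:
        fields.update(record)
    # Count, per field, how many records carry a non-null value.
    present = Counter()
    for record in records:
        for field in fields:
            if record.get(field) is not None:
                present[field] += 1
    return {field: 1 - present[field] / total for field in sorted(fields)}


# Hypothetical sample records, for illustration only.
sample = [
    {"key": "/authors/OL1A", "name": "A", "bio": None},
    {"key": "/authors/OL2A", "name": "B"},
    {"key": "/authors/OL3A", "name": "C", "undocumented_field": 1},
    {"key": "/authors/OL4A", "name": None},
]
fractions = null_fraction(sample)
```

Fields with a fraction close to 1 are the ones that are almost exclusively `null`. Fields that appear in the result but not in the documented schema are the undocumented ones.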

Provider:

storytracer

Original information summary

OpenLibrary Dump (2024-04-30) dataset overview

Basic information

License: unknown
Dataset name: OpenLibrary Dump (2024-04-30)
Size: 10M<n<100M

Data configuration

Default config: dumps
- Data files:
  - authors: path "data/parquet/ol_dump_authors_2024-04-30.parquet"
  - works: path "data/parquet/ol_dump_works_2024-04-30.parquet"
  - editions: path "data/parquet/ol_dump_editions_2024-04-30.parquet"
Config: summaries
- Data files:
  - authors: path "summaries/parquet/ol_dump_authors_2024-04-30_summary.parquet"
  - works: path "summaries/parquet/ol_dump_works_2024-04-30_summary.parquet"
  - editions: path "summaries/parquet/ol_dump_editions_2024-04-30_summary.parquet"
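
The configs and splits listed above map directly onto the `load_dataset` call of the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed. The helper `load_split` and the `CONFIGS` table are hypothetical conveniences, while the repository ID, config names and split names are taken from the dataset card.

```python
REPO_ID = "storytracer/openlibrary_dump_2024-04-30"

# Config and split names as listed in the dataset card.
CONFIGS = {
    "dumps": ("authors", "works", "editions"),
    "summaries": ("authors", "works", "editions"),
}


def load_split(config, split, streaming=True):
    """Load one split of one config, validating the names first."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in CONFIGS[config]:
        raise ValueError(f"unknown split {split!r} for config {config!r}")
    # Imported lazily so that name validation works without the library.
    from datasets import load_dataset

    # streaming=True reads rows lazily instead of downloading the whole file.
    return load_dataset(REPO_ID, config, split=split, streaming=streaming)
```

For example, `load_split("summaries", "authors")` would return the summary table for the authors dump, which is small enough to inspect in full. Streaming is the default here because the `dumps` splits are large (the card lists 10M<n<100M rows).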

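Data formats

Original format: GZIP-compressed TSV files, with the original OL JSON record in the fifth column.
Converted formats:
- DuckDB: imported into data/duckdb/ol_dump_2024-04-30.duckdb using the script ol_duckdb.sh.
- Parquet: exported from DuckDB to the data/parquet directory.
- Table summaries: Markdown and Parquet files generated with the SUMMARIZE function, located in the summaries directory.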
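
For readers who want to work with the original GZIP dumps directly, the fifth-column layout can be read with the Python standard library alone. This is a minimal sketch and not a port of `ol_duckdb.sh`: the function name `iter_records` is hypothetical, the demo file and its rows are made up, and the first four columns are filled with placeholder values because only the fifth column matters here.

```python
import gzip
import json
import os
import tempfile


def iter_records(path):
    """Yield the parsed JSON record from the fifth column of each TSV row."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Split into at most five columns; the fifth holds the JSON text.
            columns = line.rstrip("\n").split("\t", 4)
            if len(columns) < 5:
                continue  # skip rows that do not have a fifth column
            try:
                yield json.loads(columns[4])
            except json.JSONDecodeError:
                continue  # skip rows whose fifth column is not valid JSON


# Build a tiny demo file with hypothetical rows.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
rows = [
    "c1\tc2\tc3\tc4\t" + json.dumps({"key": "/authors/OL1A", "name": "Example"}),
    "too\tfew\tcolumns",
    "c1\tc2\tc3\tc4\t{not valid json",
]
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("\n".join(rows) + "\n")

records = list(iter_records(demo_path))
```

Skipping malformed rows is a design choice of this sketch. Whether it matches the exact behavior of the `ignore_errors=true` import is not something the dataset card states.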

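Data consistency

Issue: the fields in the dataset are not fully consistent with the documentation. The dumps contain some undocumented fields, and some fields are almost entirely null.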


The above content was collected and summarized by 遇见数据集.
