storytracer/openlibrary_dump_2024-04-30

Source: Hugging Face. Updated 2024-05-26; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/storytracer/openlibrary_dump_2024-04-30
Description:
---
license: unknown
license_link: https://openlibrary.org/developers/licensing
pretty_name: OpenLibrary Dump (2024-04-30)
size_categories:
- 10M<n<100M
configs:
- config_name: dumps
  default: true
  data_files:
  - split: authors
    path: "data/parquet/ol_dump_authors_2024-04-30.parquet"
  - split: works
    path: "data/parquet/ol_dump_works_2024-04-30.parquet"
  - split: editions
    path: "data/parquet/ol_dump_editions_2024-04-30.parquet"
- config_name: summaries
  data_files:
  - split: authors
    path: "summaries/parquet/ol_dump_authors_2024-04-30_summary.parquet"
  - split: works
    path: "summaries/parquet/ol_dump_works_2024-04-30_summary.parquet"
  - split: editions
    path: "summaries/parquet/ol_dump_editions_2024-04-30_summary.parquet"
---

# OpenLibrary Dump (2024-04-30)

This dataset contains the [OpenLibrary dump](https://openlibrary.org/developers/dumps) of April 2024, converted to Parquet and DuckDB for easier querying.

## Formats

### Original GZIP dumps

The original GZIP dumps are available at [data/dumps](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/data/dumps). The dumps are gzipped TSV files in which the fifth column holds the original OL JSON record.

### DuckDB

The authors, works and editions dumps were imported as tables into [data/duckdb/ol_dump_2024-04-30.duckdb](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/blob/main/data/duckdb/ol_dump_2024-04-30.duckdb) using the script [ol_duckdb.sh](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/blob/main/ol_duckdb.sh). The script extracts the JSON record from the fifth column and pipes it to DuckDB, which imports it with the flags ```union_by_name=true``` and ```ignore_errors=true``` to account for the inconsistent JSON structure of the dumps.

### Parquet

The authors, works and editions tables were exported as Parquet from DuckDB to [data/parquet](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/data/parquet). These Parquet files make up the default ```dumps``` config of this dataset.

### Table Summaries

To give users an easy overview of the fields contained in the dump, the DuckDB tables have been summarized with DuckDB's ```SUMMARIZE``` statement. The results are stored as Markdown and Parquet files at [summaries](https://huggingface.co/datasets/storytracer/openlibrary_dump_2024-04-30/tree/main/summaries), with the summary columns ```min```, ```max``` and ```avg``` excluded for easier viewing. You can also explore the table summaries in the dataset viewer by selecting the ```summaries``` config.

The dump fields are supposed to be consistent with the schema referenced [in the documentation](https://openlibrary.org/developers/dumps#:~:text=Format%20of%20JSON%20records). However, the summaries show that they are not: the dumps contain some undocumented fields, and some fields contain almost exclusively ```null``` values.
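The effect of importing with `union_by_name=true` and `ignore_errors=true` can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not DuckDB's implementation and not the contents of `ol_duckdb.sh`; the function name `union_by_name` and the sample records are made up for the example.

```python
import json


def union_by_name(json_lines):
    """Parse JSON lines into rows that all share the union of observed keys."""
    records = []
    skipped = 0
    for line in json_lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # loosely analogous to ignore_errors=true
            continue
        if isinstance(obj, dict):
            records.append(obj)
        else:
            skipped += 1

    # Columns are collected in order of first appearance across all records.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)

    # Keys absent from a record become None, i.e. a null in the table.
    rows = [{col: rec.get(col) for col in columns} for rec in records]
    return columns, rows, skipped


# Made-up sample lines, not taken from the actual dump.
sample = [
    '{"key": "/authors/OL1A", "name": "Example Author"}',
    '{"key": "/authors/OL2A", "birth_date": "1900"}',
    '{not valid json',
]
columns, rows, skipped = union_by_name(sample)
print(columns)
print(rows[1])
print(skipped)
```

Because every record is widened to the full set of columns, a field that appears in only a handful of records ends up as a column that is null almost everywhere, which is consistent with what the summaries report.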
Provider:
storytracer
Summary of the original information

OpenLibrary Dump (2024-04-30) dataset overview

Basic information

  • License: unknown
  • Dataset name: OpenLibrary Dump (2024-04-30)
  • Size: 10M<n<100M

Data configs

  • Default config: dumps

    • Data files:
      • authors: path "data/parquet/ol_dump_authors_2024-04-30.parquet"
      • works: path "data/parquet/ol_dump_works_2024-04-30.parquet"
      • editions: path "data/parquet/ol_dump_editions_2024-04-30.parquet"
  • Config: summaries

    • Data files:
      • authors: path "summaries/parquet/ol_dump_authors_2024-04-30_summary.parquet"
      • works: path "summaries/parquet/ol_dump_works_2024-04-30_summary.parquet"
      • editions: path "summaries/parquet/ol_dump_editions_2024-04-30_summary.parquet"
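The config and split layout listed above can be written down as a small lookup table. The sketch below is one possible way to use it, not an official client: the helper names `file_url` and `load_split` are hypothetical, the `/resolve/main/` download URL pattern is an assumption about the Hub's usual layout, and `load_split` relies on the separate `datasets` package plus network access, so it is defined but not called here.

```python
REPO_ID = "storytracer/openlibrary_dump_2024-04-30"
DATE = "2024-04-30"
SPLITS = ("authors", "works", "editions")

# Config name -> split name -> file path inside the repository.
CONFIGS = {
    "dumps": {s: f"data/parquet/ol_dump_{s}_{DATE}.parquet" for s in SPLITS},
    "summaries": {
        s: f"summaries/parquet/ol_dump_{s}_{DATE}_summary.parquet" for s in SPLITS
    },
}


def file_url(config, split):
    """Build a direct file URL (assumes the Hub's /resolve/main/ pattern)."""
    path = CONFIGS[config][split]
    return f"https://huggingface.co/datasets/{REPO_ID}/resolve/main/{path}"


def load_split(config, split):
    """Load one split with the `datasets` library (needs network access)."""
    from datasets import load_dataset  # optional dependency, imported lazily

    return load_dataset(REPO_ID, config, split=split)


print(file_url("dumps", "authors"))
```

A call such as `load_split("summaries", "works")` would fetch only the small summary table, which is a cheap way to inspect the fields before downloading a full split.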

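Data formats

  • Original format: gzipped TSV files whose fifth column contains the original OL JSON record.
  • Converted formats:
    • DuckDB: imported into data/duckdb/ol_dump_2024-04-30.duckdb using the script ol_duckdb.sh
    • Parquet: exported from DuckDB to the data/parquet directory.
    • Table summaries: Markdown and Parquet files generated with SUMMARIZE, located in the summaries directory.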
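For readers who want to work with the original dumps directly, reading the JSON record out of the fifth column might look like the sketch below. It uses only the standard library. The function name `iter_ol_records` is hypothetical, and the sample row is made up: only the position of the JSON record (fifth column) comes from the dataset card, while the first four column values are placeholders.

```python
import gzip
import io
import json


def iter_ol_records(source):
    """Yield the JSON record stored in the fifth tab-separated column.

    `source` may be a path to a .gz file or a binary file object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            # Split at most 4 times so the fifth column keeps the whole record.
            columns = line.rstrip("\n").split("\t", 4)
            if len(columns) < 5:
                continue  # skip malformed lines instead of failing
            yield json.loads(columns[4])


# Build a tiny in-memory gzip file with one made-up row for demonstration.
record = {"key": "/authors/OL1A", "name": "Example Author"}
row = "\t".join(["col1", "col2", "col3", "col4", json.dumps(record)])
buffer = io.BytesIO(gzip.compress((row + "\n").encode("utf-8")))

records = list(iter_ol_records(buffer))
print(records[0]["name"])
```

Streaming line by line keeps memory use flat, which matters for files in the tens of millions of rows.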

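Data consistency

  • Issue: the fields in the dataset are not fully consistent with the documentation. The dumps contain some undocumented fields, and some fields are almost entirely null.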
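Both symptoms can be checked on any sample of records with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: the rows are made up, the `DOCUMENTED` set is a hypothetical stand-in for the documented schema (it is not copied from the OpenLibrary documentation), and the helper names are invented for the example.

```python
def null_share(rows):
    """Return, per field, the fraction of rows where it is missing or None."""
    fields = sorted({key for row in rows for key in row})
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) is None) / total
        for field in fields
    }


def undocumented_fields(rows, documented):
    """Return fields that occur in the data but not in the documented set."""
    observed = {key for row in rows for key in row}
    return sorted(observed - set(documented))


# Made-up sample rows, not taken from the actual dump.
rows = [
    {"key": "/works/OL1W", "title": "A", "notes": None},
    {"key": "/works/OL2W", "title": "B", "extra_flag": True},
    {"key": "/works/OL3W", "title": None},
    {"key": "/works/OL4W", "title": "D"},
]
DOCUMENTED = {"key", "title", "notes"}  # hypothetical schema for illustration

shares = null_share(rows)
print(shares)
print(undocumented_fields(rows, DOCUMENTED))
```

On the real data, the same questions can be answered from the files in the summaries directory without scanning the full tables.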