KennyChowww/OpenAlex-Articles-and-Harvard-Library-Item-Data

Name: KennyChowww/OpenAlex-Articles-and-Harvard-Library-Item-Data
Creator: KennyChowww
Published: 2026-04-20 09:33:57
License: No description available

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/KennyChowww/OpenAlex-Articles-and-Harvard-Library-Item-Data


Resource description:

---
pretty_name: "OpenAlex Articles + Harvard Library Item Data"
license: "cc0-1.0"
multilinguality: multilingual
task_categories:
- other
tags:
- libraries
- bibliographic-metadata
- openalex
- parquet
- tabular
size_categories:
- 100M<n<1B
configs:
- config_name: openalex_articles_and_Harvard_Library_Item_Meta_Data
  data_files:
  - split: train
    path: "data/*.parquet"
- config_name: harvard_library_item_location
  data_files:
  - split: train
    path: "harvard_library_item_location/*.parquet"
---

# OpenAlex Articles + Harvard Library Item Data

## Dataset Description

This repository contains two large tabular subsets built from **Harvard Library Bibliographic Metadata** and an **OpenAlex snapshot**.

These files are **derived, filtered, and processed datasets**, not full reproductions of the original source data. The records are based on real source metadata, but the dataset was prepared mainly for **library data system operations testing**, large-scale experimentation, and workflow development.

The main subset is best understood as a **bibliographic/title-level table** enriched for testing. A single row represents a bibliographic record or work-level description, and the `copy_count` field gives the number of Harvard-linked copies/items associated with that record, so one title can correspond to multiple physical copies.

The `harvard_library_item_location` subset is an **item-level / copy-level** table derived from Harvard source data. It records the physical location associated with each individual library item.

## Source Data

This dataset is derived from:

- **Harvard Library Bibliographic Metadata**
- **OpenAlex snapshot**

Please refer to the original source projects for authoritative documentation, provenance, and update practices.

## Subsets

### 1. `openalex_articles_with_harvard_library.parquet`

Large bibliographic-level dataset combining OpenAlex-style descriptive metadata with Harvard-derived library testing fields.

**Columns**

- `id`: record identifier
- `title`: title of the work
- `author`: author name(s)
- `language`: language code
- `publication_year`: publication year
- `publisher`: Harvard: publisher | OpenAlex: source name
- `isbn`: ISBN. Harvard: present when available | OpenAlex: always missing (records are not of BOOK type)
- `doi`: DOI. OpenAlex: present when available
- `copy_count`: Harvard: number of linked copies/items | OpenAlex: always missing
- `type`: Harvard: BOOK, DVD, VIDEO, etc. | OpenAlex: always "journal article"
- `is_physical`: Harvard: whether the item is physical (`true`) | OpenAlex: always `false`

### 2. `harvard_library_item_location.parquet`

Item-level Harvard-derived location dataset showing where individual physical items are stored.

**Columns**

- `record_id`: record identifier; the key for merging back to `openalex_articles_with_harvard_library.parquet`
- `barcode`: unique identifier for the item/copy
- `library_branch`: library or branch code associated with the item
- `location`: location code within that branch, such as a shelving, storage, or service location
- `item_type`: BOOK, DVD, VIDEO, etc.

## Intended Use

This dataset is intended for:

- testing library data pipelines
- large-scale query and indexing experiments
- metadata transformation and enrichment workflows
- prototyping discovery, retrieval, and analytics systems

This dataset is **not** intended to serve as a complete or authoritative replacement for the original Harvard Library or OpenAlex source distributions.

## Notes

- This repository contains **processed subsets**, not the complete upstream datasets.
- Coverage, field normalization, and record selection reflect project-specific filtering and transformation steps.
- Null or missing values may appear in fields such as `isbn`, `doi`, or `publisher`.
- The source metadata is real, but this packaged form is optimized for testing and experimentation rather than archival completeness.
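
The card describes `record_id` in the item-location subset as the key for merging back to the title-level table. A minimal sketch of that merge in plain Python follows. It assumes that `record_id` matches the `id` column of the main subset, which the card implies but does not state outright. The rows are hypothetical toy data, not real records, and the helper names `count_items_per_record` and `attach_item_counts` are illustrative only. Whether `copy_count` always equals the number of item rows is also an assumption worth testing, since both subsets are filtered.

```python
from collections import Counter


def count_items_per_record(item_rows):
    """Count item-level rows for each record_id."""
    return Counter(row["record_id"] for row in item_rows)


def attach_item_counts(title_rows, item_rows):
    """Left-join item counts onto title-level rows.

    Assumes item_rows[*]["record_id"] corresponds to title_rows[*]["id"].
    """
    counts = count_items_per_record(item_rows)
    merged = []
    for row in title_rows:
        n_items = counts.get(row["id"], 0)
        copy_count = row.get("copy_count")
        merged.append({
            **row,
            "item_rows": n_items,
            # A missing copy_count (as on OpenAlex rows) leaves nothing to compare.
            "count_matches": None if copy_count is None else copy_count == n_items,
        })
    return merged


# Hypothetical toy rows shaped like the two subsets (not real data).
title_rows = [
    {"id": "H1", "title": "Example Book", "copy_count": 2, "is_physical": True},
    {"id": "A1", "title": "Example Article", "copy_count": None, "is_physical": False},
]
item_rows = [
    {"record_id": "H1", "barcode": "B001", "library_branch": "BR1",
     "location": "LOC1", "item_type": "BOOK"},
    {"record_id": "H1", "barcode": "B002", "library_branch": "BR2",
     "location": "LOC2", "item_type": "BOOK"},
]

merged = attach_item_counts(title_rows, item_rows)
for row in merged:
    print(row["id"], row["item_rows"], row["count_matches"])
```

At the dataset's stated size (100M to 1B rows), an in-memory join like this is only suitable for samples. A columnar engine or chunked processing over the Parquet files would be the more realistic choice for the full tables.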

Provided by:

KennyChowww
