KennyChowww/OpenAlex-Articles-and-Harvard-Library-Item-Data
Source: Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/KennyChowww/OpenAlex-Articles-and-Harvard-Library-Item-Data
Resource description:
---
pretty_name: "OpenAlex Articles + Harvard Library Item Data"
license: "cc0-1.0"
multilinguality: multilingual
task_categories:
- other
tags:
- libraries
- bibliographic-metadata
- openalex
- parquet
- tabular
size_categories:
- 100M<n<1B
configs:
- config_name: openalex_articles_and_Harvard_Library_Item_Meta_Data
  data_files:
  - split: train
    path: "data/*.parquet"
- config_name: harvard_library_item_location
  data_files:
  - split: train
    path: "harvard_library_item_location/*.parquet"
---
# OpenAlex Articles + Harvard Library Item Data
## Dataset Description
This repository contains two large tabular subsets built from **Harvard Library Bibliographic Metadata** and an **OpenAlex snapshot**.
These files are **derived, filtered, and processed datasets**, not full reproductions of the original source data. The records are based on real source metadata, but the dataset was prepared mainly for **library data system operations testing**, large-scale experimentation, and workflow development.
The main subset is best understood as a **bibliographic/title-level table** enriched for testing. Each row represents a bibliographic record or work-level description, and the `copy_count` field gives the number of Harvard-linked copies/items associated with that record, so one title can correspond to multiple physical copies.
The `harvard_library_item_location` subset is a more **item-level / copy-level** table derived from Harvard source data, intended to show the physical location associated with individual library items.
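As a minimal sketch of reading the two subsets described above, the snippet below streams a few rows from one config using the Hugging Face `datasets` library. The repository ID and config names are taken from this card; the helper name `stream_config` and the `limit` parameter are hypothetical, and the code assumes `datasets` is installed and the repository is reachable.

```python
# Repository ID and config names as listed in this card's metadata.
REPO_ID = "KennyChowww/OpenAlex-Articles-and-Harvard-Library-Item-Data"
CONFIGS = (
    "openalex_articles_and_Harvard_Library_Item_Meta_Data",
    "harvard_library_item_location",
)


def stream_config(config_name, limit=5):
    """Return the first `limit` rows of one config as a list of dicts.

    Hypothetical helper: streaming avoids downloading every Parquet file
    just to inspect a handful of rows.
    """
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")

    # Imported lazily so this module loads even without the dependency.
    from datasets import load_dataset  # third-party: pip install datasets

    rows = load_dataset(REPO_ID, config_name, split="train", streaming=True)
    return list(rows.take(limit))
```

Calling `stream_config("harvard_library_item_location")` should yield a short list of item-level rows. Given the size category declared in the metadata, streaming is probably the safer default for a first look.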
## Source Data
This dataset is derived from:
- **Harvard Library Bibliographic Metadata**
- **OpenAlex snapshot**
Please refer to the original source projects for authoritative documentation, provenance, and update practices.
## Subsets
### 1. `openalex_articles_with_harvard_library.parquet`
Large bibliographic-level dataset combining OpenAlex-style descriptive metadata with Harvard-derived library testing fields.
**Columns**
- `id` — record identifier
- `title` — title of the work
- `author` — author name(s)
- `language` — language code
- `publication_year` — publication year
- `publisher` — Harvard: publisher | OpenAlex: source name
- `isbn` — ISBN. Harvard: when available | OpenAlex: all missing (not BOOK type)
- `doi` — DOI. OpenAlex: when available
- `copy_count` — Harvard: number of linked copies/items | OpenAlex: all missing
- `type` — Harvard: BOOK, DVD, VIDEO, etc. | OpenAlex: all "journal article"
- `is_physical` — Harvard: whether the item is physical (`true`) | OpenAlex: all `false`
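The column list does not include an explicit provenance field, so the per-source values documented above are the only hint for telling OpenAlex rows from Harvard rows. The sketch below is one possible heuristic, not part of the dataset: the function name `infer_source` and the sample rows are hypothetical, and it assumes missing values arrive as `None`. With pandas, missing values may appear as `NaN` instead and would need a different check.

```python
def infer_source(row):
    """Guess the upstream source of one bibliographic row.

    Heuristic only: per the column notes, OpenAlex rows are all
    "journal article", never physical, and have no copy_count.
    Anything else is treated as Harvard-derived.
    """
    looks_like_openalex = (
        row.get("type") == "journal article"
        and row.get("is_physical") is False
        and row.get("copy_count") is None
    )
    return "openalex" if looks_like_openalex else "harvard"


# Hypothetical rows for illustration; not taken from the dataset.
sample_rows = [
    {"id": "a1", "type": "journal article", "is_physical": False, "copy_count": None},
    {"id": "b2", "type": "BOOK", "is_physical": True, "copy_count": 3},
]
sources = [infer_source(r) for r in sample_rows]
```

A Harvard record that matches all three conditions would be misclassified, so this is best treated as a quick partitioning aid for testing.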
### 2. `harvard_library_item_location.parquet`
Item-level Harvard-derived location dataset showing where individual physical items are stored.
**Columns**
- `record_id` — record identifier; the key for merging back to `openalex_articles_with_harvard_library.parquet`
- `barcode` — unique identifier for the item/copy
- `library_branch` — library or branch code associated with the item
- `location` — location code within that branch, such as a shelving, storage, or service location
- `item_type` — BOOK, DVD, VIDEO, etc.
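The merge described for `record_id` can be sketched with plain Python dictionaries. This assumes `record_id` matches the `id` column of the bibliographic table, which the card implies but does not state outright. It also assumes, purely as a sanity check, that `copy_count` equals the number of linked item rows. The function name `attach_locations` and all sample values are hypothetical.

```python
from collections import defaultdict


def attach_locations(records, items):
    """Group item-level rows under their bibliographic record.

    Assumption: items[*]["record_id"] refers to records[*]["id"].
    """
    by_record = defaultdict(list)
    for item in items:
        by_record[item["record_id"]].append(item)

    joined = []
    for rec in records:
        linked = by_record.get(rec["id"], [])
        expected = rec.get("copy_count") or 0  # treat a missing count as 0
        joined.append(
            {**rec, "items": linked, "copy_count_matches": expected == len(linked)}
        )
    return joined


# Hypothetical rows for illustration; not taken from the dataset.
records = [
    {"id": "r1", "title": "Example title", "copy_count": 2},
    {"id": "r2", "title": "Example article", "copy_count": None},
]
items = [
    {"record_id": "r1", "barcode": "0001", "library_branch": "AAA",
     "location": "STACKS", "item_type": "BOOK"},
    {"record_id": "r1", "barcode": "0002", "library_branch": "BBB",
     "location": "STORAGE", "item_type": "BOOK"},
]
joined = attach_locations(records, items)
```

At the scale declared for this dataset, the same join would more realistically run in a query engine or dataframe library. The dictionary version only shows the one-to-many shape of the relationship.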
## Intended Use
This dataset is intended for:
- testing library data pipelines
- large-scale query and indexing experiments
- metadata transformation and enrichment workflows
- prototyping discovery, retrieval, and analytics systems
This dataset is **not** intended to serve as a complete or authoritative replacement for the original Harvard Library or OpenAlex source distributions.
## Notes
- This repository contains **processed subsets**, not the complete upstream datasets.
- Coverage, field normalization, and record selection reflect project-specific filtering and transformation steps.
- Null or missing values may appear in fields such as `isbn`, `doi`, or `publisher`.
- The source metadata is real, but this packaged form is optimized for testing and experimentation rather than archival completeness.
Provider:
KennyChowww



