museado/digitalcommonwealth-data
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/museado/digitalcommonwealth-data
Description:
---
language:
- en
license: other
pretty_name: Digital Commonwealth Collection Archive
tags:
- digitalcommonwealth
- archives
- cultural-heritage
- jsonl
- metadata
viewer: false
task_categories:
- other
size_categories:
- 10K<n<100K
---
# Digital Commonwealth Collection Archive
This dataset is a bulk archive of collection-level JSONL exports derived from Digital Commonwealth record listings.
Why this exists:
- Digital Commonwealth exposes rich record data, but not as a simple bulk-download page for all collections.
- Repeatedly harvesting the public listing or API for every downstream use creates unnecessary load and requires thousands of requests.
- This archive provides stable, reusable collection files so Museado and others can work from a local or mirrored copy instead of rebuilding the corpus each time.
- Only records with reusable/public rights, or with no stated reuse value, are archived here. Rows with an explicit restricted reuse state such as `contact host` or `all rights reserved` are filtered out during archive creation.
Contents:
- `*.jsonl.gz`: one gzip-compressed JSONL file per collection
- `datasets.jsonl`: collection manifest with dataset keys, labels, archive filenames, counts, and source links
- `institutions.jsonl`: institution-level summary manifest
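As a minimal sketch of how these files might be read, the helper below streams records from either a plain `.jsonl` manifest or a gzip-compressed `.jsonl.gz` collection file. The function name `iter_jsonl` is illustrative and not part of the archive tooling, and the sketch assumes each non-empty line holds one UTF-8 JSON object.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a .jsonl or .jsonl.gz file."""
    # Choose the gzip opener for compressed collection files,
    # and the built-in opener for plain manifests.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `for entry in iter_jsonl(Path("datasets.jsonl")): print(entry)` would print each manifest row without loading the whole file into memory.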
Upstream:
- Main site: https://www.digitalcommonwealth.org/
- API / developer information: https://www.digitalcommonwealth.org/api
- The OAI-PMH endpoint and collection metadata are also used during the collection mapping process.
Methodology:
1. Harvest provider-wide listing pages from Digital Commonwealth.
2. Parse each page into JSONL shards.
3. Inspect the Digital Commonwealth field `attributes.reuse_allowed_ssi` for each record.
4. Keep rows where `reuse_allowed_ssi` is missing, `no restrictions`, or `creative commons`.
5. Exclude rows where `reuse_allowed_ssi` is explicitly restricted, including values such as `contact host` and `all rights reserved`.
6. Split the remaining records by collection code.
7. Write one compressed JSONL archive per collection.
8. Emit `datasets.jsonl` and `institutions.jsonl` so downstream systems can bootstrap local datasets without refetching the global listing.
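Steps 6 and 7 can be sketched roughly as follows. This is not the archive's actual implementation: the function name `write_collection_archives`, the `<code>.jsonl.gz` naming scheme, and the in-memory grouping are all assumptions made for illustration. Because this README does not name the field that carries the collection code, the caller supplies it through the hypothetical `collection_code` callback.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path
from typing import Callable, Iterable


def write_collection_archives(
    records: Iterable[dict],
    collection_code: Callable[[dict], str],
    out_dir: Path,
) -> dict[str, int]:
    """Group records by collection code and write one .jsonl.gz per group.

    Returns a mapping of collection code to the number of records written.
    """
    # Assumption: the filtered records fit in memory while being grouped.
    grouped: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        grouped[collection_code(record)].append(record)

    out_dir.mkdir(parents=True, exist_ok=True)
    counts: dict[str, int] = {}
    for code, rows in grouped.items():
        # Assumption: archives are named after the collection code.
        target = out_dir / f"{code}.jsonl.gz"
        with gzip.open(target, "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        counts[code] = len(rows)
    return counts
```

The returned counts are the kind of per-collection figure a manifest such as `datasets.jsonl` could record.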
Filtering details:
- Field inspected: `attributes.reuse_allowed_ssi`
- Included values:
  - `no restrictions`
  - `creative commons`
  - missing / blank
- Excluded values:
  - `contact host`
  - `all rights reserved`
  - any other explicit non-open reuse state
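The rule above amounts to an allowlist check, which might look like the sketch below. It assumes each record is a JSON object with a nested `attributes` object, as the field path `attributes.reuse_allowed_ssi` suggests. The name `is_reusable`, the case-insensitive comparison, and the whitespace trimming are assumptions rather than documented behavior.

```python
from typing import Any, Mapping

# Values treated as open for reuse, per the list above.
INCLUDED_VALUES = {"no restrictions", "creative commons"}


def is_reusable(record: Mapping[str, Any]) -> bool:
    """Return True when a record passes the reuse filter."""
    attributes = record.get("attributes") or {}
    value = attributes.get("reuse_allowed_ssi")
    if value is None:
        # A missing field counts as included.
        return True
    # Assumption: values are compared after trimming and lowercasing.
    text = str(value).strip().lower()
    # Blank counts as included; any other value must be on the allowlist,
    # so unlisted explicit states are excluded by default.
    return text == "" or text in INCLUDED_VALUES
```

Using an allowlist rather than a blocklist means a new, unrecognized reuse state is dropped instead of being archived by accident.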
Typical downstream use:
- mirror this archive
- materialize local dataset directories
- symlink each collection archive into `05_raw/obj.jsonl.gz`
- normalize and index from the local raw symlink
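The symlink step could be sketched as below. The helper name `link_collection` is hypothetical; the caller is expected to take the archive filename and the dataset directory from `datasets.jsonl`, whose exact field names are not listed in this README and are therefore not guessed here.

```python
from pathlib import Path


def link_collection(archive_file: Path, dataset_dir: Path) -> Path:
    """Symlink one collection archive to <dataset_dir>/05_raw/obj.jsonl.gz."""
    raw_dir = dataset_dir / "05_raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    link = raw_dir / "obj.jsonl.gz"
    # Replace a stale link or file so the step can be re-run safely.
    if link.is_symlink() or link.exists():
        link.unlink()
    # Point at the absolute path so the link survives a change of
    # working directory.
    link.symlink_to(archive_file.resolve())
    return link
```

Because the link is recreated on every call, re-running the bootstrap after mirroring a newer archive simply repoints each dataset at the fresh file.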
Bootstrap options:
- Maintainers with the provider-wide listing pages can rebuild the archive locally with `dc:global:archive`.
- Developers and production systems that do not have the listing-page source should start by fetching this published archive from Hugging Face, then materialize local datasets from `datasets.jsonl` with the symlink/bootstrap step.
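For the second option, one possible way to fetch the published archive is the `snapshot_download` function from the `huggingface_hub` Python package, sketched below. The use of that package is an assumption, not something this project prescribes, and the wrapper name `fetch_archive` is illustrative. The repository ID is the one this dataset is published under.

```python
from pathlib import Path


def fetch_archive(
    local_dir: Path,
    repo_id: str = "museado/digitalcommonwealth-data",
) -> Path:
    """Download the published dataset repository into local_dir."""
    # Imported lazily so this module loads even when huggingface_hub
    # is not installed (assumption: `pip install huggingface_hub`).
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",  # the archive is a dataset repo, not a model
        local_dir=str(local_dir),
    )
    return Path(path)
```

After the download completes, `datasets.jsonl` in the returned directory can drive the symlink/bootstrap step.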
Maintainer note:
To refresh this archive from a fast connection, run:
```bash
bin/console dc:global:fetch
bin/console dc:global:archive
```
Provider:
museado



