
museado/digitalcommonwealth-data

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/museado/digitalcommonwealth-data
Description:
---
language:
- en
license: other
pretty_name: Digital Commonwealth Collection Archive
tags:
- digitalcommonwealth
- archives
- cultural-heritage
- jsonl
- metadata
viewer: false
task_categories:
- other
size_categories:
- 10K<n<100K
---

# Digital Commonwealth Collection Archive

This dataset is a bulk archive of collection-level JSONL exports derived from Digital Commonwealth record listings.

Why this exists:

- Digital Commonwealth exposes rich record data, but not as a simple bulk-download page for all collections.
- Repeatedly harvesting the public listing or API for every downstream use creates unnecessary load and requires thousands of requests.
- This archive provides stable, reusable collection files so Museado and others can work from a local or mirrored copy instead of rebuilding the corpus each time.
- Only records with reusable/public rights are archived here. Rows with explicit restricted reuse states such as `contact host` or `all rights reserved` are filtered out during archive creation.

Contents:

- `*.jsonl.gz`: one gzip-compressed JSONL file per collection
- `datasets.jsonl`: collection manifest with dataset keys, labels, archive filenames, counts, and source links
- `institutions.jsonl`: institution-level summary manifest

Upstream:

- Main site: https://www.digitalcommonwealth.org/
- API / developer information: https://www.digitalcommonwealth.org/api
- OAI-PMH endpoint and collection metadata are also used during the collection mapping process

Methodology:

1. Harvest provider-wide listing pages from Digital Commonwealth.
2. Parse each page into JSONL shards.
3. Inspect the Digital Commonwealth field `attributes.reuse_allowed_ssi` for each record.
4. Keep rows where `reuse_allowed_ssi` is missing, `no restrictions`, or `creative commons`.
5. Exclude rows where `reuse_allowed_ssi` is explicitly restricted, including values such as `contact host` and `all rights reserved`.
6. Split the remaining records by collection code.
7. Write one compressed JSONL archive per collection.
8. Emit `datasets.jsonl` and `institutions.jsonl` so downstream systems can bootstrap local datasets without refetching the global listing.

Filtering details:

- Field inspected: `attributes.reuse_allowed_ssi`
- Included values:
  - `no restrictions`
  - `creative commons`
  - missing / blank
- Excluded values:
  - `contact host`
  - `all rights reserved`
  - any other explicit non-open reuse state

Typical downstream use:

- mirror this archive
- materialize local dataset directories
- symlink each collection archive into `05_raw/obj.jsonl.gz`
- normalize and index from the local raw symlink

Bootstrap options:

- Maintainers with the provider-wide listing pages can rebuild the archive locally with `dc:global:archive`.
- Developers and production systems that do not have the listing-page source should start by fetching this published archive from Hugging Face, then materialize local datasets from `datasets.jsonl` with the symlink/bootstrap step.

Maintainer note:

To refresh this archive from a fast connection, run:

```bash
bin/console dc:global:fetch
bin/console dc:global:archive
```
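For consumers who want to re-check the reuse filter described under "Filtering details" on a downloaded collection file, the rule can be sketched in Python as below. This is an illustration, not the archive's actual implementation. It assumes each JSONL row is an object with a top-level `attributes` key holding `reuse_allowed_ssi` as a string, and that comparison is case-insensitive. The helper names `is_reusable` and `iter_records` are hypothetical.

```python
import gzip
import json
from typing import Iterator

# Values the README lists as included; a missing or blank value is also kept.
OPEN_VALUES = {"no restrictions", "creative commons"}


def is_reusable(record: dict) -> bool:
    """Return True when a record passes the reuse filter described above.

    Assumption: the field lives at record["attributes"]["reuse_allowed_ssi"].
    """
    attributes = record.get("attributes") or {}
    value = attributes.get("reuse_allowed_ssi")
    if value is None:
        return True  # missing value is treated as included
    normalized = str(value).strip().lower()
    # Blank is included; any other explicit value outside OPEN_VALUES is excluded.
    return normalized == "" or normalized in OPEN_VALUES


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the published archive is already filtered, running `is_reusable` over `iter_records(...)` for a collection file should keep every row; a row that fails would suggest the assumed record shape does not match the real one.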
Provider:
museado