AndreaBozzo/ceres-open-data-index
Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/AndreaBozzo/ceres-open-data-index
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- feature-extraction
language:
- it
- uk
- en
- de
- fr
- es
- ro
- ja
tags:
- open-data
- ckan
- dcat
- government
- metadata
- geospatial
pretty_name: Ceres Open Data Index
size_categories:
- 100K<n<1M
---
# Ceres Open Data Index
A curated index of **890,143 open datasets** from **32 open data portals** across 14 countries plus international sources, with cross-portal duplicates flagged rather than removed. This is the largest aggregated open data metadata index available as a single downloadable resource.
## Dataset Description
This dataset contains normalized metadata from government and institutional open data portals. It was harvested by [Ceres](https://github.com/AndreaBozzo/Ceres), a search engine for open data, via CKAN and DCAT-AP APIs. Metadata has been flattened into a tabular schema, with noise filtered and cross-portal duplicates flagged.
### What's included
- Dataset titles, descriptions, URLs, tags, organizations, and licenses from 32 open data portals
- Cross-portal duplicate detection (`is_duplicate` flag)
- Portal-level language annotations
- Per-portal Parquet subsets for targeted analysis
### What's NOT included
- The actual data files (CSVs, shapefiles, etc.) -- only metadata
- Embedding vectors (model-specific; regenerate with any embedding model)
- Resource-level metadata (individual file URLs within datasets)
## Dataset Structure
### Schema
| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `original_id` | string | no | Dataset ID from the source portal |
| `source_portal` | string | no | Portal base URL |
| `portal_name` | string | no | Human-readable portal name |
| `url` | string | no | Direct URL to the dataset page |
| `title` | string | no | Dataset title |
| `description` | string | yes | Dataset description |
| `tags` | string | yes | Comma-separated tag names |
| `organization` | string | yes | Publishing organization |
| `license` | string | yes | License title or identifier |
| `metadata_created` | string | yes | Original creation date (ISO 8601) |
| `metadata_modified` | string | yes | Last modification date (ISO 8601) |
| `first_seen_at` | string | no | When Ceres first indexed this dataset (RFC 3339) |
| `language` | string | yes | Primary language code (e.g. `it`, `en`, `de`) |
| `is_duplicate` | boolean | no | True if the same title (case-insensitive) appears in more than one portal |
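Because `tags` is stored as one comma-separated string rather than a list, it usually needs splitting before analysis. The following is a minimal sketch, assuming tags are separated by plain commas with optional surrounding whitespace and that missing values arrive as `None`. The helper name `split_tags` is illustrative and not part of Ceres.

```python
from typing import Optional


def split_tags(raw: Optional[str]) -> list[str]:
    """Split the comma-separated `tags` column into a clean list of tag names."""
    # The column is nullable: treat None, non-strings, and "" as "no tags".
    if not isinstance(raw, str) or not raw:
        return []
    # Trim whitespace and skip empty fragments such as the gap in "a,,b".
    return [part.strip() for part in raw.split(",") if part.strip()]


print(split_tags("open-data, ckan ,geospatial"))
print(split_tags(None))
```

With pandas, this could be applied as `df["tags"].apply(split_tags)`. Tags that themselves contain commas cannot be recovered with this approach.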
### Files
- `all.parquet` -- Complete dataset (890,143 rows)
- `data/<portal>.parquet` -- Per-portal subsets (32 files)
- `metadata.json` -- Snapshot metadata with counts
### Splits by portal
| Portal | Country | Datasets |
|--------|---------|----------|
| catalog.data.gov | USA | 399,462 |
| data.gov.au | Australia | 118,546 |
| data.gouv.fr | France | 72,208 |
| dati.gov.it | Italy | 69,603 |
| open.canada.ca | Canada | 46,683 |
| data.gov.ua | Ukraine | 40,055 |
| data.humdata.org | International | 26,842 |
| ckan.open.nrw.de | Germany | 23,353 |
| data.gov.ie | Ireland | 20,677 |
| ckan.opendata.swiss | Switzerland | 12,931 |
| dati.toscana.it | Italy | 12,675 |
| catalog.data.metro.tokyo.lg.jp | Japan | 8,609 |
| discover.data.vic.gov.au | Australia | 5,451 |
| dati.regione.marche.it | Italy | 5,427 |
| data.gov.ro | Romania | 4,880 |
| catalogue.data.gov.bc.ca | Canada | 3,334 |
| dati.emilia-romagna.it | Italy | 3,118 |
| opendata.aragon.es | Spain | 2,882 |
| datos.gob.cl | Chile | 2,815 |
| dati.comune.milano.it | Italy | 2,580 |
| data.public.lu | Luxembourg | 2,505 |
| dati.puglia.it | Italy | 1,782 |
| dati.trentino.it | Italy | 1,377 |
| dati.regione.umbria.it | Italy | 448 |
| dati.lazio.it | Italy | 403 |
| dati.comune.roma.it | Italy | 365 |
| dati.regione.campania.it | Italy | 351 |
| opendata-hro.de | Germany | 280 |
| dati.regione.sicilia.it | Italy | 186 |
| dati.comune.genova.it | Italy | 158 |
| dati.regione.liguria.it | Italy | 124 |
| dati.comune.napoli.it | Italy | 33 |
### Loading the dataset
```python
import pandas as pd
# Load all datasets
df = pd.read_parquet("all.parquet")
# Load a specific portal
df_milano = pd.read_parquet("data/milano.parquet")
# Filter non-duplicates
unique = df[~df["is_duplicate"]]
```
Or with the `datasets` library:
```python
from datasets import load_dataset
ds = load_dataset("AndreaBozzo/ceres-open-data-index")
```
## Dataset Creation
### Harvesting methodology
1. **Discovery**: Each portal's CKAN API (`/api/3/action/package_list`) provides the full list of dataset IDs
2. **Fetching**: Dataset metadata is retrieved via `/api/3/action/package_show` with concurrent requests (10 parallel, circuit breaker for resilience)
3. **Normalization**: CKAN package JSON is normalized into a flat schema with title, description, URL, and content hash (SHA-256)
4. **Incremental sync**: Only new or modified datasets are re-fetched on subsequent runs, using content hashes and CKAN's `package_search?fq=metadata_modified:[timestamp TO *]`
5. **Export**: The `ceres export --format parquet` command streams all datasets from PostgreSQL, applies curation filters, flattens JSONB metadata, detects cross-portal duplicates, and writes compressed Parquet files
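The discovery, fetching, and incremental-sync steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Ceres implementation: the helper names, the placeholder portal URL, and the choice of hashed fields (`title`, `notes`, `url`) are all hypothetical, and the concurrency and circuit-breaker logic is omitted.

```python
import hashlib
import json
from typing import Optional
from urllib.parse import urlencode


def action_url(portal: str, action: str, **params: str) -> str:
    """Build a CKAN Action API URL, e.g. <portal>/api/3/action/package_show?id=..."""
    base = f"{portal.rstrip('/')}/api/3/action/{action}"
    return f"{base}?{urlencode(params)}" if params else base


def modified_since_url(portal: str, timestamp: str) -> str:
    """Incremental sync: ask package_search for datasets modified since `timestamp`."""
    return action_url(
        portal, "package_search", fq=f"metadata_modified:[{timestamp} TO *]"
    )


def content_hash(package: dict) -> str:
    """SHA-256 of a canonical JSON form of the hashed fields (field choice is assumed)."""
    fields = {key: package.get(key) for key in ("title", "notes", "url")}
    canonical = json.dumps(fields, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(package: dict, stored_hash: Optional[str]) -> bool:
    """True when the dataset is new (no stored hash) or its hashed content differs."""
    return stored_hash != content_hash(package)


portal = "https://portal.example.org"  # hypothetical portal base URL
print(action_url(portal, "package_list"))
print(action_url(portal, "package_show", id="some-dataset-id"))
print(modified_since_url(portal, "2026-03-01T00:00:00Z"))
```

Serializing with `sort_keys=True` keeps the hash stable when a portal returns the same fields in a different key order, which is the property an incremental sync relies on.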
### Curation
The following curation steps are applied during export:
- **Noise filtering** (10,331 rows removed): Titles shorter than 5 characters, titles containing "test"/"prova"/"esempio", datasets with empty descriptions
- **Duplicate flagging** (75,527 rows marked): Datasets whose title (case-insensitive) appears in more than one portal are flagged with `is_duplicate=true` but kept in the dataset
- **Metadata flattening**: Tags, organization, and license are extracted from nested CKAN JSON into top-level string columns
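The noise filter and duplicate flag described above can be approximated as follows. This is a sketch under assumptions: the keyword check is treated as a case-insensitive substring match, title length is measured after trimming whitespace, and the function names are illustrative. The actual rules in `ceres export` may differ.

```python
from collections import defaultdict
from typing import Optional

NOISE_WORDS = ("test", "prova", "esempio")


def is_noise(title: str, description: Optional[str]) -> bool:
    """Return True for rows the noise filter would drop."""
    normalized = title.strip().lower()
    if len(normalized) < 5:  # very short titles
        return True
    if any(word in normalized for word in NOISE_WORDS):  # placeholder datasets
        return True
    return not (description or "").strip()  # missing or blank description


def flag_duplicates(rows: list[dict]) -> list[dict]:
    """Set is_duplicate on rows whose case-insensitive title occurs in more than one portal."""
    portals_by_title = defaultdict(set)
    for row in rows:
        portals_by_title[row["title"].casefold()].add(row["source_portal"])
    for row in rows:
        row["is_duplicate"] = len(portals_by_title[row["title"].casefold()]) > 1
    return rows


rows = [
    {"title": "Air Quality 2024", "source_portal": "portal-a"},
    {"title": "air quality 2024", "source_portal": "portal-b"},
    {"title": "Bus stops", "source_portal": "portal-a"},
]
for row in flag_duplicates(rows):
    print(row["title"], row["is_duplicate"])
```

Note that a plain substring match also catches unrelated titles such as "Latest census", so a word-boundary match may be preferable if false positives matter. Repeated titles within a single portal are not flagged, matching the "more than one portal" wording.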
### Source data
All data comes from publicly accessible CKAN and DCAT-AP API endpoints. No authentication is required. The portals are government-operated or institutional open data catalogs.
### Update frequency
This snapshot was generated on **2026-03-30**. The Ceres harvester can refresh the index at any time; new snapshots are published monthly.
## Considerations
### Known biases
- **Geographic skew**: The USA (catalog.data.gov) is the largest single portal (~45% of rows), followed by Australia (~14% across federal + Victoria), Italy (~11% across all Italian portals), and France (~8%)
- **Language distribution**: Predominantly English (USA, Australia, Canada, Ireland, HDX) and Italian, with significant French (France, Luxembourg, Canada), Ukrainian, German, Japanese, Spanish, and Romanian content
- **Portal selection bias**: Primarily CKAN-based portals, with some DCAT-AP portals (France, Luxembourg). Major platforms built on Socrata or custom APIs are not yet represented
- **Aggregation duplicates**: dati.gov.it aggregates regional Italian data, which contributes to the ~8% of rows (75,527) flagged as cross-portal duplicates
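The shares quoted in this list can be recomputed from the per-portal counts in the "Splits by portal" table. The sketch below hard-codes those counts; the variable and function names are illustrative.

```python
TOTAL_ROWS = 890_143

# Row counts for the 15 Italian portals, copied from the portal table.
ITALIAN_PORTALS = [
    69_603, 12_675, 5_427, 3_118, 2_580, 1_782, 1_377,
    448, 403, 365, 351, 186, 158, 124, 33,
]

rows_by_country = {
    "USA": 399_462,
    "Australia": 118_546 + 5_451,  # data.gov.au + discover.data.vic.gov.au
    "Italy": sum(ITALIAN_PORTALS),
    "France": 72_208,
}


def share(rows: int, total: int = TOTAL_ROWS) -> float:
    """Percentage of the full index represented by `rows`."""
    return 100 * rows / total


# Print countries from largest to smallest share.
for country, rows in sorted(rows_by_country.items(), key=lambda kv: -kv[1]):
    print(f"{country:<10} {rows:>8,} {share(rows):5.1f}%")

print(f"flagged duplicates: {share(75_527):.1f}%")
```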
### Limitations
- Metadata quality varies by portal -- some datasets have minimal descriptions or missing fields
- Tags and organization names are not standardized across portals
- The `metadata_created` and `metadata_modified` fields are portal-reported and may not reflect actual data updates
- License information is inconsistent; some portals use identifiers (`cc-by`), others use full titles (`Creative Commons Attribution 4.0`)
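One way to work around the inconsistent `license` values is a small alias table applied before grouping. The sketch below is illustrative only: the mapping covers just the two spellings mentioned above, the canonical label `CC-BY` is an arbitrary choice, and a real mapping would need to be built from the values actually present in the data.

```python
from typing import Optional

# Hypothetical alias table: lower-cased raw value -> canonical family label.
LICENSE_ALIASES = {
    "cc-by": "CC-BY",
    "creative commons attribution 4.0": "CC-BY",
}


def normalize_license(raw: Optional[str]) -> Optional[str]:
    """Map known license spellings to one label; pass unknown values through."""
    if not raw or not raw.strip():
        return None
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).casefold()
    return LICENSE_ALIASES.get(key, raw.strip())


for value in ("cc-by", "Creative Commons Attribution 4.0", "ODbL", None):
    print(repr(value), "->", repr(normalize_license(value)))
```

Collapsing both spellings into one label discards the license version, so keep the original column alongside the normalized one if the version matters.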
### Ethical considerations
This dataset contains only public metadata (titles, descriptions, tags) from government open data portals. It does not contain personal data. All source portals publish data under open licenses (typically CC-BY, CC0, or equivalent).
## Additional Information
### License
This dataset is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), consistent with the Ceres project. The underlying metadata is sourced from portals that publish under open licenses (CC-BY, CC0, ODbL, or national open data licenses).
### Citation
```bibtex
@misc{ceres-open-data-index-2026,
title={Ceres Open Data Index},
author={Andrea Bozzo},
year={2026},
url={https://github.com/AndreaBozzo/Ceres},
note={Aggregated metadata from 32 open data portals, snapshot 2026-03-30}
}
```
### Links
- [Ceres on GitHub](https://github.com/AndreaBozzo/Ceres)
- [CKAN API documentation](https://docs.ckan.org/en/latest/api/)
Provider:
AndreaBozzo



