
AndreaBozzo/ceres-open-data-index

Source: Hugging Face. Updated 2026-03-30; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AndreaBozzo/ceres-open-data-index
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- feature-extraction
language:
- it
- uk
- en
- de
- fr
- es
- ro
- ja
tags:
- open-data
- ckan
- dcat
- government
- metadata
- geospatial
pretty_name: Ceres Open Data Index
size_categories:
- 100K<n<1M
---

# Ceres Open Data Index

A curated index of **890,143 open datasets** from **32 open data portals** across 14 countries plus international sources, with cross-portal duplicates flagged. This is the largest aggregated open data metadata index available as a single downloadable resource.

## Dataset Description

This dataset contains normalized metadata from government and institutional open data portals. It was harvested by [Ceres](https://github.com/AndreaBozzo/Ceres), a search engine for open data, via CKAN and DCAT-AP APIs. Metadata has been flattened into a tabular schema, with noise filtered and cross-portal duplicates flagged.

### What's included

- Dataset titles, descriptions, URLs, tags, organizations, and licenses from 32 open data portals
- Cross-portal duplicate detection (`is_duplicate` flag)
- Portal-level language annotations
- Per-portal Parquet subsets for targeted analysis

### What's NOT included

- The actual data files (CSVs, shapefiles, etc.) -- only metadata
- Embedding vectors (model-specific; regenerate with any embedding model)
- Resource-level metadata (individual file URLs within datasets)

## Dataset Structure

### Schema

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `original_id` | string | no | Dataset ID from the source portal |
| `source_portal` | string | no | Portal base URL |
| `portal_name` | string | no | Human-readable portal name |
| `url` | string | no | Direct URL to the dataset page |
| `title` | string | no | Dataset title |
| `description` | string | yes | Dataset description |
| `tags` | string | yes | Comma-separated tag names |
| `organization` | string | yes | Publishing organization |
| `license` | string | yes | License title or identifier |
| `metadata_created` | string | yes | Original creation date (ISO 8601) |
| `metadata_modified` | string | yes | Last modification date (ISO 8601) |
| `first_seen_at` | string | no | When Ceres first indexed this dataset (RFC 3339) |
| `language` | string | yes | Primary language code (e.g. `it`, `en`, `de`) |
| `is_duplicate` | boolean | no | True if the same title appears in another portal |

### Files

- `all.parquet` -- Complete dataset (890,143 rows)
- `data/<portal>.parquet` -- Per-portal subsets (32 files)
- `metadata.json` -- Snapshot metadata with counts

### Splits by portal

| Portal | Country | Datasets |
|--------|---------|----------|
| catalog.data.gov | USA | 399,462 |
| data.gov.au | Australia | 118,546 |
| data.gouv.fr | France | 72,208 |
| dati.gov.it | Italy | 69,603 |
| open.canada.ca | Canada | 46,683 |
| data.gov.ua | Ukraine | 40,055 |
| data.humdata.org | International | 26,842 |
| ckan.open.nrw.de | Germany | 23,353 |
| data.gov.ie | Ireland | 20,677 |
| ckan.opendata.swiss | Switzerland | 12,931 |
| dati.toscana.it | Italy | 12,675 |
| catalog.data.metro.tokyo.lg.jp | Japan | 8,609 |
| discover.data.vic.gov.au | Australia | 5,451 |
| dati.regione.marche.it | Italy | 5,427 |
| data.gov.ro | Romania | 4,880 |
| catalogue.data.gov.bc.ca | Canada | 3,334 |
| dati.emilia-romagna.it | Italy | 3,118 |
| opendata.aragon.es | Spain | 2,882 |
| datos.gob.cl | Chile | 2,815 |
| dati.comune.milano.it | Italy | 2,580 |
| data.public.lu | Luxembourg | 2,505 |
| dati.puglia.it | Italy | 1,782 |
| dati.trentino.it | Italy | 1,377 |
| dati.regione.umbria.it | Italy | 448 |
| dati.lazio.it | Italy | 403 |
| dati.comune.roma.it | Italy | 365 |
| dati.regione.campania.it | Italy | 351 |
| opendata-hro.de | Germany | 280 |
| dati.regione.sicilia.it | Italy | 186 |
| dati.comune.genova.it | Italy | 158 |
| dati.regione.liguria.it | Italy | 124 |
| dati.comune.napoli.it | Italy | 33 |

### Loading the dataset

```python
import pandas as pd

# Load the complete index (890,143 rows)
df = pd.read_parquet("all.parquet")

# Load a single portal's subset
df_milano = pd.read_parquet("data/milano.parquet")

# Keep only rows not flagged as cross-portal duplicates
unique = df[~df["is_duplicate"]]
```

Or with the `datasets` library:

```python
from datasets import load_dataset

ds = load_dataset("AndreaBozzo/ceres-open-data-index")
```

## Dataset Creation

### Harvesting methodology

1. **Discovery**: Each portal's CKAN API (`/api/3/action/package_list`) provides the full list of dataset IDs
2. **Fetching**: Dataset metadata is retrieved via `/api/3/action/package_show` with concurrent requests (10 parallel, circuit breaker for resilience)
3. **Normalization**: CKAN package JSON is normalized into a flat schema with title, description, URL, and content hash (SHA-256)
4. **Incremental sync**: Only new or modified datasets are re-fetched on subsequent runs, using content hashes and CKAN's `package_search?fq=metadata_modified:[timestamp TO *]`
5. **Export**: The `ceres export --format parquet` command streams all datasets from PostgreSQL, applies curation filters, flattens JSONB metadata, detects cross-portal duplicates, and writes compressed Parquet files

### Curation

The following curation steps are applied during export:

- **Noise filtering** (10,331 rows removed): Titles shorter than 5 characters, titles containing "test"/"prova"/"esempio", datasets with empty descriptions
- **Duplicate flagging** (75,527 rows marked): Datasets whose title (case-insensitive) appears in more than one portal are flagged with `is_duplicate=true` but kept in the dataset
- **Metadata flattening**: Tags, organization, and license are extracted from nested CKAN JSON into top-level string columns

### Source data

All data comes from publicly accessible CKAN and DCAT-AP API endpoints. No authentication is required. The portals are government-operated or institutional open data catalogs.

### Update frequency

This snapshot was generated on **2026-03-30**. The Ceres harvester can refresh the index at any time; new snapshots are published monthly.

## Considerations

### Known biases

- **Geographic skew**: The USA (catalog.data.gov) is the largest single portal and alone accounts for ~45% of the index, followed by Australia (~14% across federal + Victoria), Italy (~11% across all Italian portals), and France (~8%)
- **Language distribution**: Predominantly English (USA, Australia, Canada, Ireland, HDX) and Italian, with significant French (France, Luxembourg, Canada), Ukrainian, German, Japanese, Spanish, and Romanian content
- **Portal selection bias**: Primarily CKAN-based portals, with some DCAT-AP portals (France, Luxembourg). Major platforms using Socrata (UK) or custom APIs are not yet represented
- **Aggregation duplicates**: dati.gov.it aggregates regional Italian data, which contributes to the ~8% of rows flagged as cross-portal duplicates

### Limitations

- Metadata quality varies by portal -- some datasets have minimal descriptions or missing fields
- Tags and organization names are not standardized across portals
- The `metadata_created` and `metadata_modified` fields are portal-reported and may not reflect actual data updates
- License information is inconsistent; some portals use identifiers (`cc-by`), others use full titles (`Creative Commons Attribution 4.0`)

### Ethical considerations

This dataset contains only public metadata (titles, descriptions, tags) from government open data portals. It does not contain personal data. All source portals publish data under open licenses (typically CC-BY, CC0, or equivalent).

## Additional Information

### License

This dataset is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), consistent with the Ceres project. The underlying metadata is sourced from portals that publish under open licenses (CC-BY, CC0, ODbL, or national open data licenses).

### Citation

```bibtex
@misc{ceres-open-data-index-2026,
  title={Ceres Open Data Index},
  author={Andrea Bozzo},
  year={2026},
  url={https://github.com/AndreaBozzo/Ceres},
  note={Aggregated metadata from 32 open data portals, snapshot 2026-03-30}
}
```

### Links

- [Ceres on GitHub](https://github.com/AndreaBozzo/Ceres)
- [CKAN API documentation](https://docs.ckan.org/en/latest/api/)
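
### Sketch: reproducing the duplicate flag

The duplicate-flagging rule described under Curation (a title that appears, case-insensitively, in more than one portal) can be approximated on the exported table. The following is a minimal sketch, not the Ceres implementation: the helper name `flag_cross_portal_duplicates` and the sample rows are hypothetical, and it assumes plain lowercasing is the only normalization applied, which the card does not confirm.

```python
import pandas as pd


def flag_cross_portal_duplicates(df: pd.DataFrame) -> pd.Series:
    """Return a boolean Series: True where the lowercased title occurs in 2+ portals."""
    # Assumption: "case-insensitive" means plain lowercasing and nothing more.
    title_key = df["title"].str.lower()
    # For every row, count the distinct portals that share its title key.
    portals_per_title = df.groupby(title_key)["source_portal"].transform("nunique")
    return portals_per_title > 1


# Hypothetical sample rows; the portal URLs are placeholders.
sample = pd.DataFrame(
    {
        "title": ["Air Quality", "air quality", "Budget", "Budget"],
        "source_portal": [
            "https://a.example",
            "https://b.example",
            "https://a.example",
            "https://a.example",
        ],
    }
)
flags = flag_cross_portal_duplicates(sample)
```

Under this reading, a title repeated only within one portal is not flagged, because the rule counts distinct portals rather than rows.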
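
### Sketch: an incremental sync query

Step 4 of the harvesting methodology filters on `metadata_modified` through CKAN's `package_search` endpoint. As a rough illustration of what such a request could look like, the sketch below only builds the URL and performs no network call. The helper `build_incremental_query`, the placeholder portal URL, the `rows` page size, and the timestamp format (UTC with a trailing `Z`) are assumptions for illustration; the card does not specify how Ceres formats the query.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_incremental_query(portal_url: str, since: datetime, rows: int = 100) -> str:
    """Build a package_search URL for datasets modified at or after `since`."""
    # Assumption: the portal accepts UTC timestamps with a trailing Z in the range filter.
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "fq": f"metadata_modified:[{stamp} TO *]",
        "rows": rows,  # hypothetical page size
    }
    base = portal_url.rstrip("/")
    return f"{base}/api/3/action/package_search?{urlencode(params)}"


# Placeholder portal URL, not one of the 32 indexed portals.
url = build_incremental_query(
    "https://portal.example/",
    datetime(2026, 3, 1, tzinfo=timezone.utc),
)
```

A harvester following the card's description would presumably still compare content hashes on the returned packages, since a changed `metadata_modified` value does not guarantee changed content.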
Provider:
AndreaBozzo