theodi/ndl-core-structured-data
Hugging Face · Updated 2026-01-24 · Indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/theodi/ndl-core-structured-data
Description:
---
pretty_name: NDL Core Structured Data
size_categories:
- 100M<n<1B
---
# NDL Core – Structured Data
## Overview
**NDL Core – Structured Data** is a curated collection of structured UK public sector datasets, converted into **Apache Parquet** format for efficient analytics and machine learning workflows.
This repository is part of the broader [NDL Core Corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus), which combines both **textual** and **structured** data sourced from authoritative UK government and public sector platforms.
Textual sources (e.g. GOV.UK, Hansard, legislation.gov.uk) are hosted separately, while this repository focuses exclusively on **tabular / structured datasets**.
The goal of this dataset is to provide a **clean, analysis-ready foundation** for research, policy analysis, data science, and downstream AI applications.
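Because the tables are stored as Parquet, a single file can usually be read straight into a DataFrame. The sketch below is one possible way to do that, not an official loader: it builds a direct-download URL using the Hugging Face `resolve/<revision>/<path>` pattern and hands it to `pandas.read_parquet`. The helper names `parquet_url` and `load_table` are illustrative, and since this card does not list the repository's file layout, any file path passed in is an assumption to be checked against the repository.

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co/datasets"
REPO_ID = "theodi/ndl-core-structured-data"


def parquet_url(repo_id: str, path_in_repo: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face dataset repo."""
    # quote() keeps "/" in the path but escapes spaces and other unsafe characters.
    return f"{HF_BASE}/{repo_id}/resolve/{quote(revision, safe='')}/{quote(path_in_repo)}"


def load_table(path_in_repo: str, columns=None):
    """Read one Parquet file into a DataFrame (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so parquet_url works without pandas

    # Passing `columns` lets Parquet skip the columns that are not needed.
    return pd.read_parquet(parquet_url(REPO_ID, path_in_repo), columns=columns)
```

A call would look like `load_table("path/to/table.parquet", columns=["some_column"])`, where both the path and the column name are placeholders.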
---
## Data Sources
The structured data in this repository has been collected from the following UK public sector sources:
### 1. data.gov.uk
* Top 10 most recent datasets per category at the time of crawling
* Covers a wide range of domains including transport, environment, health, education, and government operations
* Original formats varied (CSV, XLSX, JSON), all normalized and converted to Parquet
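The card does not describe how the original files were normalized, so the following is only a minimal sketch of what such a step might look like. It assumes CSV input with a header row and JSON input that is an array of flat objects, and it parses both into the same list-of-dicts shape with cleaned column names. The function names are hypothetical. XLSX parsing and the final Parquet write both need third-party libraries (for example `pandas.DataFrame.to_parquet`), so they are left out to keep the example dependency-free.

```python
import csv
import io
import json
import re


def normalize_column(name: str) -> str:
    """Lower-case a header and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def parse_records(text: str, fmt: str) -> list[dict]:
    """Parse CSV or JSON text into row dicts that share normalized column names."""
    if fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(text)))
    elif fmt == "json":
        rows = json.loads(text)  # assumed: a JSON array of flat objects
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return [{normalize_column(key): value for key, value in row.items()} for row in rows]
```

Note that CSV values stay as strings in this sketch; type inference would be a separate step before writing typed Parquet columns.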
### 2. Office for National Statistics (ONS)
* Official UK statistics on population, economy, labour market, health, and more
* Includes high-value national datasets frequently used in research and policymaking
### 3. Defra (Department for Environment, Food & Rural Affairs)
* Environmental, agricultural, and rural data
* Includes datasets related to land use, climate, farming, and natural resources
---
## Intended Use Cases
* Policy analysis and evaluation
* Socioeconomic and environmental research
* Public sector analytics
* Feature engineering for ML models
* Retrieval-augmented generation (RAG) pipelines
* Data integration with textual government corpora
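For the RAG and text-integration use cases, one common approach (an assumption here, not something this card prescribes) is to serialize each table row into a short text passage that can be embedded and indexed alongside documents from the textual corpus. The helper `row_to_passage` and the `source` label are hypothetical names used for illustration.

```python
def row_to_passage(row: dict, source: str) -> str:
    """Render one table row as a single line of text for a retrieval index."""
    # Skip missing or empty values so the passage carries only real content.
    fields = "; ".join(
        f"{key}: {value}" for key, value in row.items() if value not in (None, "")
    )
    # Prefix with the source label so a retrieved passage can be traced back.
    return f"[{source}] {fields}"
```

Keeping the source label in the passage makes it easier to cite the originating table when a retrieved row is used in a generated answer.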
---
## Relationship to NDL Core Corpus
This repository contains **only structured data**.
Related textual datasets (government guidance, parliamentary debates, legislation) are hosted separately in the [NDL Core Corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus) repository.
Together, these datasets enable **hybrid structured + unstructured analysis** across UK public sector information.
---
## Limitations
* Data reflects the state of the source datasets at crawl time
* Some datasets may be incomplete, deprecated, or superseded upstream
* Schema consistency varies across sources
* No guarantee of real-time updates
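Since schema consistency varies across sources, it can help to compare column names and types before combining tables. The sketch below assumes each table's schema has already been extracted into a plain `{column: type_name}` mapping (for instance from `pyarrow.parquet.read_schema`); the function name and the example type strings are illustrative only.

```python
def compare_schemas(schemas: dict[str, dict[str, str]]) -> dict:
    """Summarize column overlap and type conflicts across several tables."""
    column_sets = [set(columns) for columns in schemas.values()]
    if not column_sets:
        return {"shared": [], "conflicts": {}}

    shared = set.intersection(*column_sets)  # columns present in every table
    every_column = set.union(*column_sets)   # columns present in any table

    conflicts = {}
    for column in every_column:
        # Collect the distinct types this column has across the tables that contain it.
        types = {columns[column] for columns in schemas.values() if column in columns}
        if len(types) > 1:
            conflicts[column] = sorted(types)
    return {"shared": sorted(shared), "conflicts": conflicts}
```

Columns listed under `conflicts` would need casting to a common type before any concatenation or join.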
Provider: theodi



