
theodi/ndl-core-structured-data

Source: Hugging Face. Updated 2026-01-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/theodi/ndl-core-structured-data
Resource overview:
---
pretty_name: NDL Core Structured Data
size_categories:
- 100M<n<1B
---

# NDL Core – Structured Data

## Overview

**NDL Core – Structured Data** is a curated collection of structured UK public sector datasets, converted into **Apache Parquet** format for efficient analytics and machine learning workflows.

This repository is part of the broader [NDL Core Corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus), which combines both **textual** and **structured** data sourced from authoritative UK government and public sector platforms. Textual sources (e.g. GOV.UK, Hansard, legislation.gov.uk) are hosted separately, while this repository focuses exclusively on **tabular / structured datasets**.

The goal of this dataset is to provide a **clean, analysis-ready foundation** for research, policy analysis, data science, and downstream AI applications.

---

## Data Sources

The structured data in this repository has been collected from the following UK public sector sources:

### 1. data.gov.uk

* Top 10 most recent datasets per category at the time of crawling
* Covers a wide range of domains including transport, environment, health, education, and government operations
* Original formats varied (CSV, XLSX, JSON), all normalized and converted to Parquet

### 2. Office for National Statistics (ONS)

* Official UK statistics on population, economy, labour market, health, and more
* Includes high-value national datasets frequently used in research and policymaking

### 3. Defra (Department for Environment, Food & Rural Affairs)

* Environmental, agricultural, and rural data
* Includes datasets related to land use, climate, farming, and natural resources

---

## Intended Use Cases

* Policy analysis and evaluation
* Socioeconomic and environmental research
* Public sector analytics
* Feature engineering for ML models
* Retrieval-augmented generation (RAG) pipelines
* Data integration with textual government corpora

---

## Relationship to NDL Core Corpus

This repository contains **only structured data**. Related textual datasets (government guidance, parliamentary debates, legislation) are hosted separately in the [NDL Core Corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus) repository.

Together, these datasets enable **hybrid structured + unstructured analysis** across UK public sector information.

---

## Limitations

* Data reflects the state of the source datasets at crawl time
* Some datasets may be incomplete, deprecated, or superseded upstream
* Schema consistency varies across sources
* No guarantee of real-time updates
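
Because schema consistency varies across sources, it may help to compare column names and types before combining files. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the Parquet files have already been downloaded locally and that `pyarrow` is installed. The helper names (`read_parquet_schema`, `compare_schemas`), the file labels, and the column names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def read_parquet_schema(path):
    """Return {column_name: type_string} for a local Parquet file.

    Assumption: pyarrow is installed. It is imported lazily so that
    compare_schemas() below can be used without it.
    """
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)  # reads file metadata only, not the rows
    return dict(zip(schema.names, (str(t) for t in schema.types)))


def compare_schemas(schemas):
    """Compare several schemas given as {label: {column: type_string}}.

    Returns a tuple (shared, conflicts):
      shared    - sorted list of columns present in every schema
      conflicts - {column: {label: type_string}} for columns whose type
                  differs between at least two schemas
    """
    types_by_column = defaultdict(dict)
    for label, schema in schemas.items():
        for column, dtype in schema.items():
            types_by_column[column][label] = dtype

    shared = sorted(
        column
        for column, seen in types_by_column.items()
        if len(seen) == len(schemas)
    )
    conflicts = {
        column: seen
        for column, seen in types_by_column.items()
        if len(set(seen.values())) > 1
    }
    return shared, conflicts


# Hypothetical schemas standing in for two downloaded files.
example = {
    "a.parquet": {"area_code": "string", "year": "int64", "value": "double"},
    "b.parquet": {"area_code": "string", "year": "string"},
}
shared_columns, type_conflicts = compare_schemas(example)
print(shared_columns)
print(type_conflicts)
```

With real files, the input could be built as `{path: read_parquet_schema(path) for path in paths}`. Columns reported in `conflicts` would likely need an explicit cast before the tables are joined or concatenated.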
Provider:
theodi