Daksh17440/global_population_data
Source: Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/Daksh17440/global_population_data
---
license: apache-2.0
language:
- en
pretty_name: LandScan Population Raster
tags:
- population
- landscan
- timeseries
- geospatial
- hdf5
- global population
size_categories:
- 100B<n<1T
task_categories:
- feature-extraction
---
# 🌍 LandScan Global Population Dataset — `pop_2000-23.h5`
A production-ready, chunked HDF5 tensor of **LandScan Global** annual population estimates from **2000 to 2023** — packaged for efficient use in deep learning, geospatial analysis, and HPC workflows. No more downloading 24 separate GeoTIFFs.
---
## 📦 Dataset at a Glance
| Property | Value |
|---|---|
| **Source** | LandScan Global — Oak Ridge National Laboratory (ORNL) |
| **Years covered** | 2000 – 2023 (24 time steps) |
| **Spatial resolution** | ~1 km (30 arc-seconds) |
| **Spatial extent** | Global (180°W–180°E, 90°S–90°N) |
| **Master grid** | 21,600 rows × 43,200 columns |
| **CRS** | WGS84 / EPSG:4326 |
| **Unit** | Ambient population count per pixel |
| **Data type** | float32 |
| **Format** | HDF5 (chunked + GZIP compressed) |
| **Chunk shape** | (1, 256, 256) — time × lat × lon |
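
The grid dimensions in the table imply the uncompressed sizes below. This is a back-of-the-envelope sketch using plain arithmetic (no file access); the constant names are illustrative, and the compressed size on disk will be smaller by an amount that depends on the data.

```python
# Uncompressed size implied by the grid in the table above.
N_YEARS, N_ROWS, N_COLS = 24, 21_600, 43_200
BYTES_PER_VALUE = 4  # float32

bytes_per_year = N_ROWS * N_COLS * BYTES_PER_VALUE
bytes_total = N_YEARS * bytes_per_year

print(f"one year : {bytes_per_year / 1e9:.2f} GB ({bytes_per_year / 2**30:.2f} GiB)")
print(f"all years: {bytes_total / 1e9:.2f} GB ({bytes_total / 2**30:.2f} GiB)")
```

One global year is roughly 3.7 GB in memory and the full tensor roughly 90 GB, which is why the examples in this card favour cropped reads.
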
> **What is LandScan?**
> LandScan represents ambient population — the average number of people present in a location over 24 hours — rather than residential census counts. It integrates census data, land cover, roads, slope, and remote sensing to model where people actually are, not just where they live.
---
## 🗂️ File Structure
```
pop_2000-23.h5
│
├── /population          float32  (24, 21600, 43200)   ← main data tensor
│       dim[0] → time    24 annual steps (2000–2023)
│       dim[1] → lat     21,600 latitude rows  (90°N → 90°S)
│       dim[2] → lon     43,200 longitude cols (180°W → 180°E)
│
├── /coords
│   ├── years            int32    (24,)      [2000, 2001, …, 2023]
│   ├── lat              float64  (21600,)   centre latitude of each row (°N)
│   └── lon              float64  (43200,)   centre longitude of each col (°E)
│
├── /native_extent
│   ├── years            int32    (24,)      year index
│   ├── n_rows           int32    (24,)      native row count per year
│   └── n_cols           int32    (24,)      native col count per year
│
└── /stats
    ├── mean             float32  (24,)      mean over inhabited pixels
    ├── std              float32  (24,)      std over inhabited pixels
    ├── max              float32  (24,)      max pixel value per year
    ├── total_pop        float32  (24,)      global population sum per year
    └── nan_fraction     float32  (24,)      fraction of NaN pixels per year
```
### NaN semantics
There are two distinct sources of `NaN` in this dataset:
| NaN type | Meaning |
|---|---|
| Within native extent, flagged as nodata | Ocean, permanent ice, or uninhabited area |
| Beyond native extent | That LandScan release had a smaller grid (2001–2012 were 20,880 rows) — data simply did not exist |
The `/native_extent` group tells you exactly how many rows and columns contained real data for each year, so your code can mask accordingly.
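
The distinction in the table can be sketched as a small classifier. This is an illustration, not part of the dataset's tooling: `nan_kind` is a hypothetical helper, and `row_offset` / `col_offset` (where the native grid sits inside the master grid) are assumptions that this card does not state, so verify them against the data before relying on the result.

```python
def nan_kind(row, col, value_is_nan, n_rows, n_cols, row_offset, col_offset):
    """Classify one master-grid pixel for a given year.

    n_rows, n_cols         : that year's entries from /native_extent
    row_offset, col_offset : ASSUMED position of the native grid's top-left
                             corner inside the master grid (not documented)
    """
    inside = (row_offset <= row < row_offset + n_rows
              and col_offset <= col < col_offset + n_cols)
    if not inside:
        return "outside_native_extent"   # no data existed for this release
    return "nodata" if value_is_nan else "valid"


# Example with a 20,880-row year and a hypothetical offset of 720 rows
print(nan_kind(0, 0, True, 20_880, 43_200, row_offset=720, col_offset=0))
print(nan_kind(5_000, 100, True, 20_880, 43_200, row_offset=720, col_offset=0))
```
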
---
## ⚡ Quickstart — Partial Reads (No Full Download Needed)
Hugging Face serves large files over HTTP with [range-request support](https://huggingface.co/docs/hub/datasets-adding#large-files). Because the HDF5 dataset is chunked as `(1, 256, 256)`, **only the chunks a read touches are transferred over the network**, so a partial read does not require downloading the full file.
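
To get a feel for how much a given read costs, the number of chunks it overlaps can be counted from the slice bounds alone. The helper below is a rough sketch with a hypothetical name (`chunks_touched`); it assumes the `(1, 256, 256)` chunk shape stated above and says nothing about compressed bytes, which depend on the content of each chunk.

```python
def chunks_touched(n_years, row_slice, col_slice, chunk=(1, 256, 256)):
    """Count the HDF5 chunks overlapped by a (time, row, col) read."""
    def n_chunks(start, stop, size):
        # index of the last touched chunk minus index of the first, plus one
        return (stop - 1) // size - start // size + 1

    return (n_chunks(0, n_years, chunk[0])
            * n_chunks(row_slice.start, row_slice.stop, chunk[1])
            * n_chunks(col_slice.start, col_slice.stop, chunk[2]))


# One full global year versus one pixel across all 24 years
print(chunks_touched(1, slice(0, 21_600), slice(0, 43_200)))
print(chunks_touched(24, slice(100, 101), slice(200, 201)))
```

Each chunk holds 256 × 256 float32 values, that is 256 KiB before compression, so even a single-pixel read fetches and decompresses one whole chunk per year.
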
### Install dependencies
```bash
pip install h5py numpy fsspec huggingface_hub
```
### Open the file remotely
```python
import h5py
import numpy as np
from huggingface_hub import hf_hub_url

# Stream directly from Hugging Face (no full download)
url = hf_hub_url(
    repo_id   = "Daksh17440/global_population_data",
    filename  = "pop_2000-23.h5",
    repo_type = "dataset",
)

# ROS3 (read-only S3) driver: reads the file via HTTP range requests.
# Requires an h5py/HDF5 build with ROS3 enabled.
f = h5py.File(url, "r", driver="ros3")
pop = f["population"] # shape (24, 21600, 43200) — not yet loaded
lat = f["coords/lat"][:]
lon = f["coords/lon"][:]
yrs = f["coords/years"][:] # [2000, 2001, …, 2023]
```
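> **Tip:** The `ros3` driver needs an HDF5 build with ROS3 support. The conda-forge build includes it: `conda install -c conda-forge h5py`. pip wheels may or may not include it depending on version and platform, so check `"ros3" in h5py.registered_drivers()` before relying on it.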
---
## 🔍 Usage Examples
### 1. Read a single year
```python
# Year 2020 is at index 20 (2020 - 2000 = 20)
pop_2020 = f["population"][20, :, :]   # shape (21600, 43200)
# ~3.7 GB in RAM (21600 × 43200 × 4 bytes); every chunk of that year is fetched
```
### 2. Look up the index for any year
```python
years = f["coords/years"][:]

def year_idx(y):
    idx = np.where(years == y)[0]
    if len(idx) == 0:
        raise ValueError(f"Year {y} not in dataset")
    return int(idx[0])

pop_2015 = f["population"][year_idx(2015), :, :]
```
### 3. Spatial crop — bounding box query
```python
lat = f["coords/lat"][:]
lon = f["coords/lon"][:]

def bbox_slice(lat_min, lat_max, lon_min, lon_max):
    """Return numpy index slices for a lat/lon bounding box."""
    row = np.where((lat >= lat_min) & (lat <= lat_max))[0]
    col = np.where((lon >= lon_min) & (lon <= lon_max))[0]
    return slice(row[0], row[-1]+1), slice(col[0], col[-1]+1)

# South Asia: 5–35°N, 65–95°E
rs, cs = bbox_slice(5, 35, 65, 95)

# Single-year crop: minimal network transfer
south_asia_2023 = f["population"][23, rs, cs]

# All-years crop: full time series for the region
# 30° at 120 pixels per degree gives about 3600 pixels per side
south_asia_all = f["population"][:, rs, cs]   # shape (24, ~3600, ~3600)
```
### 4. Time-range + spatial crop together
```python
# India, 2010–2020
years = f["coords/years"][:]
t_mask = np.where((years >= 2010) & (years <= 2020))[0]
rs, cs = bbox_slice(8, 37, 68, 97)
india_decade = f["population"][t_mask[0]:t_mask[-1]+1, rs, cs]
# shape: (11, H_india, W_india)
```
### 5. Country / region centroids — point time series
```python
# Population at a single point over all years (full time series)
# New Delhi: 28.6°N, 77.2°E
lat_idx = int(np.argmin(np.abs(lat - 28.6)))
lon_idx = int(np.argmin(np.abs(lon - 77.2)))
delhi_series = f["population"][:, lat_idx, lon_idx] # shape (24,)
# Returns 24 values, but each one fetches and decompresses a full 256×256 chunk
```
### 6. Global population trend (no pixel reads needed)
```python
# Pre-computed: instant, no pixel data transferred
total_pop = f["stats/total_pop"][:]
years = f["coords/years"][:]

# Loop variable is named `total` so it does not shadow the `pop` dataset handle
for yr, total in zip(years, total_pop):
    print(f" {yr}: {total/1e9:.3f} billion")
```
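
Once the yearly totals are loaded, growth rates follow from simple ratios. The helpers below are a minimal sketch with hypothetical names (`cagr`, `yearly_change`), and the numbers in the example call are illustrative placeholders rather than values from the dataset. Year-to-year changes in LandScan also reflect model revisions between releases, so treat short-term rates with caution.

```python
def cagr(first_value, last_value, n_years):
    """Compound annual growth rate between two totals n_years apart."""
    return (last_value / first_value) ** (1.0 / n_years) - 1.0


def yearly_change(values):
    """Fractional change from each year to the next."""
    return [b / a - 1.0 for a, b in zip(values, values[1:])]


# Illustrative placeholder totals, NOT dataset values
example_totals = [100.0, 110.0, 121.0]
print(f"CAGR: {cagr(example_totals[0], example_totals[-1], 2):.2%}")
print([f"{c:.2%}" for c in yearly_change(example_totals)])

# With the arrays from the example above one might call:
#   cagr(float(total_pop[0]), float(total_pop[-1]), int(years[-1] - years[0]))
```
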
### 7. Use with xarray (NetCDF-style labelled arrays)
```python
import xarray as xr
import h5py
import numpy as np

# bbox_slice() is the helper defined in example 3
with h5py.File("pop_2000-23.h5", "r") as f:
    # Load a region into memory together with its coordinate labels
    rs, cs = bbox_slice(5, 35, 65, 95)
    data  = f["population"][:, rs, cs]
    years = f["coords/years"][:]
    lats  = f["coords/lat"][rs]
    lons  = f["coords/lon"][cs]

da = xr.DataArray(
    data,
    dims   = ["time", "lat", "lon"],
    coords = {"time": years, "lat": lats, "lon": lons},
    name   = "population",
    attrs  = {"units": "persons per pixel", "source": "LandScan Global"},
)

# Now use xarray operations
annual_mean = da.mean(dim=["lat", "lon"])
trend       = da.sel(time=slice(2010, 2020))
```
### 8. PyTorch — lazy streaming Dataset
```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class LandScanDataset(Dataset):
    """
    Streams spatial patches on demand.
    Never loads the full tensor into RAM.

    Parameters
    ----------
    h5_path    : local path or remote ROS3 URL
    year_range : (start_year, end_year) inclusive, e.g. (2010, 2020)
    bbox       : (lat_min, lat_max, lon_min, lon_max) or None for global
    patch_size : spatial size of each returned patch (pixels)
    stride     : step between patch centres
    driver     : h5py driver, e.g. "ros3" for a remote URL (None = default)
    """
    def __init__(self, h5_path, year_range=(2000, 2023),
                 bbox=None, patch_size=256, stride=128, driver=None):
        self.h5_path, self.driver = h5_path, driver
        self.f = None  # opened lazily so each DataLoader worker gets its own handle

        # Read the small coordinate arrays once, then close the file again
        with h5py.File(h5_path, "r", driver=driver) as f:
            self.years = f["coords/years"][:]
            lat = f["coords/lat"][:]
            lon = f["coords/lon"][:]
            n_rows, n_cols = f["population"].shape[1:]

        # Time axis
        t_mask = np.where((self.years >= year_range[0]) & (self.years <= year_range[1]))[0]
        self.t0, self.t1 = int(t_mask[0]), int(t_mask[-1]) + 1
        self.T = self.t1 - self.t0

        # Spatial axis
        if bbox is not None:
            lat_m = np.where((lat >= bbox[0]) & (lat <= bbox[1]))[0]
            lon_m = np.where((lon >= bbox[2]) & (lon <= bbox[3]))[0]
            self.r0, self.r1 = int(lat_m[0]), int(lat_m[-1]) + 1
            self.c0, self.c1 = int(lon_m[0]), int(lon_m[-1]) + 1
        else:
            self.r0, self.r1 = 0, n_rows
            self.c0, self.c1 = 0, n_cols

        H, W = self.r1 - self.r0, self.c1 - self.c0
        self.ps = patch_size

        # All valid patch top-left corners (+1 keeps the patch flush with the far edge)
        self.patches = [
            (r, c)
            for r in range(0, H - patch_size + 1, stride)
            for c in range(0, W - patch_size + 1, stride)
        ]

    def __len__(self):
        return len(self.patches) * self.T

    def __getitem__(self, idx):
        if self.f is None:  # first access in this process
            self.f = h5py.File(self.h5_path, "r", driver=self.driver)

        t_rel = idx % self.T
        p_idx = idx // self.T
        r, c = self.patches[p_idx]
        t = self.t0 + t_rel
        r_abs, c_abs = self.r0 + r, self.c0 + c

        patch = self.f["population"][t, r_abs:r_abs+self.ps, c_abs:c_abs+self.ps]
        patch = patch.astype(np.float32)

        # Replace NaN with 0 for model input (or use a mask)
        nan_mask = np.isnan(patch)
        patch = np.nan_to_num(patch, nan=0.0)

        return {
            "population" : torch.from_numpy(patch[None]),     # (1, ps, ps)
            "nan_mask"   : torch.from_numpy(nan_mask[None]),  # (1, ps, ps)
            "year"       : torch.tensor(int(self.years[t])),  # calendar year
        }

    def __del__(self):
        if getattr(self, "f", None) is not None:
            self.f.close()

# ── Example usage ─────────────────────────────────────────────────────────────
from torch.utils.data import DataLoader

ds = LandScanDataset(
    h5_path    = "pop_2000-23.h5",
    year_range = (2015, 2023),
    bbox       = (5, 35, 65, 95),   # South Asia
    patch_size = 256,
    stride     = 128,
)
loader = DataLoader(ds, batch_size=8, shuffle=True, num_workers=4)

for batch in loader:
    x    = batch["population"]   # (8, 1, 256, 256)
    mask = batch["nan_mask"]     # (8, 1, 256, 256)
    yr   = batch["year"]
    break
```
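
The flat indexing behind `__len__` and `__getitem__` can be checked in isolation, without touching the file. The sketch below uses hypothetical helper names (`patch_origins`, `decode_index`) and counts every patch that fits fully inside the region, including the one flush with the far edge (hence the `+ 1` in the range bounds).

```python
def patch_origins(height, width, patch_size, stride):
    """Top-left corners of all patches that fit fully inside the region."""
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(r, c) for r in rows for c in cols]


def decode_index(idx, n_times):
    """Split a flat sample index into (patch index, time offset)."""
    return idx // n_times, idx % n_times


origins = patch_origins(512, 512, patch_size=256, stride=128)
print(len(origins), origins[-1])
print(decode_index(10, n_times=9))
```

Consecutive flat indices walk through all years of one patch before moving to the next patch, which keeps reads from the same spatial location adjacent when the loader is not shuffled.
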
### 9. Normalize using pre-computed stats
```python
with h5py.File("pop_2000-23.h5", "r") as f:
    means = f["stats/mean"][:]   # (24,), one value per year
    stds  = f["stats/std"][:]    # (24,)
    years = f["coords/years"][:]

    # Normalize a patch for year 2018
    t = int(np.where(years == 2018)[0][0])   # index the match before converting to int
    pop_2018_patch = f["population"][t, 5000:5256, 8000:8256]
    normalized = (pop_2018_patch - means[t]) / (stds[t] + 1e-8)
```
### 10. HPC / MPI parallel reads
```python
# h5py supports MPI-IO for multi-node HPC jobs
# (requires h5py built against parallel HDF5)
# Launch with: mpirun -n 8 python script.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

with h5py.File("pop_2000-23.h5", "r", driver="mpio", comm=comm) as f:
    T = f["population"].shape[0]
    my_years = np.array_split(np.arange(T), size)[rank]
    for t in my_years:
        slab = f["population"][t, :, :]
        # Each rank reads only its own years, so ranks do not contend.
        # Sum in float64: float32 cannot hold totals near 8e9 precisely.
        result = np.nansum(slab, dtype=np.float64)
        print(f" rank={rank} t={t} total={result/1e9:.3f}B")
```
---
## 🗺️ Native Extent per Year
Years 2001–2012 have fewer rows than the master grid because older LandScan releases used a slightly cropped polar extent. Pixels beyond the native extent are `NaN`.
```python
with h5py.File("pop_2000-23.h5", "r") as f:
    ext_years = f["native_extent/years"][:]
    n_rows = f["native_extent/n_rows"][:]
    n_cols = f["native_extent/n_cols"][:]

for yr, h, w in zip(ext_years, n_rows, n_cols):
    flag = " ← cropped" if h < 21600 else ""
    print(f" {yr}: {h} × {w}{flag}")
```
Expected output:
```
2000: 21600 × 43200
2001: 20880 × 43200 ← cropped
...
2012: 20880 × 43200 ← cropped
2013: 21600 × 43200
...
2023: 21600 × 43200
```
---
## 📐 Coordinate Reference
```
Top-left pixel centre    : 89.9958°N, 179.9958°W
Bottom-right pixel centre: 89.9958°S, 179.9958°E
Pixel size               : 0.008333° (~0.926 km at the equator; east-west width shrinks with cos(latitude))
```
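
Because the grid is regular in degrees rather than in kilometres, the ground area of a pixel depends on its latitude. The sketch below estimates it on a spherical Earth; `pixel_area_km2` is a hypothetical helper and the mean-radius constant is an approximation, so treat the result as indicative rather than survey-grade.

```python
import math

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius (spherical approximation)
RES_DEG = 1.0 / 120.0         # 30 arc-seconds


def pixel_area_km2(lat_centre_deg, res_deg=RES_DEG):
    """Approximate area of one grid cell centred at the given latitude."""
    half = math.radians(res_deg) / 2.0
    lat = math.radians(lat_centre_deg)
    # Area of a lat/lon cell on a sphere:
    # R^2 * dlon * (sin(lat_top) - sin(lat_bottom))
    return (EARTH_RADIUS_KM ** 2 * math.radians(res_deg)
            * (math.sin(lat + half) - math.sin(lat - half)))


for lat_deg in (0.0, 28.6, 60.0):
    print(f"{lat_deg:5.1f}°: {pixel_area_km2(lat_deg):.3f} km²")
```

Dividing a pixel's population count by this area gives an approximate density in persons per km², which makes values at different latitudes comparable.
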
```python
# Convert lat/lon to row/col index
def latlon_to_idx(lat_val, lon_val, lat_arr, lon_arr):
    row = int(np.argmin(np.abs(lat_arr - lat_val)))
    col = int(np.argmin(np.abs(lon_arr - lon_val)))
    return row, col
```
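
When the coordinate arrays are not loaded, the same lookup can be done arithmetically. This sketch assumes an edge-aligned grid whose top-left corner sits at exactly 90°N, 180°W with 120 pixels per degree; that alignment is an assumption, and `latlon_to_idx_analytic` is a hypothetical name. The `/coords/lat` and `/coords/lon` arrays in the file remain the authoritative reference.

```python
import math

PIXELS_PER_DEG = 120              # 30 arc-second grid
N_ROWS, N_COLS = 21_600, 43_200   # master grid size


def latlon_to_idx_analytic(lat_val, lon_val):
    """Row/col of the cell containing a point (ASSUMES an edge-aligned grid
    with its top-left corner at 90°N, 180°W)."""
    row = math.floor((90.0 - lat_val) * PIXELS_PER_DEG)
    col = math.floor((lon_val + 180.0) * PIXELS_PER_DEG)
    # Clamp so that lat = -90 and lon = 180 map to the last row/col
    return min(max(row, 0), N_ROWS - 1), min(max(col, 0), N_COLS - 1)


print(latlon_to_idx_analytic(90.0, -180.0))
print(latlon_to_idx_analytic(-90.0, 180.0))
```
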
---
## ⚠️ Known Issues & Limitations
- **2001–2012 polar crop**: 720 rows missing at the poles (≥ ~83.5°N / ≤ ~83.5°S). These are ocean/ice — NaN fill has no impact on population analysis.
- **NaN ≠ zero population**: Do not fill NaN with 0 indiscriminately — ocean pixels and missing-extent pixels are both NaN but have different meanings. Use `/native_extent` to distinguish them if needed.
- **Ambient vs residential**: LandScan is *ambient* population. It differs from census residential counts — commuters, transit zones, and commercial areas inflate daytime values.
- **Population redistribution, not growth only**: Year-to-year changes reflect both demographic change and model improvements across LandScan releases.
---
## 📄 Citation
If you use this dataset, please cite the original LandScan source:
```bibtex
@dataset{landscan_global,
author = {Oak Ridge National Laboratory},
title = {LandScan Global Population Database},
year = {2023},
publisher = {Oak Ridge National Laboratory},
url = {https://landscan.ornl.gov},
note = {Annual releases 2000--2023}
}
```
---
## 🔗 Links
- [LandScan Global — ORNL](https://landscan.ornl.gov)
- [LandScan Methodology](https://landscan.ornl.gov/citations)
- [HDF5 Chunking Guide](https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage)
- [h5py ROS3 (remote streaming)](https://docs.h5py.org/en/stable/high/file.html#ros3)
---
## 📜 License
The original LandScan Global data is made available under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
Attribution: *UT-Battelle, LLC, Oak Ridge National Laboratory*.
This HDF5 repackaging does not alter the underlying data.
Provider: Daksh17440



