Daksh17440/global_population_data

Hugging Face | Updated 2026-04-12 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Daksh17440/global_population_data
Resource description:
---
license: apache-2.0
language:
- en
pretty_name: LandScan Population Raster
tags:
- population
- landscan
- timeseries
- geospatial
- hdf5
- global population
size_categories:
- 100B<n<1T
task_categories:
- feature-extraction
---

# 🌍 LandScan Global Population Dataset: `pop_2000-23.h5`

[![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Source](https://img.shields.io/badge/Source-Oak%20Ridge%20National%20Laboratory-blue)](https://landscan.ornl.gov/)
[![Resolution](https://img.shields.io/badge/Resolution-1%20km-green)]()
[![Years](https://img.shields.io/badge/Years-2000--2023-orange)]()
[![Format](https://img.shields.io/badge/Format-HDF5-red)]()

A production-ready, chunked HDF5 tensor of **LandScan Global** annual population estimates from **2000 to 2023**, packaged for efficient use in deep learning, geospatial analysis, and HPC workflows. No more downloading 24 separate GeoTIFFs.

---

## 📦 Dataset at a Glance

| Property | Value |
|---|---|
| **Source** | LandScan Global, Oak Ridge National Laboratory (ORNL) |
| **Years covered** | 2000 – 2023 (24 time steps) |
| **Spatial resolution** | ~1 km (30 arc-seconds) |
| **Spatial extent** | Global (180°W–180°E, 90°S–90°N) |
| **Master grid** | 21,600 rows × 43,200 columns |
| **CRS** | WGS84 / EPSG:4326 |
| **Unit** | Ambient population count per pixel |
| **Data type** | float32 |
| **Format** | HDF5 (chunked + GZIP compressed) |
| **Chunk shape** | (1, 256, 256), ordered time × lat × lon |

> **What is LandScan?**
> LandScan represents ambient population (the average number of people present in a location over 24 hours) rather than residential census counts. It integrates census data, land cover, roads, slope, and remote sensing to model where people actually are, not just where they live.
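
The figures in the table imply a fairly large array. As a rough sketch, using only the shape, data type, and chunk shape listed above (the variable names are illustrative, not part of the dataset), the uncompressed size and chunk count can be worked out with plain arithmetic:

```python
import math

# Values taken from the "Dataset at a Glance" table
N_YEARS, N_ROWS, N_COLS = 24, 21_600, 43_200
BYTES_PER_VALUE = 4                 # float32
CHUNK = (1, 256, 256)               # time x lat x lon

pixels_per_year = N_ROWS * N_COLS
bytes_per_year = pixels_per_year * BYTES_PER_VALUE
bytes_total = bytes_per_year * N_YEARS

# Edge chunks are partial, so round up along each spatial axis
chunks_per_year = math.ceil(N_ROWS / CHUNK[1]) * math.ceil(N_COLS / CHUNK[2])
chunk_bytes = CHUNK[0] * CHUNK[1] * CHUNK[2] * BYTES_PER_VALUE

print(f"one year, uncompressed : {bytes_per_year / 1e9:.2f} GB")
print(f"all years, uncompressed: {bytes_total / 1e9:.2f} GB")
print(f"chunks per year        : {chunks_per_year}")
print(f"one chunk, uncompressed: {chunk_bytes // 1024} KiB")
```

These are uncompressed sizes. GZIP compression makes the file on disk smaller, by a ratio that depends on the data and is not stated here.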
---

## 🗂️ File Structure

```
pop_2000-23.h5
│
├── /population        float32  (24, 21600, 43200)   ← main data tensor
│       dim[0] → time   24 annual steps (2000–2023)
│       dim[1] → lat    21,600 latitude rows  (90°N → 90°S)
│       dim[2] → lon    43,200 longitude cols (180°W → 180°E)
│
├── /coords
│   ├── years          int32    (24,)     [2000, 2001, …, 2023]
│   ├── lat            float64  (21600,)  centre latitude of each row (°N)
│   └── lon            float64  (43200,)  centre longitude of each col (°E)
│
├── /native_extent
│   ├── years          int32    (24,)     year index
│   ├── n_rows         int32    (24,)     native row count per year
│   └── n_cols         int32    (24,)     native col count per year
│
└── /stats
    ├── mean           float32  (24,)     mean over inhabited pixels
    ├── std            float32  (24,)     std over inhabited pixels
    ├── max            float32  (24,)     max pixel value per year
    ├── total_pop      float32  (24,)     global population sum per year
    └── nan_fraction   float32  (24,)     fraction of NaN pixels per year
```

### NaN semantics

There are two distinct sources of `NaN` in this dataset:

| NaN type | Meaning |
|---|---|
| Within native extent, flagged as nodata | Ocean, permanent ice, or uninhabited area |
| Beyond native extent | That LandScan release had a smaller grid (2001–2012 were 20,880 rows); data simply did not exist |

The `/native_extent` group tells you exactly how many rows and columns contained real data for each year, so your code can mask accordingly.

---

## ⚡ Quickstart: Partial Reads (No Full Download Needed)

Hugging Face supports [HTTP range requests](https://huggingface.co/docs/hub/datasets-adding#large-files) on `.h5` files. The HDF5 chunked layout `(1, 256, 256)` means **only the chunks you touch are transferred over the network**, so you never need to download the full file.
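
To get a feel for what "only the chunks you touch" means in practice, the sketch below counts how many `(1, 256, 256)` chunks a rectangular read overlaps. It is plain arithmetic on index ranges, not a measurement: the helper names and the example index ranges are hypothetical, and the byte figures are uncompressed, so actual transfers should be smaller.

```python
def chunks_along(start, stop, chunk_len):
    """Number of chunks overlapped by the half-open index range [start, stop)."""
    if stop <= start:
        return 0
    return (stop - 1) // chunk_len - start // chunk_len + 1


def chunks_touched(t, rows, cols, chunk=(1, 256, 256)):
    """Chunks overlapped by a hyperslab given as three (start, stop) pairs."""
    return (chunks_along(*t, chunk[0])
            * chunks_along(*rows, chunk[1])
            * chunks_along(*cols, chunk[2]))


CHUNK_BYTES = 1 * 256 * 256 * 4  # one uncompressed float32 chunk

# Hypothetical reads; the row/column ranges are illustrative indices only
reads = {
    "one pixel, all 24 years": ((0, 24), (7366, 7367), (30865, 30866)),
    "3600 x 3600 window, one year": ((23, 24), (6600, 10200), (29400, 33000)),
    "full grid, one year": ((20, 21), (0, 21600), (0, 43200)),
}
for label, (t, rows, cols) in reads.items():
    n = chunks_touched(t, rows, cols)
    print(f"{label}: {n} chunks, {n * CHUNK_BYTES / 1e6:.1f} MB before compression")
```

Because the time dimension of each chunk has length 1, a time series at a single pixel still touches one chunk per year, while a spatial window within one year touches only the chunks it overlaps.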
### Install dependencies

```bash
pip install h5py numpy fsspec huggingface_hub
```

### Open the file remotely

```python
import h5py
import numpy as np
from huggingface_hub import hf_hub_url

# Stream directly from Hugging Face: no full download
url = hf_hub_url(
    repo_id="Daksh17440/global_population_data",
    filename="pop_2000-23.h5",
    repo_type="dataset",
)

# ROS3 driver = HTTP range-request backend for HDF5
f = h5py.File(url, "r", driver="ros3")

pop = f["population"]        # shape (24, 21600, 43200), not yet loaded
lat = f["coords/lat"][:]
lon = f["coords/lon"][:]
yrs = f["coords/years"][:]   # [2000, 2001, …, 2023]
```

> **Tip:** The ROS3 driver is available only when h5py is linked against an HDF5 build that includes it. The conda-forge package (`conda install -c conda-forge h5py`) typically does; pip wheels may not, so check `h5py.get_config().ros3` before relying on it.

---

## 🔍 Usage Examples

### 1. Read a single year

```python
# Year 2020 is at index 20 (2020 - 2000 = 20)
pop_2020 = f["population"][20, :, :]   # shape (21600, 43200)
# Needs ~3.7 GB of RAM (21600 × 43200 × 4 bytes); every chunk of that year is read
```

### 2. Look up the index for any year

```python
years = f["coords/years"][:]

def year_idx(y):
    """Return the position of year `y` along the time axis."""
    idx = np.where(years == y)[0]
    if len(idx) == 0:
        raise ValueError(f"Year {y} not in dataset")
    return int(idx[0])

pop_2015 = f["population"][year_idx(2015), :, :]
```

### 3. Spatial crop: bounding box query

```python
lat = f["coords/lat"][:]
lon = f["coords/lon"][:]

def bbox_slice(lat_min, lat_max, lon_min, lon_max):
    """Return numpy index slices for a lat/lon bounding box."""
    row = np.where((lat >= lat_min) & (lat <= lat_max))[0]
    col = np.where((lon >= lon_min) & (lon <= lon_max))[0]
    return slice(row[0], row[-1] + 1), slice(col[0], col[-1] + 1)

# South Asia: 5–35°N, 65–95°E
rs, cs = bbox_slice(5, 35, 65, 95)

# Single year crop: minimal network transfer
south_asia_2023 = f["population"][23, rs, cs]

# All years crop: full time series for the region
south_asia_all = f["population"][:, rs, cs]   # shape (24, ~3600, ~3600)
```

### 4. Time-range + spatial crop together

```python
# India, 2010–2020
years = f["coords/years"][:]
t_mask = np.where((years >= 2010) & (years <= 2020))[0]
rs, cs = bbox_slice(8, 37, 68, 97)

india_decade = f["population"][t_mask[0]:t_mask[-1] + 1, rs, cs]
# shape: (11, H_india, W_india)
```

### 5. Country / region centroids: point time series

```python
# Population at a single point over all years (full time series)
# New Delhi: 28.6°N, 77.2°E
lat_idx = int(np.argmin(np.abs(lat - 28.6)))
lon_idx = int(np.argmin(np.abs(lon - 77.2)))

delhi_series = f["population"][:, lat_idx, lon_idx]   # shape (24,)
# Fast: touches only 24 chunks, one per year
```

### 6. Global population trend (no pixel reads needed)

```python
# Pre-computed: no pixel data transferred
total_pop = f["stats/total_pop"][:]
years = f["coords/years"][:]

for yr, pop in zip(years, total_pop):
    print(f"  {yr}: {pop / 1e9:.3f} billion")
```

### 7. Use with xarray (NetCDF-style labelled arrays)

```python
import xarray as xr
import h5py
import numpy as np

with h5py.File("pop_2000-23.h5", "r") as f:
    # Load a region into an xarray DataArray with named coords
    # (bbox_slice from example 3 uses the lat/lon arrays loaded earlier)
    rs, cs = bbox_slice(5, 35, 65, 95)
    data = f["population"][:, rs, cs]
    years = f["coords/years"][:]
    lats = f["coords/lat"][rs]
    lons = f["coords/lon"][cs]

da = xr.DataArray(
    data,
    dims=["time", "lat", "lon"],
    coords={"time": years, "lat": lats, "lon": lons},
    name="population",
    attrs={"units": "persons per pixel", "source": "LandScan Global"},
)

# Now use xarray operations
annual_mean = da.mean(dim=["lat", "lon"])
trend = da.sel(time=slice(2010, 2020))
```

### 8. PyTorch: lazy streaming Dataset

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class LandScanDataset(Dataset):
    """
    Streams spatial patches on demand. Never loads the full tensor into RAM.

    Parameters
    ----------
    h5_path    : local path or remote URL
    year_range : (start_year, end_year) inclusive, e.g. (2010, 2020)
    bbox       : (lat_min, lat_max, lon_min, lon_max) or None for global
    patch_size : spatial size of each returned patch (pixels)
    stride     : step between patch centres
    driver     : h5py driver, e.g. "ros3" for a remote URL (None for local files)
    """

    def __init__(self, h5_path, year_range=(2000, 2023),
                 bbox=None, patch_size=256, stride=128, driver=None):
        self.h5_path = h5_path
        self.driver = driver
        self.f = None  # opened lazily, once per DataLoader worker

        # Read only the small coordinate arrays here, then close the file
        with h5py.File(h5_path, "r", driver=driver) as f:
            shape = f["population"].shape
            self.years = f["coords/years"][:]
            lat = f["coords/lat"][:]
            lon = f["coords/lon"][:]

        # Time axis
        t_mask = np.where((self.years >= year_range[0]) &
                          (self.years <= year_range[1]))[0]
        self.t0, self.t1 = int(t_mask[0]), int(t_mask[-1]) + 1
        self.T = self.t1 - self.t0

        # Spatial axis
        if bbox is not None:
            lat_m = np.where((lat >= bbox[0]) & (lat <= bbox[1]))[0]
            lon_m = np.where((lon >= bbox[2]) & (lon <= bbox[3]))[0]
            self.r0, self.r1 = int(lat_m[0]), int(lat_m[-1]) + 1
            self.c0, self.c1 = int(lon_m[0]), int(lon_m[-1]) + 1
        else:
            self.r0, self.r1 = 0, shape[1]
            self.c0, self.c1 = 0, shape[2]

        H, W = self.r1 - self.r0, self.c1 - self.c0
        self.ps = patch_size

        # All valid patch top-left corners (+1 keeps the last patch that fits)
        self.patches = [
            (r, c)
            for r in range(0, H - patch_size + 1, stride)
            for c in range(0, W - patch_size + 1, stride)
        ]

    def __len__(self):
        return len(self.patches) * self.T

    def __getitem__(self, idx):
        if self.f is None:
            # An open h5py handle should not be shared across worker processes
            self.f = h5py.File(self.h5_path, "r", driver=self.driver)

        t_rel = idx % self.T
        p_idx = idx // self.T
        r, c = self.patches[p_idx]

        t = self.t0 + t_rel
        r_abs, c_abs = self.r0 + r, self.c0 + c

        patch = self.f["population"][t, r_abs:r_abs + self.ps, c_abs:c_abs + self.ps]
        patch = patch.astype(np.float32)

        # Replace NaN with 0 for model input and return the mask alongside
        nan_mask = np.isnan(patch)
        patch = np.nan_to_num(patch, nan=0.0)

        return {
            "population": torch.from_numpy(patch[None]),     # (1, ps, ps)
            "nan_mask": torch.from_numpy(nan_mask[None]),    # (1, ps, ps)
            "year": torch.tensor(int(self.years[t])),        # calendar year
        }

    def __del__(self):
        if getattr(self, "f", None) is not None:
            self.f.close()


# Example usage
ds = LandScanDataset(
    h5_path="pop_2000-23.h5",
    year_range=(2015, 2023),
    bbox=(5, 35, 65, 95),   # South Asia
    patch_size=256,
    stride=128,
)

loader = DataLoader(ds, batch_size=8, shuffle=True, num_workers=4)

for batch in loader:
    x = batch["population"]   # (8, 1, 256, 256)
    mask = batch["nan_mask"]  # (8, 1, 256, 256)
    yr = batch["year"]
    break
```

### 9. Normalize using pre-computed stats

```python
with h5py.File("pop_2000-23.h5", "r") as f:
    means = f["stats/mean"][:]    # (24,), one value per year
    stds = f["stats/std"][:]      # (24,)
    years = f["coords/years"][:]

    # Normalize a patch for year 2018
    t = int(np.where(years == 2018)[0][0])
    pop_2018_patch = f["population"][t, 5000:5256, 8000:8256]
    normalized = (pop_2018_patch - means[t]) / (stds[t] + 1e-8)
```

### 10. HPC / MPI parallel reads

```python
# h5py supports MPI-IO for multi-node HPC jobs
# (requires h5py built against a parallel HDF5 library)
# Launch with: mpirun -n 8 python script.py

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

with h5py.File("pop_2000-23.h5", "r", driver="mpio", comm=comm) as f:
    T = f["population"].shape[0]
    my_years = np.array_split(np.arange(T), size)[rank]

    for t in my_years:
        slab = f["population"][t, :, :]
        # Each rank processes its own years independently;
        # accumulate in float64 so the sum does not lose precision
        result = np.nansum(slab, dtype=np.float64)
        print(f"  rank={rank}  t={t}  total={result / 1e9:.3f}B")
```

---

## 🗺️ Native Extent per Year

Years 2001–2012 have fewer rows than the master grid because older LandScan releases used a slightly cropped polar extent. Pixels beyond the native extent are `NaN`.

```python
with h5py.File("pop_2000-23.h5", "r") as f:
    ext_years = f["native_extent/years"][:]
    n_rows = f["native_extent/n_rows"][:]
    n_cols = f["native_extent/n_cols"][:]

for yr, h, w in zip(ext_years, n_rows, n_cols):
    flag = " ← cropped" if h < 21600 else ""
    print(f"  {yr}: {h} × {w}{flag}")
```

Expected output:

```
  2000: 21600 × 43200
  2001: 20880 × 43200 ← cropped
  ...
  2012: 20880 × 43200 ← cropped
  2013: 21600 × 43200
  ...
  2023: 21600 × 43200
```

---

## 📐 Coordinate Reference

```
Top-left pixel centre    : 89.9917°N, 179.9917°W
Bottom-right pixel centre: 89.9917°S, 179.9917°E
Pixel size               : 0.008333° (30 arc-seconds; ~0.93 km at the equator, nominally 1 km)
```

```python
# Convert lat/lon to row/col index
def latlon_to_idx(lat_val, lon_val, lat_arr, lon_arr):
    """Return the (row, col) of the pixel whose centre is nearest the point."""
    row = int(np.argmin(np.abs(lat_arr - lat_val)))
    col = int(np.argmin(np.abs(lon_arr - lon_val)))
    return row, col
```

---

## ⚠️ Known Issues & Limitations

- **2001–2012 polar crop**: 720 of the 21,600 rows (6° of latitude at 120 rows per degree) fall outside the native extent, at the polar edge of the grid. These are ocean/ice, so the NaN fill has no impact on population analysis.
- **NaN ≠ zero population**: Do not fill NaN with 0 indiscriminately. Ocean pixels and missing-extent pixels are both NaN but have different meanings. Use `/native_extent` to distinguish them if needed.
- **Ambient vs residential**: LandScan is *ambient* population. It differs from census residential counts: commuters, transit zones, and commercial areas inflate daytime values.
- **Population redistribution, not growth only**: Year-to-year changes reflect both demographic change and model improvements across LandScan releases.
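
The coordinate reference and the polar-crop note both come down to simple grid arithmetic. The sketch below is one way to reproduce it, assuming an edge-aligned grid that spans exactly 90°N–90°S and 180°W–180°E at 120 pixels per degree; the function names are illustrative, not part of the dataset. Under that assumption the first row centre works out to about 89.9958°N rather than the 89.9917°N listed in the coordinate reference, so treat the stored `/coords/lat` and `/coords/lon` arrays as authoritative and use this only as a cross-check.

```python
import math

PIXELS_PER_DEGREE = 120          # 30 arc-seconds per pixel
N_ROWS, N_COLS = 21_600, 43_200  # master grid


def row_centre_lat(row):
    """Centre latitude (°N) of a row, assuming row 0 starts at exactly 90°N."""
    return 90.0 - (row + 0.5) / PIXELS_PER_DEGREE


def col_centre_lon(col):
    """Centre longitude (°E) of a column, assuming column 0 starts at 180°W."""
    return -180.0 + (col + 0.5) / PIXELS_PER_DEGREE


def latlon_to_rowcol(lat_val, lon_val):
    """Row/column of the pixel containing a point, clamped to the grid."""
    row = math.floor((90.0 - lat_val) * PIXELS_PER_DEGREE)
    col = math.floor((lon_val + 180.0) * PIXELS_PER_DEGREE)
    return min(max(row, 0), N_ROWS - 1), min(max(col, 0), N_COLS - 1)


# Rows absent from the 2001–2012 releases, expressed in degrees of latitude
missing_rows = N_ROWS - 20_880
missing_degrees = missing_rows / PIXELS_PER_DEGREE

print(row_centre_lat(0), col_centre_lon(0))
print(latlon_to_rowcol(28.61, 77.21))
print(missing_rows, missing_degrees)
```

The closed-form lookup avoids scanning the coordinate arrays, but it is only as good as the grid assumption; the `argmin`-based `latlon_to_idx` works with whatever coordinates the file actually stores.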
---

## 📄 Citation

If you use this dataset, please cite the original LandScan source:

```bibtex
@dataset{landscan_global,
  author    = {Oak Ridge National Laboratory},
  title     = {LandScan Global Population Database},
  year      = {2023},
  publisher = {Oak Ridge National Laboratory},
  url       = {https://landscan.ornl.gov},
  note      = {Annual releases 2000--2023}
}
```

---

## 🔗 Links

- [LandScan Global (ORNL)](https://landscan.ornl.gov)
- [LandScan Methodology](https://landscan.ornl.gov/citations)
- [HDF5 Chunking Guide](https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage)
- [h5py ROS3 (remote streaming)](https://docs.h5py.org/en/stable/high/file.html#ros3)

---

## 📜 License

The original LandScan Global data is made available under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). Attribution: *UT-Battelle, LLC, Oak Ridge National Laboratory*. This HDF5 repackaging does not alter the underlying data.
Provider:
Daksh17440