Loveffort/Capstone-dataset
Source: Hugging Face (updated 2026-03-01, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/Loveffort/Capstone-dataset
Resource description:
---
license: apache-2.0
task_categories:
- geospatial
- tabular
pretty_name: VIC Property Risk Assessment Dataset
size_categories:
- 10G<n<100G
---
## Dataset Summary
This dataset provides property-level risk assessments across the state of Victoria, Australia.
It currently focuses on fire risk, with plans to expand to other hazard types (e.g., flood).
It combines geospatial property boundaries with machine-learning-derived risk scores
and is designed to help identify opportunities for renewable energy projects.
## Data Structure & File Organization
The dataset is organized into modular GeoDataFrame components, each representing a distinct data layer. All layers follow consistent spatial reference standards:
- `X`/`Y` fields (longitude/latitude): WGS84 (EPSG:4326)
- `geometry` fields (Point/MultiPolygon/Polygon): GDA2020 / MGA zone 55 (EPSG:7855)
### 1. Property
To load the shapefile successfully, ensure the following files are in the **same directory**:
- `PROPERTY_VIEW.shp` (geometric data)
- `PROPERTY_VIEW.dbf` (attribute table, **required**)
- `PROPERTY_VIEW.shx` (shape index, **required**)
#### Technical Specifications
- **Original Coordinate Reference System (CRS)**: EPSG:7899 (GDA2020 / VicGrid)
- **File Format**: ESRI Shapefile
| File Name | Format | Description |
|-----------|--------|-------------|
| `PROPERTY_VIEW.shp` | SHP | Geospatial vector data containing property boundary geometries |
| `PROPERTY_VIEW.dbf` | DBF | Attribute table with core property characteristics (ID, location, land use, etc.) |
| `PROPERTY_VIEW.shx` | SHX | Shape index file storing the offset of each geometry record in the `.shp` file |
| `PROPERTY_VIEW.prj` | PRJ | Projection definition (EPSG:7899) |
| `PROPERTY_VIEW.cpg` | CPG | Character encoding specification (UTF-8) |
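
Before reading the shapefile, it can help to confirm that the sidecar files listed above sit next to the `.shp` file. The following is a minimal sketch using only the standard library; the function name `check_shapefile_parts` is hypothetical, and treating `.prj` and `.cpg` as optional is an assumption of this sketch rather than something the dataset states.

```python
from pathlib import Path

# Extensions taken from the file table above. Treating .prj and .cpg as
# optional is an assumption of this sketch.
REQUIRED_EXTS = (".shp", ".dbf", ".shx")
OPTIONAL_EXTS = (".prj", ".cpg")


def check_shapefile_parts(shp_path):
    """Return (missing_required, missing_optional) sidecar file names."""
    shp = Path(shp_path)
    missing_required = [
        shp.with_suffix(ext).name
        for ext in REQUIRED_EXTS
        if not shp.with_suffix(ext).exists()
    ]
    missing_optional = [
        shp.with_suffix(ext).name
        for ext in OPTIONAL_EXTS
        if not shp.with_suffix(ext).exists()
    ]
    return missing_required, missing_optional
```

An empty first list means the required parts are all present; a missing `.prj` would only mean the CRS has to be set by hand after loading.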
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `UFI_CREATED` | Date | Timestamp when the Unique Feature Identifier (UFI) was created |
| `BASE_PFI` | String | Persistent Feature Identifier of the base/parent parcel (for split/merged parcels) |
| `STATUS` | String | Current status of the property feature (e.g., active, retired) |
| `PFI` | String | Primary stable unique ID for the property parcel (core link across datasets) |
| `Z_LEVEL` | String | Z-axis level/elevation classification (linked to PR_Z_LEVEL reference table) |
| `UFI` | Numeric | Unique Feature Identifier (broader spatial feature ID, not parcel-specific) |
| `UFI_OLD` | Numeric | Previous UFI value for historical tracking |
| `TASK_ID` | Numeric | Task ID associated with the last record update |
| `CENTROID_PFI` | String | PFI of the centroid point for this property boundary |
| `GRAPHIC_TYPE` | String | Graphic representation type (linked to PR_GRAPHIC_TYPE reference table) |
| `PFI_CREATED` | Date | Timestamp when the Persistent Feature Identifier (PFI) was created |
---
### 2. Metro Address Points (`metro_gdf`)
This layer contains geocoded address points for metropolitan Victoria, with GNAF (Geocoded National Address File) matching metrics.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `X` | Float | Longitude coordinate (WGS84) |
| `Y` | Float | Latitude coordinate (WGS84) |
| `gnaf_confidence` | Float | Confidence score for GNAF address matching (0–2) |
| `distance_to_gnaf` | Float | Distance (meters) between address point and matched GNAF record |
| `gnaf_lat` | Float | Latitude from matched GNAF record |
| `gnaf_long` | Float | Longitude from matched GNAF record |
| `geometry` | Point | Geospatial point geometry |
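
As a sanity check on `distance_to_gnaf`, the separation between an address point (`X`, `Y`) and its matched GNAF record (`gnaf_long`, `gnaf_lat`) can be recomputed from the WGS84 coordinates. The sketch below uses the haversine great-circle formula. The dataset card does not say how `distance_to_gnaf` was originally computed, so small differences from the stored values are to be expected. The sample record is made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two WGS84 lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical record: the GNAF point lies 0.001 degrees south of the address point.
record = {"X": 144.9631, "Y": -37.8136, "gnaf_long": 144.9631, "gnaf_lat": -37.8146}
recomputed_m = haversine_m(
    record["X"], record["Y"], record["gnaf_long"], record["gnaf_lat"]
)
```

Note the argument order: `X` is longitude and `Y` is latitude, so the longitude is passed first.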
---
### 3. Rural Address Points (`rural_gdf`)
This layer contains geocoded address points for rural Victoria, with GNAF matching metrics.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `X` | Float | Longitude coordinate (WGS84) |
| `Y` | Float | Latitude coordinate (WGS84) |
| `gnaf_confidence` | Float | Confidence score for GNAF address matching (0–2) |
| `distance_to_gnaf` | Float | Distance (meters) between address point and matched GNAF record |
| `gnaf_lat` | Float | Latitude from matched GNAF record |
| `gnaf_long` | Float | Longitude from matched GNAF record |
| `geometry` | Point | Geospatial point geometry |
---
### 4. Bushfire Planning Area (`bushfire_gdf`)
This layer contains bushfire planning areas with associated area metrics for hazard assessment.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `bpa_areaha` | Float | Area of the bushfire planning area (hectares) |
| `geometry` | MultiPolygon | Geospatial polygon geometry |
---
### 5. Historical Fire Events (`fire_history_gdf`)
This layer contains historical bushfire event boundaries with frequency and burn metrics.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `firecount` | Integer | Number of recorded fires in the area |
| `burncount` | Integer | Number of recorded burn events in the area |
| `allcount` | Integer | Total count of fire and burn events |
| `yrsfrburn` | Integer | Years since the last recorded burn |
| `geometry` | MultiPolygon | Geospatial polygon geometry |
---
### 6. Fire Management Zones (`fire_manage_gdf`)
This layer contains fire management zones with zoning type classifications.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `zonetype` | Float | Classification code for fire management zone type (0.0, 1.0, 3.0) |
| `geometry` | MultiPolygon | Geospatial polygon geometry |
---
### 7. Renewable Energy Sites (`renewable_gdf`)
This layer contains renewable energy project locations with geographic coordinates.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `Y` | Float | Latitude coordinate (WGS84) |
| `X` | Float | Longitude coordinate (WGS84) |
| `geometry` | Point | Geospatial point geometry |
---
### 8. Transmission Stations (`transmission_station_gdf`)
This layer contains electrical transmission stations with voltage ratings.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `voltage` | Integer | Voltage rating of the transmission station (kV: 66, 110, 220, 400) |
| `geometry` | Point | Geospatial point geometry |
---
### 9. Native Vegetation (`native_veg_gdf`)
This layer contains native vegetation polygons with ecosystem classification and area metrics.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `evc_bcs` | String | Bioregional Conservation Status code (E, LC, V) |
| `evc_mut` | String | Ecological Vegetation Class (EVC) type (mosaic, EVC) |
| `areasqm` | Float | Area of the vegetation polygon (square meters) |
| `xgroupname` | String | Ecosystem group name (e.g., Dry Forests, Mallee) |
| `geometry` | MultiPolygon | Geospatial polygon geometry |
---
### 10. Station-Property Proximity (`station_property_gdf`)
This layer links property boundaries to nearby transmission stations, with distance metrics.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `PFI` | String | Persistent Feature Identifier – primary key linking to property boundaries |
| `station_id` | Integer | Unique identifier for the transmission station |
| `distance_to_station_km` | Float | Distance from the property to the transmission station (kilometers) |
| `geometry` | Polygon/MultiPolygon | Geospatial polygon geometry |
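
If a property appears in more than one row (one row per property-station pair), the closest station per `PFI` can be picked with a single pass over the records. Whether the layer actually holds several rows per property is an assumption here, and the function name and sample rows are hypothetical.

```python
def nearest_station_per_property(rows):
    """Map each PFI to the (station_id, distance_km) pair with the smallest distance."""
    nearest = {}
    for row in rows:
        pfi = row["PFI"]
        candidate = (row["station_id"], row["distance_to_station_km"])
        # Keep the candidate if this PFI is new or the station is closer.
        if pfi not in nearest or candidate[1] < nearest[pfi][1]:
            nearest[pfi] = candidate
    return nearest


# Made-up rows using the field names from the table above.
rows = [
    {"PFI": "1001", "station_id": 7, "distance_to_station_km": 12.4},
    {"PFI": "1001", "station_id": 3, "distance_to_station_km": 4.9},
    {"PFI": "1002", "station_id": 7, "distance_to_station_km": 30.0},
]
nearest = nearest_station_per_property(rows)
```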
---
### 11. Feature Vectors (`x_vector`)
This layer contains normalized feature vectors used for machine learning model training, indexed by PFI (Persistent Feature Identifier).
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `total_facilities_5km` | Float | Normalized count of facilities within 5km of the property |
| `closest_facility_distance` | Float | Normalized distance to the closest facility (meters) |
| `is_prone` | Integer | Binary indicator (1 = fire-prone, 0 = non-fire-prone) |
| `type0` | Integer | Binary classification for property type category |
| `veg_area` | Float | Normalized area of native vegetation on the property (square meters) |
| `evc_mut_0` | Integer | Binary indicator for Ecological Vegetation Class (EVC) type category |
| `evc_bcs_0` | Integer | Binary indicator for Bioregional Conservation Status category |
| `xgroup_0` | Integer | Binary indicator for ecosystem group category |
| `fire_count` | Float | Normalized count of historical fire events in the area |
| `yrs_since_last_burn` | Float | Normalized years since the last recorded burn event |
| `PFI` | String | Index: Persistent Feature Identifier (primary key) |
---
### 12. Risk Labels (`y_labels`)
This layer contains binary risk labels (high/low) for supervised model training, indexed by PFI.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `is_high_risk` | Integer | Binary indicator (1 = high fire risk, 0 = not high risk) |
| `is_low_risk` | Integer | Binary indicator (1 = low fire risk/renewable potential, 0 = not low risk) |
| `PFI` | String | Index: Persistent Feature Identifier (primary key) |
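
Because `x_vector` and `y_labels` are both indexed by `PFI`, a training set can be assembled by keeping only the identifiers present in both tables. The sketch below works on plain dictionaries keyed by `PFI` to stay independent of any particular file format. The function name and the sample values are hypothetical, and choosing `is_high_risk` as the target is an assumption.

```python
def build_training_pairs(features_by_pfi, labels_by_pfi, label_field="is_high_risk"):
    """Pair feature rows with labels for every PFI present in both tables."""
    shared = sorted(features_by_pfi.keys() & labels_by_pfi.keys())
    X = [features_by_pfi[pfi] for pfi in shared]
    y = [labels_by_pfi[pfi][label_field] for pfi in shared]
    return shared, X, y


# Made-up rows; real feature rows carry all columns listed under x_vector.
features = {
    "1001": {"fire_count": 0.2, "is_prone": 1},
    "1002": {"fire_count": 0.0, "is_prone": 0},
    "1003": {"fire_count": 0.7, "is_prone": 1},
}
labels = {
    "1001": {"is_high_risk": 1, "is_low_risk": 0},
    "1003": {"is_high_risk": 0, "is_low_risk": 1},
}
pfis, X, y = build_training_pairs(features, labels)
```

Properties without a label (here `1002`) are left out of the training pairs.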
---
### 13. Risk Prediction Results (`risk_results`)
This layer contains the final fire risk prediction outputs for all properties (labeled + unlabeled), indexed by PFI.
#### Key Fields
| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `risk_probability` | Float | Predicted probability of high fire risk (0.0–1.0, rounded to 4 decimal places) |
| `risk_score` | Float | Risk probability scaled to 0–100 (rounded to 1 decimal place) |
| `risk_level` | String | Categorical risk level (Low: 0–30, Medium: 30–80, High: 80–100) |
| `PFI` | String | Index: Persistent Feature Identifier (primary key) |
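
The relationship between the three output fields can be sketched as follows. The table gives overlapping band edges (30 and 80 each appear in two bands), so this sketch assumes the lower bound of each band is inclusive: a score of exactly 30 is treated as Medium and exactly 80 as High. The function names are hypothetical, and the dataset's own boundary handling may differ.

```python
def risk_score_from_probability(probability):
    """Scale a probability in [0, 1] to a 0-100 score with one decimal place."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return round(probability * 100, 1)


def risk_level_from_score(score):
    """Map a 0-100 score to a level; band edges are an assumption (see above)."""
    if score < 30:
        return "Low"
    if score < 80:
        return "Medium"
    return "High"


score = risk_score_from_probability(0.8123)
level = risk_level_from_score(score)
```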
## Usage Guide
### 1. Basic Setup
```python
import inspect
import os

import geopandas as gpd  # used by HFDataManager.load_geo_data
from huggingface_hub import HfApi, hf_hub_download

# Core configuration
HF_REPO_ID = "Loveffort/Capstone-dataset"
HF_TOKEN = os.getenv("HF_TOKEN")  # required by HFDataManager.save_gdf
IS_COLAB = "COLAB_GPU" in os.environ
WORK_DIR = "/content/victoria_fire_risk_data" if IS_COLAB else "./victoria_fire_risk_data"
os.makedirs(WORK_DIR, exist_ok=True)
```
### 2. Initialize HFDataManager
```python
class HFDataManager:
    """Download layers from, and upload layers to, the Hugging Face dataset repo."""

    def __init__(self, repo_id, token, work_dir):
        self.repo_id = repo_id
        self.token = token
        self.work_dir = work_dir
        # Uploads need an authenticated client.
        self.api = HfApi(token=token) if token else None

    def _download(self, filename):
        """Fetch one file from the dataset repo into the local cache."""
        return hf_hub_download(
            repo_id=self.repo_id,
            repo_type="dataset",
            filename=filename,
            token=self.token,
            cache_dir=self.work_dir,
        )

    def load_geo_data(self, hf_file_path):
        base, ext = os.path.splitext(hf_file_path)
        ext = ext.lower()
        if ext == ".shp":
            # A shapefile is only readable with its sidecar files next to it.
            local_files = {}
            for part in [".shp", ".shx", ".dbf", ".prj", ".cpg"]:
                filename = f"{base}{part}"
                try:
                    local_files[part] = self._download(filename)
                except Exception as e:
                    print(f"Warning: {filename} not found ({e})")
            shp_path = local_files.get(".shp")
            if not shp_path:
                raise FileNotFoundError("Shapefile .shp not found in repo")
            return gpd.read_file(shp_path)

        local_file = self._download(hf_file_path)
        if ext == ".parquet":
            return gpd.read_parquet(local_file)
        if ext in (".geojson", ".csv"):
            # The driver is inferred from the file extension.
            return gpd.read_file(local_file)
        raise ValueError(f"Unsupported file format: {ext}")

    def save_gdf(self, gdf, format="parquet"):
        if not self.api:
            raise ValueError("HF Token not configured!")
        # Name the file after the caller's variable that holds `gdf`.
        caller_locals = inspect.currentframe().f_back.f_locals
        names = [k for k, v in caller_locals.items() if v is gdf]
        if not names:
            raise ValueError("Pass a GeoDataFrame that is bound to a variable name")
        file_fullname = f"{names[0]}.{format}"
        local_path = os.path.join(self.work_dir, file_fullname)
        if format == "parquet":
            gdf.to_parquet(local_path)
        elif format == "csv":
            gdf.to_csv(local_path, index=True)
        elif format == "geojson":
            gdf.to_file(local_path, driver="GeoJSON")
        else:
            raise ValueError("Unsupported file format")
        self.api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=file_fullname,
            repo_id=self.repo_id,
            repo_type="dataset",
            commit_message=f"Save {file_fullname}",
        )
        print(f"Uploaded to HF Hub: {self.repo_id}/{file_fullname}")


hf_manager = HFDataManager(HF_REPO_ID, HF_TOKEN, WORK_DIR)
```
### 3. Examples
```python
metro_gdf = hf_manager.load_geo_data("metro_gdf.parquet")
hf_manager.save_gdf(metro_gdf, format="parquet")  # parquet is the default format
```
Provider: Loveffort



