OpenSynth/tud-glass
Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/OpenSynth/tud-glass
---
license: cc-by-4.0
tags:
- energy
- smartmeter
- synthetic
pretty_name: GLASS
---
# GLASS — Generative-AI Large-customer Smart-meter Samples
## Overview
This dataset contains synthetic daily load profiles for large-customer smart meters, produced by a flow-matching generative model. Each row represents a single day of a single synthetic sample: 96 power values at 15-minute resolution (in kW), conditioned on customer category, generation type, scenario, and consumption/generation level.
This dataset is published as part of [OpenSynth](https://lfenergy.org/projects/opensynth/), an [LF Energy](https://lfenergy.org/) project that democratizes synthetic energy data to accelerate the decarbonization of global energy systems. The dataset is hosted on [HuggingFace](https://huggingface.co/OpenSynth) and [SURF Data Repository](https://repository.surfsara.nl/) and generated using the [SmartMeterFM](https://github.com/sentient-codebot/SmartMeterFM) model trained on [Alliander N.V.](https://www.alliander.com/) smart meter data.
- **File:** `GLASS_v1.0.parquet`
- **Rows:** 43,184,437
- **Columns:** 104 (8 metadata + 96 power values)
### Quick start
```python
import matplotlib.pyplot as plt
import polars as pl

# Scan lazily so the filter can be pushed down to the file;
# the full table (~43M rows) need not be held in memory.
lf = pl.scan_parquet("GLASS_v1.0.parquet")

# Pick one day of one sample:
# consumption_only, category 001, 2023-01-15, level 0, sample 0
sample = lf.filter(
    (pl.col("scenario") == "consumption_only")
    & (pl.col("category") == "001")
    & (pl.col("date") == "2023-01-15")
    & (pl.col("level_id") == 0)
    & (pl.col("sample_id") == 0)
).collect()

# Keep only the 96 power columns by excluding the 8 metadata columns
metadata_cols = ["scenario", "generation_type", "category", "date",
                 "level_name", "level_id", "level_value", "sample_id"]
power = sample.select(pl.exclude(metadata_cols)).row(0)

# Plot the 96 power values for that day
plt.plot(power)
plt.xlabel("15-min interval")
plt.ylabel("Power (kW)")
plt.title("Daily load profile")
plt.show()
```
## Schema
### Metadata columns
| Column | Type | Description |
|---|---|---|
| `scenario` | String | One of 4 scenarios (see below) |
| `generation_type` | String | `none`, `solar`, or `wind_on_land` |
| `category` | String | Customer category. 26 distinct values. |
| `date` | String | Calendar date (YYYY-MM-DD), ranging 2022-01-01 to 2024-12-31 |
| `level_name` | String | Human-readable level: `low`, `mid_low`, `mid_high`, `high` |
| `level_id` | Int64 | Level index: 0 = low, 1 = mid_low, 2 = mid_high, 3 = high |
| `level_value` | Float64 | Target total energy for the month (kWh/month) |
| `sample_id` | Int64 | Index of the sample within its condition group, starting from 0, with at least 100 per condition |
### Power columns (96)
Columns `00:00:00` through `23:45:00` (Float32) — average power in kW for each 15-minute interval of the day.
- **Consumption** scenarios: values are non-negative (power drawn from grid)
- **Generation** scenarios: values are non-positive (power exported to grid)
- **Hybrid** scenario: values can be positive or negative
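Because each power column holds an average power over a 15-minute interval, the energy for an interval is that power multiplied by 0.25 h. The sketch below shows one way to generate the 96 column names and convert a day of power values to kWh. The helper names `power_column_names` and `daily_energy_kwh` are illustrative and are not part of the dataset or of any library.

```python
from datetime import datetime, timedelta


def power_column_names(step_minutes: int = 15) -> list[str]:
    """Return the interval labels '00:00:00' ... '23:45:00' (96 for 15-min steps)."""
    start = datetime(2000, 1, 1)  # arbitrary date; only the time part is used
    n_steps = 24 * 60 // step_minutes
    return [
        (start + timedelta(minutes=i * step_minutes)).strftime("%H:%M:%S")
        for i in range(n_steps)
    ]


def daily_energy_kwh(power_kw, step_minutes: int = 15) -> float:
    """Sum of average power (kW) per interval times the interval length (h)."""
    return sum(power_kw) * step_minutes / 60


cols = power_column_names()
flat_day = [4.0] * len(cols)  # toy profile: constant 4 kW all day
energy = daily_energy_kwh(flat_day)
```

With a polars frame, `cols` could be passed to `df.select(cols)` to pick the power columns without listing the metadata columns.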
## Samples and conditions
### Condition structure
Each sample is generated from a specific **condition**: a combination of scenario, category, year, month, and level. The dataset covers a controlled product of these dimensions.
### Scenarios
| Scenario | Generation type | Sign convention | Description |
|---|---|---|---|
| `consumption_only` | `none` | non-negative | Pure consumption, no local generation |
| `generation_zon` | `solar` | non-positive | Solar PV generation only |
| `generation_wopl` | `wind_on_land` | non-positive | Onshore wind generation only |
| `hybrid_net_consumption` | `solar` | mixed | Net consumption (consumption minus generation) |
### Customer Categories (26 categories)
Consumption Categories: E3D, E3A, E3B, E3C, 001-020
Generation Categories: PV, WIND
Which profiles appear in each scenario:
- `consumption_only`: 24 profiles (all except PV and WIND)
- `generation_zon`: 25 profiles (all except WIND)
- `generation_wopl`: 1 profile (WIND only)
- `hybrid_net_consumption`: 25 profiles (all except WIND)
The consumption categories are derived by linking each grid connection to the Dutch Chamber of Commerce registration of the companies registered at its address. Only each company's main segment is used, which largely coincides with its main ISIC section. If no branch or more than one branch is found, the category falls back to the [NEDU/EDSN load profile categories](https://energiedatawijzer.nl/wp-content/uploads/Documenten/Topics_MFF/IC063-Profielcategorisering-E-aansluitingen-v1.0.pdf), which depend on the utility time: the amount of time during which a customer is active or has a relatively high load. The generation categories are determined by the registered type of generation.
The meaning of the consumption categories is as follows:
| Category | Description | ISIC category |
|---|---|---|
| 001 | Standard Branches | ? |
| 002 | Agriculture, forestry and fishing | A |
| 003 | Mining and quarrying | B |
| 004 | Manufacturing | C |
| 005 | Electricity, gas, steam and air conditioning supply | D |
| 006 | Water supply; sewerage, waste management and remediation activities | E |
| 007 | Construction | F |
| 008 | Wholesale and retail trade; repair of motor vehicles and motorcycles | G |
| 009 | Transportation and storage | H |
| 010 | Accommodation and food service activities | I |
| 011 | Information and communication | J |
| 012 | Financial and insurance activities | K |
| 013 | Real estate activities | L |
| 014 | Professional, scientific and technical activities | M |
| 015 | Administrative and support service activities, renting and leasing of tangible goods | N |
| 016 | Public administration and defence; compulsory social security | O |
| 017 | Education | P |
| 018 | Human health and social work activities | Q |
| 019 | Arts, entertainment and recreation | R |
| 020 | Other service activities | S |
| E3A | Utility time ≤ 2000 hours | n/a |
| E3B | Utility time > 2000 hours and ≤ 3000 hours | n/a |
| E3C | Utility time > 3000 hours and ≤ 5000 hours | n/a |
| E3D | Utility time > 5000 hours | n/a |
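The utility-time thresholds for the fallback categories can be expressed as a small lookup. This is only an illustrative sketch: `e3_category` is a hypothetical helper, and assigning exactly 5000 hours to E3C is an assumption made here to keep the bands non-overlapping.

```python
def e3_category(utility_hours: float) -> str:
    """Map a utility time in hours to a fallback load profile category."""
    if utility_hours <= 2000:
        return "E3A"
    if utility_hours <= 3000:
        return "E3B"
    if utility_hours <= 5000:  # assumption: the 5000-hour boundary belongs to E3C
        return "E3C"
    return "E3D"


examples = {hours: e3_category(hours) for hours in (1500, 2500, 4000, 6000)}
```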
### Years and months
3 years (2022-2024) x 12 months, with one exception: wind generation data (`generation_wopl`) is limited to winter months (October-February) — see notes below.
### Levels
4 levels per month (`low`, `mid_low`, `mid_high`, `high`), with target values (`level_value`) that vary by month. For example, solar generation levels are higher in summer. The `level_value` represents the net consumption (consumption minus generation) for the `hybrid_net_consumption` scenario.
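To make the condition grid concrete, the sketch below enumerates the year-month pairs per scenario and computes an upper bound on the number of conditions from the category counts listed above. It assumes that wind conditions exist for every October to February month that falls within 2022-2024, which the card does not state explicitly. The result is only an upper bound, because conditions with too few valid samples are discarded. All names are illustrative.

```python
from itertools import product

# Category counts per scenario, as listed in this card
CATEGORY_COUNTS = {
    "consumption_only": 24,
    "generation_zon": 25,
    "generation_wopl": 1,
    "hybrid_net_consumption": 25,
}
YEARS = (2022, 2023, 2024)
LEVELS = ("low", "mid_low", "mid_high", "high")
WIND_MONTHS = {10, 11, 12, 1, 2}  # October-February


def months_for(scenario: str) -> list[tuple[int, int]]:
    """Return the (year, month) pairs a scenario may cover."""
    months = range(1, 13)
    if scenario == "generation_wopl":
        months = [m for m in months if m in WIND_MONTHS]
    return list(product(YEARS, months))


def max_conditions(scenario: str) -> int:
    """Upper bound: categories x year-months x levels, before any discarding."""
    return CATEGORY_COUNTS[scenario] * len(months_for(scenario)) * len(LEVELS)


upper_bounds = {s: max_conditions(s) for s in CATEGORY_COUNTS}
```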
### Samples per condition
Each condition has 100-150 valid samples. Conditions are generated with 1.5x oversampling (150 candidates); after quality filtering, conditions with fewer than 100 valid samples are discarded. Survivors retain all valid samples.
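The oversample-then-filter rule described above can be sketched as follows. The quality criteria themselves are not described in this card, so `is_valid` is a hypothetical placeholder predicate, and the toy candidates are plain integers rather than generated profiles.

```python
def filter_condition(candidates, is_valid, min_valid: int = 100):
    """Keep all valid candidates, or return None if the condition is discarded."""
    valid = [c for c in candidates if is_valid(c)]
    return valid if len(valid) >= min_valid else None


# Toy illustration: 150 candidates per condition (1.5x oversampling of 100)
candidates = list(range(150))

# Placeholder filter that rejects every third candidate: 100 remain, condition kept
kept = filter_condition(candidates, lambda c: c % 3 != 0)

# Placeholder filter that rejects every second candidate: 75 remain, condition dropped
dropped = filter_condition(candidates, lambda c: c % 2 == 0)
```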
### How rows relate to samples
Rows with the same `scenario`, `category`, year-month (from `date`), `level_id`, and `sample_id` belong to the **same sample** — a month-long profile decomposed into daily rows. These rows are consecutive with contiguous dates covering a full calendar month.
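As a rough sketch of this grouping, the helpers below reassemble month-long samples from daily rows and check that a sample covers its calendar month. Rows are assumed to be plain dictionaries keyed by column name (for example, as produced by polars `DataFrame.to_dicts()`); the function names are illustrative.

```python
import calendar
from collections import defaultdict
from datetime import date


def group_into_samples(rows):
    """Group daily rows into samples keyed by condition and sample_id."""
    samples = defaultdict(list)
    for row in rows:
        key = (row["scenario"], row["category"], row["date"][:7],
               row["level_id"], row["sample_id"])
        samples[key].append(row)
    for days in samples.values():
        days.sort(key=lambda r: r["date"])  # ISO dates sort chronologically
    return dict(samples)


def covers_full_month(days) -> bool:
    """True if sorted daily rows cover each day of their month exactly once."""
    first = date.fromisoformat(days[0]["date"])
    n_days = calendar.monthrange(first.year, first.month)[1]
    expected = [date(first.year, first.month, d).isoformat()
                for d in range(1, n_days + 1)]
    return [r["date"] for r in days] == expected


# Toy rows: one complete February sample (in reverse order) and one partial sample
base = {"scenario": "consumption_only", "category": "001", "level_id": 0}
rows = [{**base, "date": f"2023-02-{d:02d}", "sample_id": 0}
        for d in range(28, 0, -1)]
rows.append({**base, "date": "2023-02-01", "sample_id": 1})
samples = group_into_samples(rows)
```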
## Level values
The `level_value` column gives the target total energy for the month in kWh. The actual monthly total computed from the power values may deviate slightly from `level_value`.
For the consumption-only and generation-only scenarios, `level_value` is a positive absolute value; the sign of the power columns indicates the direction of flow:
- `consumption_only`: positive `level_value`, positive power values
- `generation_zon` / `generation_wopl`: positive `level_value`, negative power values
- `hybrid_net_consumption`: `level_value` can be negative (net export), power values are mixed-sign
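One way to compare a sample's actual monthly energy with its `level_value` is sketched below. It assumes that, for the generation-only scenarios, `level_value` should be compared against the magnitude of the exported energy, following the sign conventions listed above. The function names are illustrative, and the toy profiles are constant-power days rather than real samples.

```python
GENERATION_SCENARIOS = ("generation_zon", "generation_wopl")


def monthly_energy_kwh(daily_profiles, scenario: str, step_hours: float = 0.25) -> float:
    """Monthly energy in kWh, sign-adjusted to be comparable with level_value."""
    signed_kwh = sum(sum(day) for day in daily_profiles) * step_hours
    if scenario in GENERATION_SCENARIOS:
        return -signed_kwh  # power is non-positive, level_value is positive
    return signed_kwh  # consumption_only and hybrid keep their sign


def relative_deviation(actual_kwh: float, level_value: float) -> float:
    """Relative deviation from the target; undefined when level_value is 0."""
    return (actual_kwh - level_value) / abs(level_value)


# Toy month: 31 days at a constant 2 kW drawn from, or exported to, the grid
consumption = [[2.0] * 96 for _ in range(31)]
generation = [[-2.0] * 96 for _ in range(31)]
consumed_kwh = monthly_energy_kwh(consumption, "consumption_only")
generated_kwh = monthly_energy_kwh(generation, "generation_zon")
deviation = relative_deviation(consumed_kwh, 1500.0)
```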
## Important notes
1. **Wind data coverage is limited.** The `generation_wopl` scenario only contains data for October-February (winter months). There is not enough training data to support generating good quality March-September wind data.
2. **Low variance within some conditions.** For certain conditions, the generated samples may have low diversity. Theoretically, the conditional distribution p(x|y) can have very low variance for specific conditioning values y. This is a property of the generative model inherited from training data, not a data error.
3. **Sign clamping applied.** Small sign violations caused by numerical noise in the generative model's ODE integration have been clamped to zero. As a result, consumption values are always non-negative and generation values are always non-positive. The `hybrid_net_consumption` scenario is unconstrained.
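The clamping described in note 3 amounts to the following operation. This is a minimal sketch of the idea, not the authors' post-processing code, and `clamp_signs` is a hypothetical name.

```python
def clamp_signs(power_kw, scenario: str) -> list[float]:
    """Clamp small sign violations to zero according to the scenario's convention."""
    if scenario == "consumption_only":
        return [max(v, 0.0) for v in power_kw]  # no negative consumption
    if scenario in ("generation_zon", "generation_wopl"):
        return [min(v, 0.0) for v in power_kw]  # no positive generation
    return list(power_kw)  # hybrid_net_consumption is left unconstrained


consumption = clamp_signs([1.0, -1e-6, 0.0], "consumption_only")
generation = clamp_signs([-2.0, 1e-6, 0.0], "generation_zon")
hybrid = clamp_signs([1.0, -1.0], "hybrid_net_consumption")
```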
## License
This dataset is licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
## Reference
This dataset was generated by a flow-matching generative model trained on smart meter data from [Alliander N.V.](https://www.alliander.com/).
Please cite us using:
**Dataset DOI:** `10.4121/b18de4df-0f67-4a6f-aa84-6634cdd63991`
**Dataset description paper DOI:** `TBD`
**Model paper:**
- Nan Lin, Yanbo Wang, Jacco Heres, Peter Palensky, and Pedro P. Vergara.
_**SmartMeterFM: Unifying Smart Meter Data Generative Tasks Using Flow Matching Models.**_ arXiv preprint, 2025. [arXiv:2601.21706](https://arxiv.org/abs/2601.21706)
- DOI: `10.48550/arXiv.2601.21706`
## Acknowledgements
This dataset is part of the project ALIGN4energy (with project number NWA.1389.20.251) of the research programme NWA ORC 2020 which is (partly) financed by the Dutch Research Council (NWO). TU Delft and Alliander are partners of the ALIGN4energy project.
Provider: OpenSynth