Public Utility Data Liberation Project (PUDL) Data Release
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/3653158
Description:
PUDL v2025.2.0 Data Release
This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.
One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
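As a rough illustration of what Parquet access could look like, the sketch below builds the URI of a single table's Parquet file in the public S3 bucket listed among the resources for this release. The bucket and version prefix come from these notes; the one-file-per-table layout and the `<table_name>.parquet` naming are assumptions, and `parquet_uri` is a hypothetical helper, not part of PUDL.

```python
# Free, public bucket named in these release notes.
PUDL_S3_BASE = "s3://pudl.catalyst.coop"


def parquet_uri(table_name: str, version: str = "v2025.2.0") -> str:
    """Build the URI of one table's Parquet file.

    ASSUMPTION: each table is stored as <table_name>.parquet directly
    under the version prefix. Check the Data Access docs for the real layout.
    """
    return f"{PUDL_S3_BASE}/{version}/{table_name}.parquet"


# Table name taken from the FERC 1 outputs added in this release.
uri = parquet_uri("out_ferc1__yearly_detailed_income_statements")
print(uri)
```

With pandas and an S3-capable filesystem package such as s3fs installed, something like `pandas.read_parquet(uri, storage_options={"anon": True})` should be able to load the table, provided the assumed layout holds.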
Some potentially breaking changes to be aware of:
In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report, a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.
We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.
Many thanks to the organizations that make these regular updates possible, especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.
New Data
EIA 176
Added a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.
Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.
EIA 860
Added EIA 860 Multifuel table. See #3438 and #3946.
FERC 1
Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:
out_ferc1__yearly_detailed_income_statements
out_ferc1__yearly_detailed_balance_sheet_assets
out_ferc1__yearly_detailed_balance_sheet_liabilities
SEC Form 10-K Parent-Subsidiary Ownership
We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository, separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.
See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:
out_sec10k__parents_and_subsidiaries
core_sec10k__quarterly_filings
core_sec10k__quarterly_exhibit_21_company_ownership
core_sec10k__quarterly_company_information
Expanded Data Coverage
EPA CEMS
Added 2024 Q4 of CEMS data. See #4041 and #4052.
EPA CAMD EIA Crosswalk
In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa__assn_eia_epacamd.
The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs: some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa__assn_eia_epacamd_subplant_ids table. See issue #4039 and PR #4056 for a discussion of the changes observed in the course of this update.
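To see why new crosswalk matches can change subplant IDs, it may help to think of a subplant as a connected group of EPA units and EIA generators. The sketch below assumes exactly that and uses a small union-find to assign group numbers. The function name, the link format, and the unit and generator labels are hypothetical; this is not the PUDL or EPA implementation.

```python
def assign_subplant_ids(links):
    """Number the connected groups formed by (epa_unit, eia_generator) links.

    ASSUMPTION: a subplant is a connected component of the bipartite graph
    linking EPA units to EIA generators. IDs here are illustrative only.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for epa_unit, eia_generator in links:
        root_a = find(("epa", epa_unit))
        root_b = find(("eia", eia_generator))
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    # Number groups in order of first appearance.
    group_ids = {}
    result = {}
    for node in list(parent):
        root = find(node)
        group_ids.setdefault(root, len(group_ids))
        result[node] = group_ids[root]
    return result


old_links = [("U1", "G1"), ("U2", "G2")]
new_links = old_links + [("U1", "G2")]  # one newly discovered match

old_ids = assign_subplant_ids(old_links)
new_ids = assign_subplant_ids(new_links)
print(len(set(old_ids.values())), len(set(new_ids.values())))
```

In this toy case a single new match merges two groups into one, so every ID downstream of the merge can shift even though nothing changed at the plant level.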
EIA 860M
Added EIA 860m through December 2024. See #4038 and #4047.
EIA 923
Added EIA 923 monthly data through September 2024. See #4038 and #4047.
EIA Bulk Electricity Data
Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.
EIA 930
Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. Ten new energy sources were added and three were retired; see Changes in energy source granularity over time for more information.
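Users who need a consistent series across the granularity change could roll the newer, finer categories back up to coarser ones. The sketch below shows one way to do that in plain Python. The category names and the mapping are invented for illustration; the real codes and splits are described in Changes in energy source granularity over time.

```python
from collections import defaultdict

# HYPOTHETICAL mapping from granular sources to a coarser category.
# These names are placeholders, not actual EIA 930 energy source codes.
ROLLUP = {
    "solar_with_storage": "solar",
    "solar_without_storage": "solar",
}


def roll_up(records, mapping):
    """Sum generation by (period, coarse source).

    Sources missing from the mapping are passed through unchanged.
    """
    totals = defaultdict(float)
    for period, source, mwh in records:
        coarse = mapping.get(source, source)
        totals[(period, coarse)] += mwh
    return dict(totals)


# Made-up records: (period, energy source, net generation in MWh).
records = [
    ("2024-12", "solar", 10.0),
    ("2025-01", "solar_with_storage", 4.0),
    ("2025-01", "solar_without_storage", 7.0),
]
totals = roll_up(records, ROLLUP)
print(totals)
```

Passing unmapped sources through unchanged keeps older data usable without listing every legacy category in the mapping.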
Bug Fixes
Fixed an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1__yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.
Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.
Fixed the spelling of Lake Huron and Lake Saint Clair in out_vcerare__hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.
Quality of Life Improvements
We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.
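The general pattern here, an optional registry argument that falls back to a built-in one, can be sketched with a toy class. Everything below (`ToyDataSource`, `DEFAULT_SOURCES`, the example IDs and titles) is hypothetical and only mimics the idea; it is not the actual signature or behavior of `pudl.metadata.classes.DataSource.from_id()`.

```python
from dataclasses import dataclass

# Stand-in for a built-in registry of data sources (illustrative content only).
DEFAULT_SOURCES = {"eia860": {"title": "EIA Form 860"}}


@dataclass(frozen=True)
class ToyDataSource:
    name: str
    title: str

    @classmethod
    def from_id(cls, source_id, sources=None):
        """Look up a source by ID, optionally in a caller-supplied registry."""
        registry = DEFAULT_SOURCES if sources is None else sources
        return cls(name=source_id, **registry[source_id])


# A separate tool can describe datasets that the main registry does not know.
archiver_sources = {"example_archive_only": {"title": "Example archive-only dataset"}}

builtin = ToyDataSource.from_id("eia860")
custom = ToyDataSource.from_id("example_archive_only", sources=archiver_sources)
print(builtin.title, "|", custom.title)
```

Defaulting to the built-in registry keeps existing callers unchanged, while a caller such as an archiver can pass its own dictionary for datasets that will never be ingested.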
Other PUDL v2025.2.0 Resources
PUDL v2025.2.0 Data Dictionary
PUDL v2025.2.0 Documentation
PUDL in the AWS Open Data Registry
PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/
PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/
Zenodo archive of the PUDL GitHub repo for this release
PUDL v2025.2.0 release on GitHub
PUDL v2025.2.0 package in the Python Package Index (PyPI)
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:
Follow us on GitHub
Use the PUDL GitHub issue tracker to let us know about any bugs or data issues you encounter
GitHub Discussions is where we provide user support.
Watch our GitHub Project to see what we're working on.
Email us at hello@catalyst.coop for private communications.
On Mastodon: @CatalystCoop@mastodon.energy
On BlueSky: @catalyst.coop
On Twitter: @CatalystCoop
Connect with us on LinkedIn
Play with our data and notebooks on Kaggle
Combine our data with ML models on HuggingFace
Learn more about us on our website: https://catalyst.coop
Subscribe to our announcements list for email updates.
Created: 2025-02-14



