
Webcrawling and machine learning as a new approach for the spatial distribution of atmospheric emissions

Indexed by NIAID Data Ecosystem on 2026-03-10
Download link:
https://figshare.com/articles/dataset/Webcrawling_and_machine_learning_as_a_new_approach_for_the_spatial_distribution_of_atmospheric_emissions/6822191
Description:
In this study we apply two data collection methods that are relatively new to atmospheric science. Both are designed to collect the geo-localized information needed as input for a high-resolution emission inventory for residential wood combustion (RWC). The first method is a webcrawler that systematically extracts openly available online real estate data and then structures them for analysis. The webcrawler reads online Norwegian real estate advertisements and collects the geo-position of the dwellings. Dwellings are classified by type (e.g., apartment, detached house) and by the heating systems they are equipped with. The second method is an image recognition and classification model trained with machine learning techniques. The images from the real estate advertisements are collected and processed to identify wood burning installations, which are automatically classified into the three classes used in official statistics: open fireplaces, stoves produced before 1998, and stoves produced after 1998. The model recognizes and classifies the wood appliances with a precision of 81%, 85% and 91% for open fireplaces, old stoves and new stoves, respectively. Emission factors depend heavily on technology, so this information is essential for determining accurate emissions. The collected data are compared with existing information from the statistical register at county and national level in Norway. The comparison shows good agreement between the webcrawled data and the official statistics in the proportion of residential heating systems. The high resolution and level of detail of the extracted data show the value of open data for improving emission inventories.
As the amount and availability of data increase, the techniques presented here add significant value to the accuracy of emission estimates, and their application should be considered across all emission sectors.
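
The per-class precision figures quoted above follow the usual definition: of all images the model assigns to a class, the fraction that truly belong to it. The sketch below illustrates that calculation only. The class names, the function `per_class_precision`, and the toy labels are assumptions made for illustration; they are not taken from the dataset or from the authors' code.

```python
from collections import Counter

# Hypothetical label names for the three appliance classes in the description.
CLASSES = ("open_fireplace", "old_stove", "new_stove")


def per_class_precision(y_true, y_pred, classes=CLASSES):
    """Return {class: TP / (TP + FP)}; None if the class was never predicted."""
    predicted = Counter()   # how often each class was predicted (TP + FP)
    true_pos = Counter()    # how often a prediction of that class was correct (TP)
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            true_pos[pred] += 1
    return {
        c: (true_pos[c] / predicted[c]) if predicted[c] else None
        for c in classes
    }


if __name__ == "__main__":
    # Toy labels invented for this example, not real results.
    y_true = ["open_fireplace", "open_fireplace", "old_stove",
              "old_stove", "new_stove", "new_stove"]
    y_pred = ["open_fireplace", "old_stove", "old_stove",
              "old_stove", "new_stove", "new_stove"]
    for name, value in per_class_precision(y_true, y_pred).items():
        print(name, value)
```

Precision is computed over predictions rather than over true labels, so a class that is rarely predicted can show high precision while still missing many true instances; recall would be needed to see that.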

Created:
2018-07-16