Tr4m0ryp/espresso-v2-carbon-water-data
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Tr4m0ryp/espresso-v2-carbon-water-data
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- tabular-regression
language:
- en
tags:
- environment
- sustainability
- textile
- life-cycle-assessment
- synthetic-data
- carbon-footprint
- water-footprint
- supply-chain
- lca
size_categories:
- 10K<n<100K
pretty_name: "ESPResso V2: Textile Carbon & Water Footprint Training Data"
---
# ESPResso V2: Textile Carbon & Water Footprint Training Data
Roughly 50,000 synthetic records per dataset (49,732 carbon, 50,480 water) for product-level carbon and water footprint prediction in textiles, generated by a 7-layer LLM-orchestrated pipeline with deterministic C99 calculation engines. Developed at the University of Amsterdam.
## Dataset Description
The textile industry faces mounting regulatory pressure under the EU ESPR and Digital Product Passport mandate to quantify product-level environmental footprints. Comprehensive Life Cycle Assessment (LCA) data is prohibitively expensive to produce at scale, creating a barrier for the thousands of brands that must comply.
ESPResso V2 addresses this gap with two training datasets and a per-category statistics file. The datasets span 47 product categories, 105 subcategories, and 87 base materials across fashion and apparel:
- **carbon_footprint.parquet** -- 49,732 records, 27 columns. Material composition, manufacturing sequences, transport logistics, packaging, and carbon footprint targets broken down by lifecycle stage (raw materials, transport, processing, packaging) in kgCO2e.
- **water_footprint.parquet** -- 50,480 records, 15 columns. Material composition, manufacturing, supply chain geography with AWARE water stress factors, and water footprint targets by lifecycle stage in m3 world-equivalent.
- **category_stats.json** -- Per-category statistics for all 47 product categories.
## Data Sources
The deterministic calculation engines draw on established environmental databases and standards:
- **EcoInvent 3.12**: Emission factors (kgCO2e/kg) and water use intensities (m3/kg)
- **Agribalyse 3.2**: Agricultural water footprint data for natural fibers
- **AWARE 2.0**: Country-level water stress characterization factors (range 0.1 to 100)
- **Standards**: ISO 14040/14044 (LCA framework), PEFCR v3.1 (Product Environmental Footprint Category Rules)
## Generation Methodology
The data is produced by a 7-layer pipeline that separates creative configuration from deterministic calculation:
- **Layers 1--4**: Claude Sonnet 4.6 generates realistic product configurations -- material blends, ordered manufacturing sequences, multi-leg transport routes with WGS84 coordinates, and packaging specifications.
- **Layer 5**: Claude Sonnet 4.5 validates outputs through a 5-stage quality gate: MD5 integrity checks, semantic coherence scoring (threshold >= 0.85), 3-sigma statistical outlier detection, deduplication, and reward model scoring.
- **Layers 6--7**: Deterministic C99 engines compute carbon and water footprints from the established emission factor databases listed above. No LLM is involved in the footprint calculation itself.
**Key design principle**: LLMs generate realistic supply chain configurations; deterministic engines compute ground-truth footprints from peer-reviewed emission factor databases.
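
Two of the Layer 5 checks, deduplication and 3-sigma outlier detection, are simple enough to sketch. The snippet below is an illustration under stated assumptions, not the project's code: `record_fingerprint`, `deduplicate`, and `three_sigma_outliers` are hypothetical helper names, and using an MD5 digest of a canonical JSON form to spot exact duplicates is only one plausible reading of the gate described above.

```python
import hashlib
import json
import statistics


def record_fingerprint(record: dict) -> str:
    """MD5 hex digest of a canonical JSON form (sorted keys), so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct fingerprint."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        fp = record_fingerprint(rec)
        if fp not in seen:
            seen.add(fp)
            unique.append(rec)
    return unique


def three_sigma_outliers(values: list[float]) -> list[bool]:
    """Flag values more than 3 population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > 3 for v in values]


# Hypothetical toy records; field names follow the column schema, values are invented.
records = [
    {"category_name": "Jeans", "total_weight_kg": 0.8},
    {"total_weight_kg": 0.8, "category_name": "Jeans"},  # same content, different key order
    {"category_name": "Dresses", "total_weight_kg": 0.3},
]
unique = deduplicate(records)

# Twenty typical weights plus one extreme value.
weights = [0.5] * 20 + [50.0]
flags = three_sigma_outliers(weights)
```

Here `unique` keeps two of the three records, and only the last entry of `weights` is flagged.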
## Column Schema
### carbon_footprint.parquet (49,732 records, 27 columns)
| Column | Type | Description |
|--------|------|-------------|
| record_id | string | Unique record identifier |
| category_name | string | Product category (47 values, e.g., Dresses, Jeans, Knitwear) |
| subcategory_name | string | Product subcategory (105 values) |
| materials | string (JSON list) | List of material names |
| material_percentages | string (JSON list) | Weight percentages per material (sum to 100%) |
| total_weight_kg | float64 | Total product weight in kg |
| total_packaging_mass_kg | float64 | Total packaging mass in kg |
| preprocessing_steps | string (JSON list) | Ordered manufacturing steps |
| step_locations | string (JSON dict) | WGS84 coordinates per processing step |
| packaging_categories | string (JSON list) | Packaging material types |
| packaging_masses_kg | string (JSON list) | Mass per packaging component in kg |
| step_zscore | float64 | Quality z-score for processing steps |
| stage_coverage | float64 | Manufacturing stage coverage score (0--1) |
| material_chains | string (JSON dict) | Per-material processing chains with coordinates |
| road_km | float64 | Total road transport distance in km |
| sea_km | float64 | Total sea transport distance in km |
| rail_km | float64 | Total rail transport distance in km |
| air_km | float64 | Total air transport distance in km |
| inland_waterway_km | float64 | Total inland waterway distance in km |
| total_transport_distance_km | float64 | Sum of all transport distances in km |
| road_frac | float64 | Road fraction of total transport |
| sea_frac | float64 | Sea fraction of total transport |
| cf_raw_materials_kg_co2e | float64 | **Target:** Raw materials carbon footprint in kgCO2e |
| cf_transport_kg_co2e | float64 | **Target:** Transport carbon footprint in kgCO2e |
| cf_processing_kg_co2e | float64 | **Target:** Processing carbon footprint in kgCO2e |
| cf_packaging_kg_co2e | float64 | **Target:** Packaging carbon footprint in kgCO2e |
| is_outlier | bool | Statistical outlier flag |
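
Several columns are stored as JSON-encoded strings, so they need decoding before use. A minimal sketch with a hypothetical row: the values (including the material names) are invented for illustration, and only the column names come from the schema above.

```python
import json

# Hypothetical row: values are illustrative, column names follow the schema.
row = {
    "materials": '["cotton", "elastane"]',
    "material_percentages": "[95.0, 5.0]",
    "total_weight_kg": 0.6,
}

materials = json.loads(row["materials"])
percentages = json.loads(row["material_percentages"])

# Percentages are documented to sum to 100; allow a small rounding tolerance.
assert abs(sum(percentages) - 100.0) < 1e-6

# Per-material mass in kg, derived from the blend and the total product weight.
material_mass_kg = {
    name: row["total_weight_kg"] * pct / 100.0
    for name, pct in zip(materials, percentages)
}
```

With pandas, the same decoding can be applied column-wise, for example with `df["materials"].apply(json.loads)`.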
### water_footprint.parquet (50,480 records, 15 columns)
| Column | Type | Description |
|--------|------|-------------|
| record_id | string | Unique record identifier |
| category_name | string | Product category (47 values) |
| subcategory_name | string | Product subcategory (105 values) |
| materials | string (JSON list) | List of material names |
| material_weights_kg | string (JSON list) | Per-material weights in kg |
| material_percentages | string (JSON list) | Weight percentages per material |
| preprocessing_steps | string (JSON list) | Ordered manufacturing steps |
| total_weight_kg | float64 | Total product weight in kg |
| total_packaging_mass_kg | float64 | Total packaging mass in kg |
| packaging_categories | string (JSON list) | Packaging material types |
| material_journeys | string (JSON list) | Origin/processing countries with coordinates and AWARE factors |
| wf_raw_materials_m3_world_eq | float64 | **Target:** Raw materials water footprint in m3 world-eq |
| wf_processing_m3_world_eq | float64 | **Target:** Processing water footprint in m3 world-eq |
| wf_packaging_m3_world_eq | float64 | **Target:** Packaging water footprint in m3 world-eq |
| wf_total_m3_world_eq | float64 | **Target:** Total water footprint in m3 world-eq |
## Target Variable Statistics
### Carbon footprint targets
| Target | Min (kgCO2e) | Max (kgCO2e) | Mean (kgCO2e) | Std (kgCO2e) |
|--------|-----|-----|------|-----|
| Raw materials | 0.073 | 47.672 | 4.347 | 5.153 |
| Transport | 0.000 | 8.465 | 0.263 | 0.245 |
| Processing | 0.033 | 27.881 | 3.729 | 2.898 |
| Packaging | 0.066 | 0.749 | 0.246 | 0.102 |
| Total | 0.294 | 67.946 | 8.585 | 7.453 |
### Water footprint targets
| Target | Min (m3 world-eq) | Max (m3 world-eq) | Mean (m3 world-eq) | Std (m3 world-eq) |
|--------|-----|-----|------|-----|
| Raw materials | 0.000 | 86.418 | 4.247 | 6.840 |
| Processing | 0.001 | 25.965 | 1.295 | 1.588 |
| Packaging | 0.000 | 0.007 | 0.002 | 0.001 |
| Total | 0.006 | 112.385 | 5.544 | 7.716 |
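
A quick arithmetic check on the two tables: the stage means add up to the Total means, which suggests the Total rows are plain sums of the stage targets (for carbon, without the 1.02 multiplier used in the carbon formulation). The snippet only re-adds the published means and makes no claim about how the totals were actually produced.

```python
# Mean values copied from the two tables above.
carbon_stage_means = {
    "raw_materials": 4.347,
    "transport": 0.263,
    "processing": 3.729,
    "packaging": 0.246,
}
water_stage_means = {
    "raw_materials": 4.247,
    "processing": 1.295,
    "packaging": 0.002,
}

carbon_total_mean = round(sum(carbon_stage_means.values()), 3)
water_total_mean = round(sum(water_stage_means.values()), 3)

print(carbon_total_mean)  # 8.585, matching the carbon Total row (kgCO2e)
print(water_total_mean)   # 5.544, matching the water Total row (m3 world-eq)
```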
## Footprint Formulations
### Carbon footprint
$$CF_{\text{total}} = (CF_{\text{raw}} + CF_{\text{processing}} + CF_{\text{transport}} + CF_{\text{packaging}}) \times 1.02$$
Where:
- $CF_{\text{raw}} = \sum_i w_i \cdot EF_i$ -- mass times emission factor per material
- $CF_{\text{processing}} = \sum_s EF_{\text{step}(s)} \cdot w_{\text{material}(s)}$ -- energy intensity times mass per manufacturing step
- $CF_{\text{transport}} = \sum_{l=1}^{L} d_l \cdot m \cdot EF_{\text{mode}(l)}$ -- per-leg distance, mass, and transport mode factor
- $CF_{\text{packaging}} = \sum_p w_p \cdot EF_p$ -- packaging mass times emission factor
The 1.02 multiplier accounts for a 2% end-of-life overhead factor.
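
The formulation above can be sketched in a few lines of Python. This is an illustration, not the project's C99 engine: the helper names are hypothetical, every emission factor is a placeholder rather than an EcoInvent value, and the transport factors are assumed to be expressed per kg-km so that the units match the formula directly.

```python
END_OF_LIFE_FACTOR = 1.02  # 2% end-of-life overhead, from the formula above


def mass_times_factor(masses_kg: dict[str, float], factors: dict[str, float]) -> float:
    """Sum of mass x factor over the keys of masses_kg."""
    return sum(mass * factors[name] for name, mass in masses_kg.items())


def transport_cf(legs: list[tuple[float, str]], mass_kg: float, mode_ef: dict[str, float]) -> float:
    """Sum over legs of distance x shipped mass x mode emission factor."""
    return sum(distance_km * mass_kg * mode_ef[mode] for distance_km, mode in legs)


# Placeholder inputs (NOT EcoInvent values). Material, step, and packaging
# factors are in kgCO2e/kg; transport factors are assumed to be in kgCO2e per kg-km.
cf_raw = mass_times_factor(
    {"cotton": 0.57, "elastane": 0.03}, {"cotton": 2.0, "elastane": 10.0}
)
cf_processing = mass_times_factor(
    {"spinning": 0.6, "dyeing": 0.6}, {"spinning": 1.0, "dyeing": 2.0}
)
cf_transport = transport_cf(
    [(500.0, "road"), (10000.0, "sea")], 0.6, {"road": 1e-4, "sea": 1e-5}
)
cf_packaging = mass_times_factor({"cardboard": 0.1}, {"cardboard": 1.0})

cf_total = (cf_raw + cf_processing + cf_transport + cf_packaging) * END_OF_LIFE_FACTOR
print(f"{cf_total:.4f} kgCO2e")
```

With these placeholder inputs the four terms are 1.44, 1.80, 0.09, and 0.10 kgCO2e, so the total is 3.43 x 1.02 = 3.4986 kgCO2e.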
### Water footprint
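The water footprint follows the AWARE 2.0 methodology, where location-specific water stress characterization factors amplify volumetric water consumption. Here $w$ is a mass in kg, $WU$ a water use intensity in m3/kg, and $CF_{\text{AWARE}}(c)$ the AWARE characterization factor of country $c$ (not to be confused with the carbon footprint symbol $CF$ above):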
- $WF_{\text{raw}} = \sum_i w_i \cdot WU_i \cdot CF_{\text{AWARE}}(c_i)$
- $WF_{\text{processing}} = \sum_s WU_{\text{step}(s)} \cdot w_{\text{material}(s)} \cdot CF_{\text{AWARE}}(c_s)$
- $WF_{\text{packaging}} = \sum_p w_p \cdot WU_p \cdot CF_{\text{AWARE}}(c_p)$
AWARE characterization factors range from 0.1 (water-abundant regions) to 100 (severely water-stressed regions), so the same physical water volume can produce water footprints that differ by 40--100x depending on location.
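
The effect of the characterization factor is easiest to see with the same physical input placed in two locations. The sketch below uses a hypothetical helper and placeholder factors for two invented regions; none of the numbers are real AWARE 2.0 or EcoInvent values.

```python
def aware_weighted_footprint(items, aware_cf):
    """Sum of mass (kg) x water use (m3/kg) x AWARE factor of the item's country.

    items: iterable of (mass_kg, water_use_m3_per_kg, country_code) tuples.
    aware_cf: mapping from country code to AWARE characterization factor.
    """
    return sum(mass * wu * aware_cf[country] for mass, wu, country in items)


# Placeholder factors for two hypothetical regions (not real AWARE 2.0 values).
aware_cf = {"LOW_STRESS": 1.0, "HIGH_STRESS": 50.0}

# Same material, same mass, same volumetric water use; only the location differs.
wf_low = aware_weighted_footprint([(0.5, 2.0, "LOW_STRESS")], aware_cf)
wf_high = aware_weighted_footprint([(0.5, 2.0, "HIGH_STRESS")], aware_cf)

print(wf_low, wf_high, wf_high / wf_low)
```

Both cases consume 1 m3 of water, but the footprints are 1 and 50 m3 world-eq, a 50x difference driven entirely by geography.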
## Intended Use
**Primary use**: Train machine learning models to predict product-level carbon and water footprints from partial supply chain data. The companion ESPResso V2 models achieve $R^2 = 0.988$ (carbon) and $R^2 = 0.969$ (water) on held-out test sets.
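
For reference, the coefficient of determination is usually computed as one minus the ratio of residual to total sum of squares. The sketch below implements that standard definition on toy numbers. Whether the companion models report exactly this variant (for example, per target or averaged across targets) is not stated here, so treat it as an assumption.

```python
def r2_score(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not model outputs).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r2_score(y_true, y_pred)
print(f"{r2:.2f}")
```

Here the residual sum of squares is 0.10 and the total sum of squares is 5.0, so the score is 0.98.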
**Additional uses**:
- Benchmarking multi-output regression architectures on environmental impact data
- Studying material-geography-footprint relationships in textile supply chains
- Building sustainable fashion recommendation or decision-support systems
- Educational use in LCA and environmental informatics courses
## Limitations and Out-of-Scope Uses
- This dataset is **not a substitute** for formal LCA conducted by certified practitioners under ISO 14040/14044.
- The data is **synthetic** -- it covers realistic but not exhaustive product configurations.
- Emission factors are drawn from **EcoInvent 3.12 (2024)** and will require updating as databases are revised.
- Coverage is limited to **47 textile product categories**; non-textile products are not represented.
- Transport distances are computed from WGS84 coordinates and may not reflect actual routing.
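
To illustrate the last point: a common way to turn two WGS84 coordinate pairs into a distance is the haversine great-circle formula, sketched below. Whether the pipeline uses this exact formula is an assumption. Either way, a great-circle distance is a lower bound on real road, rail, or sea routes, which is why the computed distances may differ from actual routing.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Illustrative leg: Amsterdam to Shanghai (approximate city coordinates).
leg_km = haversine_km(52.37, 4.90, 31.23, 121.47)
print(f"{leg_km:.0f} km")
```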
## Usage Example
Using the Hugging Face `datasets` library:
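```python
from datasets import load_dataset

repo = "Tr4m0ryp/espresso-v2-carbon-water-data"

# The two files have different schemas, so each is loaded separately.
# With no pre-defined splits, each file loads as a single "train" split.
carbon = load_dataset(repo, data_files="carbon_footprint.parquet", split="train")
print(f"Carbon records: {len(carbon)}")

water = load_dataset(repo, data_files="water_footprint.parquet", split="train")
print(f"Water records: {len(water)}")
```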
Or load directly with pandas:
```python
import pandas as pd

# hf:// paths are resolved through the huggingface_hub package, which must be installed.
carbon = pd.read_parquet(
    "hf://datasets/Tr4m0ryp/espresso-v2-carbon-water-data/carbon_footprint.parquet"
)
water = pd.read_parquet(
    "hf://datasets/Tr4m0ryp/espresso-v2-carbon-water-data/water_footprint.parquet"
)
```
## Dataset Splits
No pre-defined splits are provided. The companion ESPResso V2 models use a 70/15/15 train/validation/test split stratified by product category.
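
A 70/15/15 split stratified by category can be reproduced in several ways. The sketch below is a standard-library version with a hypothetical `stratified_split` helper and an arbitrary seed; it is not the split used by the companion models, whose seed and exact procedure are not given here. With scikit-learn, two calls to `train_test_split` with the `stratify` argument achieve the same effect.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists, splitting within each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):  # sorted for a deterministic group order
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy category labels; in practice this would be the category_name column.
labels = ["Jeans"] * 100 + ["Dresses"] * 60 + ["Knitwear"] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each category contributes 70% of its records to training and 15% each to validation and test, so rare categories stay represented in all three sets.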
## Citation
```bibtex
@misc{espresso-v2-2026,
  title={ESPResso V2: LLM-Orchestrated Synthetic Data Pipeline and Neural Estimation
         of Product-Level Carbon and Water Footprints in Textiles},
  author={Ouallaf, Moussa},
  year={2026},
  institution={University of Amsterdam},
  url={https://github.com/tr4m0ryp/ESPResso-V2}
}
```
## License
CC BY-SA 4.0. If you use this dataset, please cite the ESPResso V2 project.
## Acknowledgments
- University of Amsterdam
- UvA AI Chat (LLM API access for data generation)
- EcoInvent 3.12, Agribalyse 3.2, AWARE 2.0 (environmental databases)
- ISO 14040/14044, PEFCR v3.1 (methodological standards)
Provided by:
Tr4m0ryp



