NLR HPC Eagle Jobs Data and Additional Energy Metrics
Source: DataCite Commons. Updated 2026-04-22; indexed 2026-04-25.
Download link:
https://www.osti.gov/servlets/purl/3023273
Resource description:
<p><strong>Overview: </strong>Anonymized job-level records from the Eagle high-performance computing (HPC) system at the National Laboratory of the Rockies (NLR). Each record represents a Slurm batch job with scheduling metadata, resource requests, resource utilization, CPU/GPU energy consumption, and efficiency metrics. Sensitive fields (user, account, job name) are replaced with cryptographic hashes.</p>
<p><strong>System & Timeframe: </strong>Eagle was a 2,000-node, 8-petaflop system operated at NLR from 2019 to 2024. The data covers the full operational lifetime of the system. Slurm data was processed nightly; timestamps are in Mountain Time. Funding was provided by the U.S. Department of Energy, EERE.</p>
<p><strong>Files:</strong></p>
<ul>
<li>esif.hpc.eagle.job-anon.zip — Core anonymized job records (Hive-partitioned Parquet)</li>
<li>esif.hpc.eagle.job-anon-energy-metrics.zip — Same records with additional iLO and Ganglia energy metrics</li>
<li>datacard.md — Full dataset documentation</li>
</ul>
<p>~13.8 million rows, 62 variables. Readable with PyArrow, pandas, DuckDB, Apache Spark, or any Parquet-compatible tool.</p>
<p><strong>Data Collection: </strong>Jobs collected via sacct through a pipeline: Eagle Jobs API → Redpanda → StreamSets → HPCMON API → PostgreSQL. Node-level power from iLO (HP Integrated Lights-Out); GPU power from Ganglia monitoring, joined to jobs via node lists and time ranges.</p>
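<p>The join of power samples to jobs via node lists and time ranges can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the sample tuple layout, the node names, and the fixed 60-second sampling interval are all assumptions made for the example.</p>

```python
from datetime import datetime, timedelta


def job_energy_watt_hours(job_nodes, start, end, samples, interval_s=60):
    """Sum power samples recorded on the job's nodes within [start, end).

    samples: iterable of (node, timestamp, watts) tuples (hypothetical layout).
    Each sample is assumed to hold for `interval_s` seconds.
    Returns None when no sample matches, mirroring a null energy field.
    """
    nodes = set(job_nodes)
    matched = [
        watts
        for node, ts, watts in samples
        if node in nodes and start <= ts < end
    ]
    if not matched:
        return None
    # watts * seconds = joules; divide by 3600 to get watt-hours
    return sum(matched) * interval_s / 3600.0


# Hypothetical job on node "r1n1" running for three minutes at a steady 300 W.
t0 = datetime(2020, 1, 1, 12, 0)
demo_samples = [("r1n1", t0 + timedelta(minutes=i), 300.0) for i in range(3)]
demo_energy = job_energy_watt_hours(
    ["r1n1"], t0, t0 + timedelta(minutes=3), demo_samples
)
print(demo_energy)  # 15.0
```

<p>Returning None for jobs with no matching samples keeps "no monitoring coverage" distinct from "zero energy used".</p>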
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Anonymization of name, user, and account fields via cryptographic hashing</li>
<li>Derived columns: queue_wait, cpu_eff, max_mem_eff</li>
<li>Simplified job state mapping (e.g., "CANCELLED BY 12345" → "CANCELLED")</li>
<li>QoS accounting rules (buy-in, standby, or Slurm QoS value)</li>
<li>CPU energy estimated from TDP (200W, Intel Xeon Gold 6154, 18 cores)</li>
<li>Timezone-aware columns (_tz) sourced from LEX accounting database to correctly handle DST transitions</li>
</ul>
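<p>Several of the preprocessing steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to build the dataset: the hash algorithm (SHA-256 with a salt), the salt value, the function names, and the per-core TDP formula are all hypothetical choices. Only the 200 W TDP, the 18-core count, and the "CANCELLED BY 12345" example come from the description.</p>

```python
import hashlib
import re
from datetime import datetime, timedelta

TDP_WATTS = 200.0      # TDP stated in the dataset description
CORES_PER_SOCKET = 18  # Intel Xeon Gold 6154


def anonymize(value, salt=b"hypothetical-salt"):
    """Replace a sensitive string with a hex digest (SHA-256 is an assumption)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def simplify_state(state):
    """Keep only the leading state token, e.g. 'CANCELLED BY 12345' -> 'CANCELLED'."""
    match = re.match(r"[A-Z_]+", state.strip())
    return match.group(0) if match else state


def queue_wait(submit_time, start_time):
    """Time a job spent waiting in the queue, as a timedelta."""
    return start_time - submit_time


def cpu_energy_tdp_watt_hours(processors, wallclock_seconds):
    """Estimate CPU energy by charging each core an equal share of the TDP.

    Assumed formula: cores * (TDP / cores per socket) * hours.
    """
    return processors * TDP_WATTS / CORES_PER_SOCKET * wallclock_seconds / 3600.0


print(simplify_state("CANCELLED BY 12345"))
print(queue_wait(datetime(2020, 1, 1, 8, 0), datetime(2020, 1, 1, 8, 5)))
print(cpu_energy_tdp_watt_hours(36, 3600))
```

<p>Under this assumed formula, calling the energy function with requested processors and requested wallclock would give a "max" estimate, and calling it with used processors and used wallclock would give a "used" estimate.</p>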
<p><strong>Key Variables: </strong></p>
<p>Scheduling: job_id, partition, state_simple, submit_time_tz, start_time_tz, end_time_tz, queue_wait</p>
<p>Resources: nodes_req/used, processors_req/used, memory_req, wallclock_req/used, gpus_requested</p>
<p>Efficiency: cpu_eff, max_mem_eff</p>
<p>Energy: cpu_energy_tdp_estimated_max/used_watt_hours, node_energy_total_watt_hours (iLO), gpu0/1_energy_total_watt_hours (Ganglia)</p>
<p><strong>Partitions:</strong> bigmem, bigmem-8600, bigscratch, csc, dav, ddn, debug, gpu, haswell, long, mono, short, standard</p>
<p><strong>Job States:</strong> CANCELLED, COMPLETED, FAILED, NODE_FAIL, OUT_OF_MEMORY, PENDING, RUNNING, TIMEOUT</p>
<p><strong>QoS Levels:</strong> Unknown, normal, buy-in, debug, penalty, high, standby</p>
<p><strong>Important Notes:</strong></p>
<ul>
<li>Non-_tz timestamp columns may be off by one hour across DST boundaries; use _tz columns for time difference calculations</li>
<li>Energy fields are null for jobs without monitoring coverage</li>
<li>Job step records and raw Slurm JSONB fields are excluded from this extract</li>
<li>Do not attempt to re-identify individuals from hashed fields</li>
</ul>
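<p>The one-hour DST discrepancy noted above can be reproduced with a small example. The job below is hypothetical; it spans the end of daylight saving time in Mountain Time on 2019-11-03, when clocks moved from MDT (UTC-6) back to MST (UTC-7). Fixed offsets stand in for the offsets that the _tz columns are assumed to carry.</p>

```python
from datetime import datetime, timedelta, timezone

MDT = timezone(timedelta(hours=-6), "MDT")  # Mountain Daylight Time
MST = timezone(timedelta(hours=-7), "MST")  # Mountain Standard Time

# Hypothetical job: starts before the fall-back transition, ends after it.
start_tz = datetime(2019, 11, 3, 0, 30, tzinfo=MDT)
end_tz = datetime(2019, 11, 3, 3, 30, tzinfo=MST)

# Wall-clock difference, as a non-_tz column pair would give it.
naive_elapsed = end_tz.replace(tzinfo=None) - start_tz.replace(tzinfo=None)

# Offset-aware difference: both values are normalized to UTC before subtracting.
aware_elapsed = end_tz - start_tz

print(naive_elapsed)  # 3:00:00
print(aware_elapsed)  # 4:00:00
```

<p>The job actually ran for four hours, but the wall-clock timestamps suggest three. The same effect in the opposite direction occurs across the spring transition.</p>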
Provider:
National Laboratory of the Rockies
Created:
2026-04-01



