Replication Package for: When Should I Run My Application Benchmark? Studying Cloud Performance Variability for the Case of Stream Processing Applications
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14617187
Description:
This replication package contains the data and scripts to replicate the results of our paper "When Should I Run My Application Benchmark?: Studying Cloud Performance Variability for the Case of Stream Processing Applications".
It contains throughput measurements of 2366 executions of ShuffleBench, an open-source application benchmark for stream processing frameworks. The experiments were conducted over a period of more than 3 months in 2024 in AWS EKS Kubernetes clusters in two AWS regions (us-east-1 and eu-central-1) and with two different machine types (m6i and m6g).
Dataset Description
The benchmark results are located in the results directory, following the structure
results/{DATE}_{TIME}-{INSTANCE}-{REGION}/results/exp0_250000_9_generic_throughput_{IDX}.csv
where:
{DATE} is the date of the execution in the format YYYY-MM-DD,
{TIME} is the time of the execution in the format HH-MM-SS,
{INSTANCE} is the instance type used for the execution (m6i or m6g),
{REGION} is the AWS region used for the execution (useast1 or eucentral1),
{IDX} is the number of the repetition of an execution (1-3).
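The naming scheme above can be decoded programmatically, which is handy when grouping executions by instance type, region, or start time. The following is a minimal sketch, not part of the replication package: the names `RunFile` and `parse_result_path` are hypothetical, and the example path in the usage note is made up to match the pattern. The source does not state the time zone of `{DATE}` and `{TIME}`, so the sketch returns a naive `datetime`.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

# Run directory name: {DATE}_{TIME}-{INSTANCE}-{REGION}
RUN_DIR = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2}-\d{2})"
    r"-(?P<instance>m6i|m6g)-(?P<region>useast1|eucentral1)$"
)
# CSV file name; {IDX} is the repetition number
CSV_NAME = re.compile(r"^exp0_250000_9_generic_throughput_(?P<idx>\d+)\.csv$")


class RunFile(NamedTuple):
    """Hypothetical container for the fields encoded in a result path."""
    started: datetime   # naive; the time zone is not stated in the description
    instance: str       # "m6i" or "m6g"
    region: str         # "useast1" or "eucentral1"
    repetition: int     # value of {IDX}


def parse_result_path(path: str) -> Optional[RunFile]:
    """Parse .../{DATE}_{TIME}-{INSTANCE}-{REGION}/results/<csv>; None if no match."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[-2] != "results":
        return None
    run = RUN_DIR.match(parts[-3])
    csv_file = CSV_NAME.match(parts[-1])
    if run is None or csv_file is None:
        return None
    started = datetime.strptime(
        f"{run['date']} {run['time']}", "%Y-%m-%d %H-%M-%S"
    )
    return RunFile(started, run["instance"], run["region"], int(csv_file["idx"]))
```

For example, `parse_result_path("results/2024-05-01_12-00-00-m6i-useast1/results/exp0_250000_9_generic_throughput_1.csv")` yields a `RunFile` with instance `m6i`, region `useast1`, and repetition `1`. Paths that do not follow the scheme return `None` rather than raising, so the function can be used as a filter while walking the results directory.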
Each of these CSV files contains the throughput measurements of one benchmark execution, recorded every 5 seconds. The important columns are:
timestamp in epoch seconds,
value the measured throughput in records per second as obtained with the ad-hoc throughput metric of ShuffleBench.
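To illustrate how these two columns might be consumed outside the notebook, here is a minimal sketch using only the Python standard library. It assumes each CSV file has a header row containing the column names `timestamp` and `value`. The function names and the `skip_seconds` parameter (for discarding an initial warm-up phase) are hypothetical and not necessarily what the authors' analysis does. The sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from statistics import mean
from typing import Iterable, List, Tuple


def read_throughput(lines: Iterable[str]) -> List[Tuple[float, float]]:
    """Return (timestamp, value) pairs from CSV text that has a header row."""
    reader = csv.DictReader(lines)
    # Any additional columns in the file are simply ignored.
    return [(float(row["timestamp"]), float(row["value"])) for row in reader]


def mean_throughput(samples: List[Tuple[float, float]],
                    skip_seconds: float = 0.0) -> float:
    """Average throughput, optionally ignoring the first skip_seconds of the run."""
    start = samples[0][0]
    kept = [value for ts, value in samples if ts - start >= skip_seconds]
    return mean(kept)


# Made-up sample data for illustration only (5-second spacing as described above).
example = io.StringIO(
    "timestamp,value\n"
    "1714564800,100000.0\n"
    "1714564805,200000.0\n"
    "1714564810,300000.0\n"
)
samples = read_throughput(example)
print(mean_throughput(samples))                  # average over all three samples
print(mean_throughput(samples, skip_seconds=5))  # average without the first sample
```

With a real file, the same function can be applied to an open file handle, for instance `with open(path, newline="") as f: samples = read_throughput(f)`.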
In addition to that, results/{DATE}_{TIME}-{INSTANCE}-{REGION} also contains a theodolite.log file that contains the logs of the Theodolite benchmarking tool and the logged configuration of each execution in results. Although we do not expect them to provide additional insights (since the purpose of our study was to repeatedly execute the same benchmark), we refer to the documentation of Theodolite for further details.
Dataset Analysis
To repeat our data analysis, you can use the Jupyter notebook results-analysis.ipynb following these steps:
(Optional) Create a Python virtual environment and activate it:
python3 -m venv .venv
source .venv/bin/activate
Install the required Python packages:
pip install -r requirements.txt
Start Jupyter, for example, via:
jupyter notebook
This Jupyter notebook also allows users to conduct further analysis on the dataset.
Periodic Benchmark Executor
The `periodic-executor` directory contains scripts and configuration files used to automatically execute ShuffleBench. As ShuffleBench relies on the Theodolite benchmarking framework for executing benchmarks within Kubernetes, the code here is mostly for setting up a Kubernetes cluster, installing Theodolite, configuring the benchmark executions, and collecting the benchmark results.
The periodic benchmark executor is bundled as a Docker image. It can be built and pushed to an ECR repository with the following commands:
docker build -t $ECR_REPOSITORY/$IMAGE_NAME .
docker push $ECR_REPOSITORY/$IMAGE_NAME
To run this container automatically, an AWS Elastic Container Service (ECS) task definition and a scheduled task have to be created.
For storing the benchmark results, an S3 bucket with the name shufflebench-periodic-schedule-results has to be created.
Finally, the required IAM permissions have to be set up. For confidentiality reasons, we cannot provide the exact IAM policy here, but required permissions include creation of an EKS Kubernetes cluster via eksctl (see the official documentation) and access to the S3 bucket.
Created: 2025-02-05



