Task graphs for benchmarking schedulers
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/2630384
Description:
Workflow Task Graph Dataset
This dataset contains three sets of task graphs representing different types of task workflows:
Elementary - trivial graph shapes, such as tasks with no dependencies or simple fork-join graphs. This set tests how scheduler heuristics react to basic graph scenarios that frequently form parts of larger workflows.
IRW - graphs inspired by real-world workflows, such as machine learning cross-validation or map-reduce.
Pegasus - graphs derived from those created by the Pegasus Synthetic Workflow Generators (https://github.com/pegasus-isi/WorkflowGenerator).
All of the provided task graphs are generated, and all are compatible with ESTEE (https://github.com/It4innovations/estee), which can simulate their execution on a distributed system under various scheduling heuristics and environment conditions.
Data Format
Task graphs are stored in {elementary, irw, pegasus}.zip files, each containing a JSON representation of the respective task graphs with the following fields:
`graph_name` - Task graph name
`graph_id` - Unique task graph identifier
`graph` - Task graph representation - list of tasks where each task is represented as a dictionary with the following keys:
`d`: Actual task duration in seconds (float value)
`e_d`: User estimated task duration in seconds (float value)
`cpus`: Task CPU core requirements (integer value)
`outputs`: List of task outputs (list of integers indicating sizes of task outputs in MiB)
`inputs`: List of task inputs, each given as a pair `[task_id, output_index]`. The output index is zero-based.
For example, this task graph:
[{'d': 200, 'e_d': 180, 'cpus': 1, 'outputs': [100], 'inputs': []},
{'d': 50, 'e_d': 60, 'cpus': 2, 'outputs': [], 'inputs': [[0, 0]]}]
contains two tasks. The first requires no input and a single CPU core, has an estimated duration of 180 s and an actual duration of 200 s, and produces a single output of 100 MiB. The second takes the 0-th output of task 0 as its input, requires 2 CPU cores, produces no output, and has an estimated duration of 60 s and an actual duration of 50 s.
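The relationship between `inputs` and `outputs` can be made concrete with a short sketch. It is not part of the dataset or of ESTEE; the helper names `dependency_edges` and `critical_path_seconds` are hypothetical. It assumes that `task_id` is the zero-based position of the producing task in the list (as the example's reference to task 0 suggests) and that the graph has already been decoded into Python lists and dictionaries.

```python
def dependency_edges(graph):
    """Return (producer, consumer, size_mib) triples, checking every input reference.

    Assumption: task_id is the zero-based position of the producer in `graph`.
    """
    edges = []
    for consumer, task in enumerate(graph):
        for producer, output_index in task["inputs"]:
            if not 0 <= producer < len(graph):
                raise ValueError(f"task {consumer} references unknown task {producer}")
            outputs = graph[producer]["outputs"]
            if not 0 <= output_index < len(outputs):
                raise ValueError(
                    f"task {consumer} references missing output {output_index} of task {producer}"
                )
            edges.append((producer, consumer, outputs[output_index]))
    return edges


def critical_path_seconds(graph):
    """Longest dependency chain measured in actual durations `d`.

    Ignores data transfers and CPU limits, so it is only a lower bound
    on the makespan any scheduler could reach.
    """
    n = len(graph)
    parents = [sorted({p for p, _ in task["inputs"]}) for task in graph]
    children = [[] for _ in range(n)]
    for consumer, producers in enumerate(parents):
        for producer in producers:
            children[producer].append(consumer)

    # Kahn's algorithm: process a task once all of its producers have finished.
    remaining = [len(producers) for producers in parents]
    ready = [i for i in range(n) if remaining[i] == 0]
    finish = [0.0] * n
    processed = 0
    while ready:
        task_id = ready.pop()
        start = max((finish[p] for p in parents[task_id]), default=0.0)
        finish[task_id] = start + graph[task_id]["d"]
        processed += 1
        for child in children[task_id]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    if processed != n:
        raise ValueError("task graph contains a dependency cycle")
    return max(finish, default=0.0)


example_graph = [
    {"d": 200, "e_d": 180, "cpus": 1, "outputs": [100], "inputs": []},
    {"d": 50, "e_d": 60, "cpus": 2, "outputs": [], "inputs": [[0, 0]]},
]
print(dependency_edges(example_graph))       # [(0, 1, 100)]
print(critical_path_seconds(example_graph))  # 250.0
```

For the two-task example, the only edge carries 100 MiB from task 0 to task 1, and the critical path is 200 s + 50 s.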
Parsing the data
To load the elementary task graph set in Python, run the following snippet:
import pandas as pd
graphs = pd.read_json("./elementary.zip")
If you have ESTEE installed, you can use its `json_deserialize` function to parse the JSON-encoded graphs into the ESTEE TaskGraph data structure:
from estee.serialization.dask_json import json_deserialize
graph_json = graphs.loc[0, "graph"]
graph = json_deserialize(graph_json)
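Without ESTEE, the documented fields are enough to compute simple per-graph statistics. The sketch below uses only the standard library; the `summarize` helper is hypothetical. It assumes each record carries the `graph_name`, `graph_id` and `graph` fields described above, and that `graph` may arrive either as a JSON-encoded string or as an already decoded list, so it handles both. With the DataFrame loaded earlier, such records could be obtained with `graphs.to_dict("records")`.

```python
import json


def summarize(records):
    """Compute basic statistics for each task graph record.

    Assumption: each record is a dict with `graph_name`, `graph_id` and `graph`,
    where `graph` is either a JSON string or a decoded list of task dicts.
    """
    rows = []
    for record in records:
        tasks = record["graph"]
        if isinstance(tasks, str):
            tasks = json.loads(tasks)  # decode JSON-encoded graphs
        rows.append({
            "graph_name": record["graph_name"],
            "graph_id": record["graph_id"],
            "tasks": len(tasks),
            # Actual and user-estimated work in CPU-seconds (duration * cores).
            "actual_cpu_s": sum(t["d"] * t["cpus"] for t in tasks),
            "estimated_cpu_s": sum(t["e_d"] * t["cpus"] for t in tasks),
            # Total size of all produced outputs in MiB.
            "total_output_mib": sum(sum(t["outputs"]) for t in tasks),
        })
    return rows


example_records = [{
    "graph_name": "example",
    "graph_id": 0,
    "graph": [
        {"d": 200, "e_d": 180, "cpus": 1, "outputs": [100], "inputs": []},
        {"d": 50, "e_d": 60, "cpus": 2, "outputs": [], "inputs": [[0, 0]]},
    ],
}]
for row in summarize(example_records):
    print(row)
```

Comparing `actual_cpu_s` with `estimated_cpu_s` gives a quick view of how far the user estimates in `e_d` deviate from the actual durations in `d` across a whole graph.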
Created:
2020-01-24



