Replication Data for: SOCBench-SC: Automatic Benchmarking of LLM-Based Service Compositions

Name: Replication Data for: SOCBench-SC: Automatic Benchmarking of LLM-Based Service Compositions
Creator: Robin D. Pesl; Marco Aiello; Massimo Mecella; Jerin G. Mathew
License: No description available

Source: IEEE; indexed 2026-04-17

Download link:

https://ieee-dataport.org/documents/replication-data-socbench-sc-automatic-benchmarking-llm-based-service-compositions

Description:

Automated service composition integrates independent (Web) services into complex workflows. Classical approaches relied on state machines or AI planning, whereas modern service documentation uses semi-structured OpenAPI specifications that combine natural language with structured elements. Large Language Models (LLMs) show strong potential due to their advanced semantic understanding of such specifications, yet existing benchmarks address only service discovery and lack systematic methods to evaluate generated code. We contribute SOCBench-SC, the first static code analysis framework for systematically evaluating LLM-generated service compositions. Unlike manual or dynamic analysis, SOCBench-SC provides automated, scalable, and reproducible assessment of invoked endpoints, reducing manual effort and avoiding runtime errors. We implement a prototype for Python code using reaching definition analysis and apply it to the two public benchmarks SOCBench-D and RestBench. These combine natural language tasks with expected solution endpoints. For each benchmark case, we generate the service composition code using six state-of-the-art LLMs, which is then analyzed by SOCBench-SC. Correctness is assessed using an LLM judge with manual checks. Our results show that SOCBench-SC reliably identifies invoked endpoints and enables comparative LLM ranking. Larger LLMs achieve ≈80% F1 average endpoint correctness against benchmark solutions, while revealing systematic deficits in Retrieval-Augmented Generation (RAG) and LLM agents. These findings confirm SOCBench-SC as a promising automation framework for extending service discovery benchmarks to service compositions and a foundation for extensions such as workflow ordering and reliability.
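
The description above mentions identifying invoked endpoints statically and scoring them with F1 against expected solution endpoints. The following is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes the generated code calls the `requests` library directly, and it replaces full reaching definition analysis with a much simpler tracking of string assignments in top-level, straight-line code (no branches, loops, or functions). The names `resolve`, `extract_endpoints`, and `f1`, as well as the example URLs, are hypothetical.

```python
import ast

HTTP_METHODS = {"get", "post", "put", "delete", "patch"}


def resolve(node, defs):
    """Resolve an expression to a string using the definitions that reach it."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return defs.get(node.id)  # None if no known string definition reaches here
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = resolve(node.left, defs), resolve(node.right, defs)
        if left is not None and right is not None:
            return left + right
    return None


def extract_endpoints(source):
    """Return (METHOD, url) pairs for requests.<method>(url, ...) calls.

    Simplification (assumption): only top-level statements are processed in
    order, so each variable has at most one reaching definition at any point.
    """
    defs = {}      # variable name -> string value of its latest definition
    found = set()
    for stmt in ast.parse(source).body:
        # First inspect calls, using the definitions that reach this statement.
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id == "requests"
                    and node.func.attr in HTTP_METHODS
                    and node.args):
                url = resolve(node.args[0], defs)
                if url is not None:
                    found.add((node.func.attr.upper(), url))
        # Then update definitions: a new assignment kills the previous one.
        if isinstance(stmt, ast.Assign):
            value = resolve(stmt.value, defs)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if value is None:
                        defs.pop(target.id, None)
                    else:
                        defs[target.id] = value
    return found


def f1(predicted, expected):
    """F1 score between the set of detected and the set of expected endpoints."""
    tp = len(predicted & expected)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated composition and expected solution endpoints.
generated = '''
import requests
base = "https://api.example.com"
url = base + "/orders"
requests.get(url)
url = base + "/payments"
requests.post(url, json={})
'''
expected = {
    ("GET", "https://api.example.com/orders"),
    ("POST", "https://api.example.com/payments"),
    ("GET", "https://api.example.com/users"),
}
predicted = extract_endpoints(generated)
score = f1(predicted, expected)
```

In this example two of the three expected endpoints are detected with no false positives, so precision is 1, recall is 2/3, and the F1 score is 0.8. A real reaching definition analysis would additionally merge definitions across control-flow paths, which this sketch deliberately omits.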

Provided by:

Robin D. Pesl; Marco Aiello; Massimo Mecella; Jerin G. Mathew
