LLM4DS Benchmark Dataset and Analysis Scripts (814 DS problems)

DataCite Commons | Updated 2026-05-05 | Indexed 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.20032399
Description:
This archive supports the paper “Benchmarking Large Language Models for Data Science Coding: A Multidimensional Evaluation.”

It contains the experimental dataset and analysis code for evaluating seven large language models on 814 Python data science coding problems: 71 algorithmic, 660 analytical, and 83 visualization problems. The released JSON files include the problem identifiers used in the experiment, task categories, difficulty levels, task metadata required for analysis, and persisted model-attempt records with execution/evaluation outcomes across up to three attempts per model/problem pair.

The archive includes scripts for reproducing the paper’s results, including overall success rate, pass@1, performance by difficulty and task type, analytical execution time, visualization similarity, code similarity, feedback-guided retry behavior, output consistency, token usage, and cost per solved problem. The main reproducibility script is scripts/analysis.py, which generates summary tables, statistical tests, CSV outputs, and figures.

The evaluated models are Claude Sonnet 4.5, GPT-4.1, o3-mini, GPT-4o, Gemini 2.5 Pro, Perplexity Sonar, and Qwen3-Coder.

The repository also includes prompt templates, analysis outputs, generated figures, and supporting scripts. The Deno + TypeScript codebase was used as the original evaluation harness for prompting models, executing generated code, recording attempt-level outcomes, and persisting results. The Python analysis utilities reproduce the aggregate statistics, tables, and figures used in the paper.

The released data is intended to support reproducibility, secondary analysis, and comparison of LLM performance on multidimensional data science coding tasks. Sensitive credentials and non-release data are not included.
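As a rough illustration of how attempt-level records could be aggregated into two of the metrics named above, the sketch below computes pass@1 and an overall success rate per model. It is not taken from scripts/analysis.py: the record fields (`model`, `problem_id`, `attempt`, `passed`), the `summarize` helper, and the sample rows are hypothetical, and the metric definitions used here (pass@1 as the share of problems solved on the first attempt, success rate as the share solved within three attempts) are assumptions that may differ from the paper's.

```python
from collections import defaultdict

# Hypothetical attempt records. The field names are assumptions for
# illustration and do not reflect the archive's actual JSON schema.
attempts = [
    {"model": "model-a", "problem_id": "p1", "attempt": 1, "passed": True},
    {"model": "model-a", "problem_id": "p2", "attempt": 1, "passed": False},
    {"model": "model-a", "problem_id": "p2", "attempt": 2, "passed": True},
    {"model": "model-b", "problem_id": "p1", "attempt": 1, "passed": False},
    {"model": "model-b", "problem_id": "p1", "attempt": 2, "passed": False},
    {"model": "model-b", "problem_id": "p1", "attempt": 3, "passed": False},
    {"model": "model-b", "problem_id": "p2", "attempt": 1, "passed": True},
]


def summarize(records, max_attempts=3):
    """Return {model: {"pass_at_1": float, "success_rate": float}}."""
    problems_by_model = defaultdict(set)   # model -> problem ids seen
    passed_attempts = defaultdict(set)     # (model, problem) -> passing attempt numbers
    for r in records:
        problems_by_model[r["model"]].add(r["problem_id"])
        if r["passed"] and r["attempt"] <= max_attempts:
            passed_attempts[(r["model"], r["problem_id"])].add(r["attempt"])

    summary = {}
    for model, problems in problems_by_model.items():
        n = len(problems)
        # Solved on the first attempt (assumed definition of pass@1).
        first = sum(1 in passed_attempts[(model, p)] for p in problems)
        # Solved on any attempt up to max_attempts (assumed success rate).
        solved = sum(bool(passed_attempts[(model, p)]) for p in problems)
        summary[model] = {"pass_at_1": first / n, "success_rate": solved / n}
    return summary


summary = summarize(attempts)
for model, metrics in sorted(summary.items()):
    print(model, metrics)
```

The same grouping by model/problem pair could be extended to split results by task category or difficulty level, provided those fields are joined in from the task metadata.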
Provider:
Zenodo
Created:
2026-05-05