large-traversaal/DSBC-Queries-V2.0
Source: Hugging Face. Updated 2026-02-24; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/large-traversaal/DSBC-Queries-V2.0
Dataset card:
---
dataset_info:
  features:
  - name: Index
    dtype: int64
  - name: Dataset
    dtype: string
  - name: Tasks
    dtype: string
  - name: Query_Raw
    dtype: string
  - name: Query_Clean
    dtype: string
  - name: Response_Expected
    dtype: string
  - name: Solution_Code
    dtype: string
  splits:
  - name: train
    num_bytes: 226484
    num_examples: 217
  download_size: 96604
  dataset_size: 226484
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# DSBC : Data Science task Benchmarking with Context Engineering
This repository evaluates Large Language Models on the [DSBC (Data Science Benchmarking)](https://huggingface.co/datasets/large-traversaal/DSBC-Queries-V2.0) dataset. It systematically tests LLM capabilities in data science code generation by generating responses to complex data science questions and evaluating them using LLM-based judges.
GitHub repository for evaluation: https://github.com/traversaal-ai/DSBC-Data-Science-Task-Evaluation/
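To inspect the queries directly, the dataset can be pulled with the Hugging Face `datasets` library. The following is a minimal sketch, not part of the evaluation repository: the column names and the `train` split come from the card metadata, while the helper names `missing_columns` and `load_dsbc` are hypothetical and introduced only for illustration.

```python
from typing import Iterable, List

# Column names as listed in the dataset card metadata.
EXPECTED_COLUMNS = [
    "Index",
    "Dataset",
    "Tasks",
    "Query_Raw",
    "Query_Clean",
    "Response_Expected",
    "Solution_Code",
]


def missing_columns(column_names: Iterable[str]) -> List[str]:
    """Return the expected columns that are absent from `column_names`."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_dsbc():
    """Download the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the schema helper above works without the dependency.
    from datasets import load_dataset  # pip install datasets

    return load_dataset("large-traversaal/DSBC-Queries-V2.0", split="train")
```

Calling `load_dsbc()` returns a `Dataset` object; passing its `column_names` to `missing_columns` is a quick sanity check that the downloaded schema still matches the card, which lists 217 examples in the `train` split.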
## Evaluation Results
### Answer Generation Settings
- **Temperature**: 0.3 (used for all model generations)
### Model Performance Scores
The following scores were obtained with an LLM-as-Judge evaluation:
| Model | Score |
|-------|--------|
| claude-sonnet-4 | 0.751 |
| gemini-2.5-pro | 0.608 |
| gpt-5.1-codex | 0.728 |
| gpt-o4-mini | 0.618 |
| glm-4.5 | 0.673 |

### Evaluation Settings
- **Judge Model**: gemini-flash-2.0 from Vertex AI
- **Judge Temperature**: 0.2 (default)
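The generation and judge settings above can be collected in a small configuration object. The sketch below is illustrative only: the temperatures and the judge model name are taken from this card, but `EvalSettings` and `aggregate_score` are hypothetical names, and the aggregation rule (score as the fraction of responses the judge marks correct) is an assumption, since the card does not state how per-response verdicts are combined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalSettings:
    """Settings reported on the dataset card."""

    generation_temperature: float = 0.3   # used for all model generations
    judge_model: str = "gemini-flash-2.0"  # name as written on the card
    judge_temperature: float = 0.2


def aggregate_score(verdicts: Sequence[bool]) -> float:
    """ASSUMPTION: the score is the fraction of responses judged correct.

    The card does not document the aggregation, so treat this as a guess.
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return round(sum(verdicts) / len(verdicts), 3)
```

With made-up verdicts such as `[True, True, False, True]`, `aggregate_score` returns `0.75`; the real pipeline lives in the evaluation repository and may score responses differently.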
## Citation
If you find DSBC useful, please cite us:
```
@misc{kadiyala2025dsbcdatascience,
title={DSBC : Data Science task Benchmarking with Context engineering},
author={Ram Mohan Rao Kadiyala and Siddhant Gupta and Jebish Purbey and Giulio Martini and Ali Shafique and Suman Debnath and Hamza Farooq},
year={2025},
eprint={2507.23336},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.23336},
}
```
Provider:
large-traversaal



