
large-traversaal/DSBC-Queries-V2.0

Hugging Face · Updated 2026-02-24 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/large-traversaal/DSBC-Queries-V2.0
Resource description:
---
dataset_info:
  features:
  - name: Index
    dtype: int64
  - name: Dataset
    dtype: string
  - name: Tasks
    dtype: string
  - name: Query_Raw
    dtype: string
  - name: Query_Clean
    dtype: string
  - name: Response_Expected
    dtype: string
  - name: Solution_Code
    dtype: string
  splits:
  - name: train
    num_bytes: 226484
    num_examples: 217
  download_size: 96604
  dataset_size: 226484
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# DSBC : Data Science task Benchmarking with Context Engineering

This repository evaluates Large Language Models on the [DSBC (Data Science Benchmarking)](https://huggingface.co/datasets/large-traversaal/DSBC-Queries-V2.0) dataset. It systematically tests LLM capabilities in data science code generation by generating responses to complex data science questions and evaluating them using LLM-based judges.

GitHub repository for evaluation: https://github.com/traversaal-ai/DSBC-Data-Science-Task-Evaluation/

## Evaluation Results

### Answer Generation Settings

- **Temperature**: 0.3 (used for all model generations)

### Model Performance Scores

The following scores were obtained using LLM-as-Judge evaluation methodology:

| Model | Score |
|-------|-------|
| claude-sonnet-4 | 0.751 |
| gemini-2.5-pro | 0.608 |
| gpt-5.1-codex | 0.728 |
| gpt-o4-mini | 0.618 |
| glm-4.5 | 0.673 |

![model_accuracy_comparison (2)](https://cdn-uploads.huggingface.co/production/uploads/671a73c557cee297452f8eba/Vu2EvJMOlK8HQjCRBdry2.png)

### Evaluation Settings

- **Judge Model**: gemini-flash-2.0 from Vertex AI
- **Judge Temperature**: 0.2 (default)

## Citation

If you find Curator Evals useful, do not forget to cite us!

```
@misc{kadiyala2025dsbcdatascience,
  title={DSBC : Data Science task Benchmarking with Context engineering},
  author={Ram Mohan Rao Kadiyala and Siddhant Gupta and Jebish Purbey and Giulio Martini and Ali Shafique and Suman Debnath and Hamza Farooq},
  year={2025},
  eprint={2507.23336},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.23336},
}
```
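
## Example: checking a record against the declared schema

The `dataset_info` metadata lists seven features in the `train` split. The block below is a minimal sketch of how a loaded row could be checked against that schema. It assumes the Hugging Face `datasets` library is installed for the optional download step. The helper names `schema_problems` and `load_train_split`, and the sample record, are hypothetical and are not part of the DSBC repository.

```python
# Column names and dtypes copied from the dataset_info metadata
# (int64 is mapped to Python int, string to Python str).
EXPECTED_FEATURES = {
    "Index": int,
    "Dataset": str,
    "Tasks": str,
    "Query_Raw": str,
    "Query_Clean": str,
    "Response_Expected": str,
    "Solution_Code": str,
}


def schema_problems(record):
    """Return a list of mismatches between one record and the schema."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    for name in record:
        if name not in EXPECTED_FEATURES:
            problems.append(f"unexpected column: {name}")
    return problems


def load_train_split():
    """Download the train split; needs network access and the datasets package."""
    # Imported lazily so the schema check above runs without the library.
    from datasets import load_dataset

    return load_dataset("large-traversaal/DSBC-Queries-V2.0", split="train")


# Hypothetical record invented for illustration; NOT taken from the dataset.
sample = {
    "Index": 1,
    "Dataset": "example_dataset",
    "Tasks": "Statistics",
    "Query_Raw": "What is the mean of column x?",
    "Query_Clean": "What is the mean of column x?",
    "Response_Expected": "The mean is 3.5",
    "Solution_Code": "print(df['x'].mean())",
}

print(schema_problems(sample))
```

With network access, `schema_problems(load_train_split()[0])` would run the same check on the first real row.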

Provider:
large-traversaal