ProfBench

Name: ProfBench
Creator: maas
Published: 2025-12-04 16:54:49
License: No description provided

ModelScope community (魔搭社区): updated 2025-12-04, indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/nv-community/ProfBench


Overview:

## Dataset Description:

[Leaderboard](https://huggingface.co/spaces/nvidia/ProfBench) | [Blog](https://huggingface.co/blog/nvidia/profbench) | [Paper](https://arxiv.org/abs/2510.18941) | [Data](https://huggingface.co/datasets/nvidia/ProfBench) | [Code](https://github.com/NVlabs/ProfBench) | [Nemo Evaluator SDK](https://github.com/NVIDIA-NeMo/Evaluator)

[![Watch the video](https://img.youtube.com/vi/GEPvdq3C54s/maxresdefault.jpg)](https://www.youtube.com/watch?v=GEPvdq3C54s)

More than 3000 rubric criteria across 40 human-annotated tasks, each calling for a report that addresses a professional task in the PhD STEM (Chemistry, Physics) or Professional Services (Financial Services, Management Consulting) domains.

This dataset is ready for commercial/non-commercial use.

## Dataset Owner(s):

NVIDIA Corporation

## Dataset Creation Date:

9/24/2025

## License/Terms of Use:

NVIDIA Evaluation Dataset License

## Intended Usage:

Researchers and developers seeking to evaluate LLMs on Professional Tasks. We recommend using ProfBench as part of the [Nemo Evaluator SDK](https://github.com/NVIDIA-NeMo/Evaluator), which supports a unified interface for evaluation across tens of benchmarks.

## Dataset Characterization:

**Data Collection Method**
* [Hybrid: Human, Synthetic, Automated]

**Labeling Method**
* [Human]

## Dataset Format:

Text.

## Dataset Quantification:

40 records

Each record contains the following fields:

- ID: Unique identifier for each sample
- Domain: Chemistry PhD / Physics PhD / Finance MBA / Consulting MBA
- Prompt: Instruction for the Large Language Model (LLM)
- Rubrics: 15-59 unique criteria used to assess the final model output
- Model Responses: 3 responses from OpenAI o3 / xAI Grok4 / DeepSeek R1-0528

Some portions of this dataset were created with Grok.

Total Storage: 1 MB.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Citation:

If you found ProfBench helpful, please consider citing the below:

```
@misc{wang2025profbenchmultidomainrubricsrequiring,
      title={ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge},
      author={Zhilin Wang and Jaehun Jung and Ximing Lu and Shizhe Diao and Ellie Evans and Jiaqi Zeng and Pavlo Molchanov and Yejin Choi and Jan Kautz and Yi Dong},
      year={2025},
      eprint={2510.18941},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.18941},
}
```
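## Example: Checking a Record Against the Card

The record layout under Dataset Quantification can be sanity-checked with a few lines of Python. The following is only a minimal sketch: the attribute names (`id`, `domain`, `prompt`, `rubrics`, `model_responses`) are hypothetical stand-ins for the five fields listed in the card, rubric criteria are assumed to be plain strings, and the published files may use different keys and richer rubric objects. The example record is placeholder data, not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List

# The four domain labels listed in the dataset card.
DOMAINS = {"Chemistry PhD", "Physics PhD", "Finance MBA", "Consulting MBA"}


@dataclass
class ProfBenchRecord:
    # Hypothetical attribute names; the real dataset keys may differ.
    id: str
    domain: str
    prompt: str
    rubrics: List[str]          # assumed to be plain strings for this sketch
    model_responses: List[str]  # the card lists 3 responses per record


def validate(record: ProfBenchRecord) -> List[str]:
    """Return a list of problems; an empty list means the record matches the card."""
    problems = []
    if record.domain not in DOMAINS:
        problems.append(f"unknown domain: {record.domain!r}")
    if not record.prompt.strip():
        problems.append("empty prompt")
    if not 15 <= len(record.rubrics) <= 59:
        problems.append(f"rubric count {len(record.rubrics)} outside 15-59")
    if len(set(record.rubrics)) != len(record.rubrics):
        problems.append("duplicate rubric criteria")
    if len(record.model_responses) != 3:
        problems.append(
            f"expected 3 model responses, got {len(record.model_responses)}"
        )
    return problems


# Placeholder record for illustration only.
example = ProfBenchRecord(
    id="demo-0",
    domain="Physics PhD",
    prompt="Placeholder prompt.",
    rubrics=[f"Placeholder criterion {i}" for i in range(20)],
    model_responses=["response A", "response B", "response C"],
)
print(validate(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report every mismatch in a record at once.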


Provider: maas

Created: 2025-10-23
