judges-verdict
Source: ModelScope community (魔搭社区) · Updated 2025-12-04 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/nv-community/judges-verdict
Resource description:
## Dataset Description:
Judge's Verdict is a dataset for evaluating an LLM's ability to judge answer quality against a reference.
It takes queries and their ground-truth answers from reduced versions of the TechQA, HotpotQA, SQuAD2.0, and Enterprise-Knowledge RAG (EKRAG) datasets. It adds model-generated answers for the queries, along with human annotations that compare each generated answer against its ground truth.
This dataset is ready for commercial/non-commercial use.
## Dataset Owner(s):
NVIDIA Corporation
## Dataset Creation Date:
September 24, 2025
## License/Terms of Use:
GOVERNING TERMS: This dataset is governed by the Creative Commons Attribution-ShareAlike 4.0 International License ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)). ADDITIONAL INFORMATION: Apache 2.0 License (https://choosealicense.com/licenses/apache-2.0/)
This dataset contains synthetic data created using meta/llama-3.1-70b-instruct and meta/llama-3.1-8b-instruct. If this dataset is used to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, such AI model may be subject to redistribution and use requirements in the Llama 3.1 Community License Agreement (https://www.llama.com/llama3_1/license/).
## Intended Usage:
This dataset is particularly well suited to benchmarking how well LLMs judge answer quality against a reference.
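One way to run such a benchmark is to compare a judge model's labels with the human annotations. The sketch below is illustrative only: it assumes each record's human annotations can be reduced to a list of discrete labels and that the judge emits one label per record, neither of which this card specifies. The function names are hypothetical, and Cohen's kappa is a metric chosen here for illustration, not one prescribed by the dataset.

```python
from collections import Counter


def majority_label(labels):
    """Most common human label; ties resolve to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def percent_agreement(a, b):
    """Fraction of positions where two equal-length label lists match."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label lists."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and this sketch chooses to report 1.0.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: three human labels per record, one judge label each.
human_annotations = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
judge_labels = [1, 0, 0, 0]

human_majority = [majority_label(labels) for labels in human_annotations]
agreement = percent_agreement(judge_labels, human_majority)
kappa = cohen_kappa(judge_labels, human_majority)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Raw agreement is easy to read but rewards a judge that always predicts the dominant label; kappa corrects for that, which is why both are reported in this sketch.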
## Dataset Characterization
| Aspect | Details |
|-----------------------|----------------|
| Data Collection Method | Hybrid: Automated, Human |
| Labeling Method | Human |
## Dataset Format
The dataset is composed of .json files.
## Dataset Quantification
| Metric | Value |
|--------------------|--------------|
| Record Count | 1994 |
| Feature Count | 6 |
| Features | ['item_name', 'dataset_name', 'question', 'gt_answer', 'gen_answer', 'annotations'] |
| Data Storage Size | 2.34 MB |
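As a quick sanity check after downloading, the records can be loaded and compared against the feature list in the table above. This is a minimal sketch that assumes each .json file holds a top-level array of record objects; the card does not state the file layout, and the file path and helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

# The six feature names listed in the Dataset Quantification table.
EXPECTED_FEATURES = {
    "item_name", "dataset_name", "question",
    "gt_answer", "gen_answer", "annotations",
}


def load_records(path):
    """Load one .json file; assumes the top level is a list of record dicts."""
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return records


def find_malformed(records):
    """Return indices of records whose keys differ from the expected features."""
    return [i for i, r in enumerate(records) if set(r) != EXPECTED_FEATURES]


def count_by_source(records):
    """Count records per source dataset, using the dataset_name feature."""
    return Counter(r["dataset_name"] for r in records)
```

A typical use would be `records = load_records("path/to/file.json")`, then confirming that `find_malformed(records)` is empty and that `len(records)` matches the record count reported above.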
## Reference(s):
- TechQA: https://github.com/ibm/techqa
- HotpotQA: https://huggingface.co/datasets/hotpotqa/hotpot_qa
- SQuAD2.0: https://rajpurkar.github.io/SQuAD-explorer/
- Enterprise-Knowledge RAG (EKRAG): The Enterprise RAG Benchmark dataset contains 3,629 questions designed to assess RAG system performance, drawn from 5,000 publicly available corporate documents in PDF, HTML, .docx, and .txt formats, including web pages, earnings transcripts, and SEC reports. Documents are categorized into Corporate News and Blogs, Corporate Technical Blogs, leadership communications, and SEC 10-K/8-K filings. https://aclanthology.org/2025.knowledgenlp-1.13/
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Provider:
maas
Created:
2025-10-09



