agentscope-ai/OpenJudge

Name: agentscope-ai/OpenJudge
Creator: agentscope-ai
Published: 2026-03-04 12:41:38
License: Apache 2.0

Source: Hugging Face. Updated: 2026-03-04. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/agentscope-ai/OpenJudge


Description:

---
language:
- en
- zh
license: apache-2.0
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- reinforcement-learning
- question-answering
- image-to-text
tags:
- reward-modeling
- evaluation
- grading
- preference-learning
- agent-evaluation
- multimodal
pretty_name: OpenJudge Benchmark Dataset
---

# OpenJudge Benchmark Dataset

Benchmark dataset for evaluating graders across text, multimodal, and agent scenarios. This dataset supports the [OpenJudge framework](https://github.com/modelscope/OpenJudge) with labeled preference pairs for quality-assured grader development.

## Dataset Statistics

### Evaluation Benchmarks

| Category | Task | Files | Samples |
|:---------|:-----|------:|--------:|
| **🤖 Agent** | | **12** | **166** |
| | action | 1 | 8 |
| | memory | 3 | 47 |
| | plan | 1 | 7 |
| | reflection | 3 | 52 |
| | tool | 4 | 52 |
| **🖼️ Multimodal** | | **4** | **80** |
| | image_coherence | 1 | 20 |
| | image_editing | 1 | 20 |
| | image_helpfulness | 1 | 20 |
| | text_to_image | 1 | 20 |
| **📝 Text** | | **5** | **130** |
| | correctness | 1 | 50 |
| | hallucination | 1 | 20 |
| | harmlessness | 1 | 20 |
| | instruction_following | 1 | 20 |
| | relevance | 1 | 20 |
| **Eval Total** | | **21** | **376** |

### Training Data

| Category | Split | Samples | Format |
|:---------|:------|--------:|:-------|
| **🎯 Bradley-Terry** | train | 1,000 | Parquet |
| | test | 763 | Parquet |
| **📚 SFT** | train | 1,000 | Parquet |
| | test | 763 | Parquet |
| **🔄 GRPO Pointwise** | train | 2,000 | Parquet |
| | val | 1,526 | Parquet |
| **🔄 GRPO Pairwise** | train | 1,000 | Parquet |
| | val | 763 | Parquet |
| **Train Total** | | **8,815** | |

## Dataset Structure

```
# Evaluation Benchmarks
text/{task_type}/{task_type}_eval_v1.json
multimodal/{task_type}/{task_type}_eval_v1.json
agent/{task_category}/{task_name}.json

# Training Data
train_rm/bradley_terry/{train,test}.parquet
train_rm/sft/{train,test}.parquet
train_rm/grpo/pointwise/{train,val}.parquet
train_rm/grpo/pairwise/{train,val}.parquet
```

## Data Format

Each JSON file contains an array of evaluation cases:

```json
{
  "id": "unique_identifier",
  "dataset": "source_dataset_name",
  "task_type": "evaluation_task_type",
  "input": {
    "query": "user query or null",
    "context": "additional context or structured data",
    "reference": "ground truth or reference response",
    "media_contents": [],
    "metadata": {}
  },
  "chosen": {
    "response": {
      "content": "preferred response",
      "model": "model_name",
      "model_type": "text|multimodal",
      "metadata": {}
    }
  },
  "rejected": {
    "response": {
      "content": "dis-preferred response",
      "model": "model_name",
      "model_type": "text|multimodal",
      "metadata": {}
    }
  },
  "human_ranking": [0, 1],
  "metadata": {
    "source": "source_information"
  }
}
```

**Key Fields:**

- `input`: Query, context, reference answer
- `chosen`/`rejected`: Preference pair responses (may be null for agent data)
- `human_ranking`: Preference ranking [chosen_idx, rejected_idx]
- `metadata`: Task-specific metadata

**Notes:**

- **Text/Multimodal**: Standard preference pairs with `chosen` and `rejected`
- **Agent**: Context contains trajectory data; either `chosen` or `rejected` may be null

### Training Data Format

**Bradley-Terry** (`train_rm/bradley_terry/*.parquet`):

| Column | Description |
|:-------|:------------|
| `chosen` | Preferred response |
| `rejected` | Dis-preferred response |

**SFT** (`train_rm/sft/*.parquet`):

| Column | Description |
|:-------|:------------|
| `messages` | Conversation messages for supervised fine-tuning |
| `data_source` | Source dataset identifier |
| `extra_info` | Additional metadata |

**GRPO Pointwise** (`train_rm/grpo/pointwise/*.parquet`):

| Column | Description |
|:-------|:------------|
| `input` | Message list `[{"role": "user", "content": "..."}]` |
| `output` | Response with label `[{"answer": {..., "label": {"helpfulness": 0-4}}}]` |
| `source` | Data source (rewardbench2) |

**GRPO Pairwise** (`train_rm/grpo/pairwise/*.parquet`):

| Column | Description |
|:-------|:------------|
| `input` | Message list `[{"role": "user", "content": "..."}]` |
| `output` | Two responses with preference label `[{"answer": {..., "label": {"preference": "A/B"}}}]` |
| `source` | Data source (rewardbench2) |

## Usage

```python
from datasets import load_dataset

# Load entire dataset
dataset = load_dataset("agentscope-ai/OpenJudge")

# Load evaluation benchmarks
text_data = load_dataset("agentscope-ai/OpenJudge", data_files="text/**/*.json")
agent_data = load_dataset("agentscope-ai/OpenJudge", data_files="agent/**/*.json")
multimodal_data = load_dataset("agentscope-ai/OpenJudge", data_files="multimodal/**/*.json")

# Load training data
bt_train = load_dataset("agentscope-ai/OpenJudge", data_files="train_rm/bradley_terry/train.parquet")
sft_train = load_dataset("agentscope-ai/OpenJudge", data_files="train_rm/sft/train.parquet")

# Load GRPO training data
grpo_pointwise = load_dataset("agentscope-ai/OpenJudge", data_files="train_rm/grpo/pointwise/train.parquet")
grpo_pairwise = load_dataset("agentscope-ai/OpenJudge", data_files="train_rm/grpo/pairwise/train.parquet")
```

## Task Categories

**Text:** Correctness, Hallucination, Harmlessness, Instruction Following, Relevance

**Multimodal:** Image Coherence, Image Editing, Image Helpfulness, Text-to-Image

**Agent:** Action Alignment, Memory (Accuracy/Retrieval/Preservation), Plan Feasibility, Reflection (Accuracy/Awareness/Understanding), Tool Use (Selection/Parameters/Success)

## Reproduce Evaluation Results

Each task directory contains an `evaluate_*.py` script that allows you to reproduce the accuracy results using the corresponding OpenJudge grader.
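
As a rough illustration of what an accuracy figure over preference pairs can measure, the sketch below scores both sides of each case with a stand-in grader and counts a case as correct when `chosen` outscores `rejected`. The names `response_text`, `pairwise_accuracy`, and `toy_grader` are hypothetical, and skipping cases where one side is null is an assumption made for simplicity; none of this is taken from the OpenJudge graders or the `evaluate_*.py` scripts, which may compute accuracy differently.

```python
from typing import Callable, Iterable, Optional


def response_text(side: Optional[dict]) -> Optional[str]:
    """Return the response content of a `chosen`/`rejected` entry, or None if absent."""
    if not side:
        return None
    return (side.get("response") or {}).get("content")


def pairwise_accuracy(cases: Iterable[dict], grader: Callable[[str, str], float]) -> float:
    """Fraction of complete pairs where the grader scores `chosen` above `rejected`.

    Cases with a null `chosen` or `rejected` (possible for agent data) are skipped;
    this is a simplifying assumption, not necessarily what the official scripts do.
    """
    correct = total = 0
    for case in cases:
        query = (case.get("input") or {}).get("query") or ""
        chosen = response_text(case.get("chosen"))
        rejected = response_text(case.get("rejected"))
        if chosen is None or rejected is None:
            continue
        total += 1
        if grader(query, chosen) > grader(query, rejected):
            correct += 1
    return correct / total if total else 0.0


def toy_grader(query: str, response: str) -> float:
    """Hypothetical stand-in grader: counts response words that also occur in the query."""
    words = set(query.lower().split())
    return float(sum(1 for w in response.lower().split() if w in words))


# Two hand-written cases in the documented format (illustrative data only).
sample_cases = [
    {
        "input": {"query": "capital of France"},
        "chosen": {"response": {"content": "The capital of France is Paris."}},
        "rejected": {"response": {"content": "I am not sure."}},
    },
    {
        "input": {"query": None, "context": "agent trajectory"},
        "chosen": None,  # agent cases may have only one side
        "rejected": {"response": {"content": "wrong tool call"}},
    },
]

print(pairwise_accuracy(sample_cases, toy_grader))
```

In practice the `grader` argument would wrap a call to an LLM-based grader; passing it in as a plain callable keeps the accuracy logic independent of any particular model or API.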

### Run Single Evaluation

```bash
# Set environment variables
export OPENAI_API_KEY=your_api_key
export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

# Run specific grader evaluation
cd text/correctness
python evaluate_correctness.py --model qwen-max

# Run with verbose output
python evaluate_correctness.py --model qwen-max --verbose
```

### Run All Evaluations (Parallel)

Use `run_all_evaluations.py` to evaluate all graders concurrently:

```bash
# Run all evaluations
python run_all_evaluations.py

# Run specific category
python run_all_evaluations.py --category text
python run_all_evaluations.py --category agent

# Custom models and concurrency
python run_all_evaluations.py --text-model qwen-max --agent-model qwen3-max --workers 5

# Save results to JSON
python run_all_evaluations.py --output results.json
```

### Expected Accuracy by Grader

| Category | Grader | Model | Expected Accuracy |
|:---------|:-------|:------|------------------:|
| Text | CorrectnessGrader | qwen-max | 96-100% |
| Text | HallucinationGrader | qwen-plus | 70-75% |
| Text | HarmfulnessGrader | qwen-plus | 100% |
| Text | InstructionFollowingGrader | qwen-max | 75-80% |
| Text | RelevanceGrader | qwen-plus | 100% |
| Multimodal | ImageCoherenceGrader | qwen-vl-max | 75% |
| Multimodal | ImageHelpfulnessGrader | qwen-vl-max | 80% |
| Multimodal | TextToImageGrader | qwen-vl-max | 75% |
| Agent | ActionAlignmentGrader | qwen3-max | 88% |
| Agent | PlanFeasibilityGrader | qwen3-max | 86% |
| Agent | ToolGraders | qwen3-max | 75-95% |
| Agent | MemoryGraders | qwen3-max | 76-100% |
| Agent | ReflectionGraders | qwen3-max | 74-100% |

## License

Apache 2.0

## Citation

```bibtex
@software{openjudge2025,
  title = {OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards},
  author = {The OpenJudge Team},
  url = {https://github.com/modelscope/OpenJudge},
  year = {2025}
}
```

## Links

- GitHub: [modelscope/OpenJudge](https://github.com/modelscope/OpenJudge)
- Documentation: [modelscope.github.io/OpenJudge](https://modelscope.github.io/OpenJudge/)
- PyPI: [py-openjudge](https://pypi.org/project/py-openjudge/)

Provider:

agentscope-ai
