RJTR001/jee-neet-benchmark
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/RJTR001/jee-neet-benchmark
Resource description:
---
license: mit
language:
- en
task_categories:
- visual-question-answering
- image-text-to-text
- question-answering
pretty_name: Indian Competitive Exams (JEE/NEET) LLM Benchmark
size_categories:
- n<1K
tags:
- education
- science
- india
- competitive-exams
- llm-benchmark
configs:
- config_name: default
  data_files:
  - split: test
    path: "images/**"
  drop_labels: true
dataset_info:
  features:
  - name: image
    dtype: image
  - name: question_id
    dtype: string
  - name: exam_name
    dtype: string
  - name: exam_year
    dtype: int32
  - name: subject
    dtype: string
  - name: question_type
    dtype: string
  - name: correct_answer
    dtype: string
  - name: paper_id
    dtype: int64
  splits:
  - name: test
    num_examples: 578
---
# JEE/NEET LLM Benchmark Dataset
[License: MIT](https://opensource.org/licenses/MIT)
## Dataset Description
This repository contains a benchmark dataset designed for evaluating the capabilities of Large Language Models (LLMs) on questions from major Indian competitive examinations:
* **JEE (Main & Advanced):** Joint Entrance Examination for engineering.
* **NEET:** National Eligibility cum Entrance Test for medical fields.
The questions are presented as images (`.png`), exactly as they appear in the original papers. Metadata links each image to its exam details (name, year, subject, question type) and its correct answer(s). The benchmark framework supports Single Correct MCQs, Multiple Correct MCQs (with partial marking for JEE Advanced), and Integer type questions.
**Current Data:**
* **NEET 2024** (Code T3): 200 questions across Physics, Chemistry, Botany, and Zoology
* **NEET 2025** (Code 45): 180 questions across Physics, Chemistry, Botany, and Zoology
* **JEE Advanced 2024** (Paper 1 & 2): 102 questions across Physics, Chemistry, and Mathematics
* **JEE Advanced 2025** (Paper 1 & 2): 96 questions across Physics, Chemistry, and Mathematics
* **Total:** 578 questions with comprehensive metadata
## Key Features
* **🖼️ Multimodal Reasoning:** Uses images of questions directly, testing the multimodal reasoning capability of models
* **📊 Exam-Specific Scoring:** Implements authentic scoring rules for different exams and question types, including partial marking for JEE Advanced
* **🔄 Robust API Handling:** Built-in retry mechanism and re-prompting for failed API calls or parsing errors
* **🎯 Flexible Filtering:** Filter by exam name, year, or specific question IDs for targeted evaluation
* **📈 Comprehensive Results:** Generates detailed JSON and human-readable Markdown summaries with section-wise breakdowns
* **🔧 Easy Configuration:** Simple YAML-based configuration for models and parameters
## Leaderboard
Generate an up-to-date leaderboard from your local results:
```bash
uv run python scripts/generate_leaderboard.py
```
See `scripts/generate_leaderboard.py --help` for options including `--min-questions` and `--output`.
## How to Use
### Using `datasets` Library
The dataset is hosted on the Hugging Face Hub and can be loaded directly:
```python
from datasets import load_dataset
import json
# Load the evaluation split
dataset = load_dataset("Reja1/jee-neet-benchmark", split='test')
# Example: Access the first question
example = dataset[0]
image = example["image"]
question_id = example["question_id"]
subject = example["subject"]
correct_answers = json.loads(example["correct_answer"]) # Parse JSON string
print(f"Question ID: {question_id}")
print(f"Subject: {subject}")
print(f"Correct Answer(s): {correct_answers}")
# Display the image (requires Pillow)
# image.show()
```
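To evaluate only part of the benchmark, the metadata fields can drive a simple filter. The following is a minimal sketch rather than part of the benchmark scripts: `matches` is a hypothetical helper, and the sample records only mimic the dataset's `question_id`, `exam_name`, and `exam_year` fields.

```python
def matches(example, exam_name=None, exam_year=None):
    """Return True if a record matches the requested exam name and year.

    Passing None for a field means "no constraint" on that field.
    """
    if exam_name is not None and example["exam_name"] != exam_name:
        return False
    if exam_year is not None and example["exam_year"] != exam_year:
        return False
    return True


# Illustrative in-memory records that mimic the dataset's metadata fields.
sample = [
    {"question_id": "N24T3001", "exam_name": "NEET", "exam_year": 2024},
    {"question_id": "JA24P1M01", "exam_name": "JEE_ADVANCED", "exam_year": 2024},
]

neet_2024 = [ex for ex in sample if matches(ex, "NEET", 2024)]
print([ex["question_id"] for ex in neet_2024])

# With the dataset loaded above, the same predicate can be passed to
# `Dataset.filter`:
# subset = dataset.filter(lambda ex: matches(ex, "NEET", 2024))
```

Keeping the predicate separate from the dataset object makes it easy to check on a handful of records before running it over all 578 questions.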
### Manual Usage (Benchmark Scripts)
This repository contains scripts to run the benchmark evaluation directly:
1. **Clone the repository:**
```bash
git clone https://huggingface.co/datasets/Reja1/jee-neet-benchmark
cd jee-neet-benchmark
# Ensure Git LFS is installed and pull large files
git lfs pull
```
2. **Install dependencies:**
```bash
uv sync
```
3. **Configure API Key:**
* Create a file named `.env` in the root directory of the project.
* Add your OpenRouter API key to this file:
```dotenv
OPENROUTER_API_KEY=your_actual_openrouter_api_key_here
```
* **Important:** The `.gitignore` file is already configured to prevent committing the `.env` file. Never commit your API keys directly.
4. **Configure Models:**
* Edit the `configs/benchmark_config.yaml` file.
* Modify the `openrouter_models` list to include the specific model identifiers you want to evaluate:
```yaml
openrouter_models:
- "google/gemini-2.5-pro-preview-03-25"
- "anthropic/claude-sonnet-4"
- "openai/o3"
```
* Ensure these models support vision input on OpenRouter.
* You can also adjust other parameters like `max_tokens` and `request_timeout` if needed.
5. **Run the benchmark:**
**Basic usage (run one model on all questions):**
```bash
uv run python src/benchmark_runner.py --config configs/benchmark_config.yaml --model "google/gemini-2.5-pro-preview-03-25"
```
**Filter by exam and year:**
```bash
# Run only NEET 2024 questions
uv run python src/benchmark_runner.py --config configs/benchmark_config.yaml --model "openai/o3" --exam_name NEET --exam_year 2024
# Run only JEE Advanced 2025 questions
uv run python src/benchmark_runner.py --config configs/benchmark_config.yaml --model "anthropic/claude-sonnet-4" --exam_name JEE_ADVANCED --exam_year 2025
```
**Run specific questions:**
```bash
# Run specific question IDs
uv run python src/benchmark_runner.py --config configs/benchmark_config.yaml --model "google/gemini-2.5-pro-preview-03-25" --question_ids "N24T3001,N24T3002,JA24P1M01"
```
**Resume an interrupted run:**
```bash
# Resume from an existing results directory (skips already-completed questions)
uv run python src/benchmark_runner.py --model "google/gemini-2.5-pro-preview-03-25" --resume results/google_gemini-2.5-pro-preview-03-25_NEET_2024_20250524_141230
```
**Custom output directory:**
```bash
uv run python src/benchmark_runner.py --config configs/benchmark_config.yaml --model "openai/gpt-4o" --output_dir my_custom_results
```
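**Available options:**
- `--config`: Path to the YAML configuration file (e.g., `configs/benchmark_config.yaml`)
- `--model`: OpenRouter model identifier to evaluate (e.g., "openai/o3")
- `--exam_name`: Choose from `NEET`, `JEE_MAIN`, `JEE_ADVANCED`, or `all` (default)
- `--exam_year`: Choose from available years (`2024`, `2025`, etc.) or `all` (default)
- `--question_ids`: Comma-separated list of specific question IDs to evaluate (e.g., "N24T3001,JA24P1M01")
- `--output_dir`: Directory in which to store results (e.g., `my_custom_results`)
- `--resume`: Path to an existing results directory to resume an interrupted run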
6. **Check Results:**
* Results for each model run will be saved in timestamped subdirectories within the `results/` folder.
* Each run's folder (e.g., `results/google_gemini-2.5-pro-preview-03-25_NEET_2024_20250524_141230/`) contains:
* **`predictions.jsonl`**: Raw API responses for each question including:
- Raw LLM responses
- API call success/failure information
- Parse success status and errors
* **`summary.jsonl`**: Per-question scored results including:
- Predicted answers and ground truth
- Evaluation status and marks awarded
* **`summary.md`**: Human-readable Markdown summary with:
- Overall exam scores
- Question type breakdown
- Section-wise breakdown (by subject)
- Detailed statistics on correct/incorrect/skipped questions
## Scoring System
The benchmark implements authentic scoring systems for each exam type:
### NEET Scoring
- **Single Correct MCQ**: +4 for correct, -1 for incorrect, 0 for skipped/API failure
### JEE Main Scoring
- **Single Correct MCQ**: +4 for correct, -1 for incorrect, 0 for skipped/API failure
- **Integer Type**: +4 for correct, 0 for incorrect, 0 for skipped/API failure
### JEE Advanced Scoring
- **Single Correct MCQ**: +3 for correct, -1 for incorrect, 0 for skipped/API failure
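- **Multiple Correct MCQ**: Partial marking system:
  - +4 if all correct options, and only those, are selected
  - +3 if all four options are correct and exactly three are selected, none of them incorrect
  - +2 if three or more options are correct and exactly two are selected, both correct
  - +1 if two or more options are correct and exactly one is selected, and it is correct
  - -2 if any incorrect option is selected
  - 0 for skipped/API failure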
- **Integer Type**: +4 for correct, 0 for incorrect, 0 for skipped/API failure
> **Note:** API failures and parse failures are scored as 0 (no penalty) since they do not represent a deliberate wrong choice.
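
The rules above can be condensed into one function. This is a sketch under stated assumptions, not the code in `src/evaluation.py`: the name `score_question`, the use of `None` for skipped or failed questions, and the exact string comparison for integer answers are all illustrative choices.

```python
def score_question(exam_name, question_type, predicted, correct):
    """Return the marks for one question under the rules listed above.

    predicted: list of answer strings, or None for a skip, API failure,
               or parse failure (assumed convention).
    correct:   list of correct answer strings.
    """
    if not predicted:
        return 0  # skipped / API failure / parse failure: no penalty

    predicted_set, correct_set = set(predicted), set(correct)

    if question_type == "MCQ_SINGLE_CORRECT":
        full_marks = 3 if exam_name == "JEE_ADVANCED" else 4
        # Any one of the listed answers counts as correct. Assumption:
        # selecting more than one option is treated as incorrect.
        if len(predicted_set) == 1 and predicted_set <= correct_set:
            return full_marks
        return -1

    if question_type == "INTEGER":
        # Assumption: answers are compared as exact strings.
        return 4 if predicted_set == correct_set else 0

    if question_type == "MCQ_MULTIPLE_CORRECT":
        if predicted_set == correct_set:
            return 4
        if predicted_set - correct_set:
            return -2  # at least one incorrect option selected
        # A proper, non-empty subset of the correct options earns one
        # mark per option selected (+1, +2, or +3), as in the list above.
        return len(predicted_set)

    raise ValueError(f"Unknown question type: {question_type}")


print(score_question("JEE_ADVANCED", "MCQ_MULTIPLE_CORRECT", ["A", "B"], ["A", "B", "C"]))
```

The partial-marking cases collapse to `len(predicted_set)` because each listed case awards exactly as many marks as there are correctly selected options.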
## Advanced Features
### Retry Mechanism
- Automatic retry for failed API calls (up to 3 attempts with exponential backoff)
- Retries on HTTP 429 (rate limit), 500, 502, 503, 504 status codes
- Separate retry pass for questions that failed initially
- Comprehensive error tracking and reporting
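The retry behaviour described above might look roughly like the following. It is a minimal sketch, not the implementation in `src/llm_interface.py`: `call_with_retry`, the `send` callable, and the one-second base delay are assumptions made for illustration.

```python
import time

# Status codes listed above as retryable.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is any zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) pair.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUS or attempt == max_attempts:
            return status, body
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))


# Usage with a fake sender that fails twice, then succeeds.
responses = iter([(429, "rate limited"), (503, "unavailable"), (200, "ok")])
delays = []  # record the delays instead of actually sleeping
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.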
### Resume Capability
- Resume interrupted benchmark runs with `--resume <results_dir>`
- Reads existing `summary.jsonl` to identify completed questions and skips them
- Appends new results to the same output files
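A resume step of this kind can be sketched as below. The helper names are hypothetical, and the sketch assumes each line of `summary.jsonl` is a JSON object with a `question_id` key; the actual record layout is not documented here.

```python
import json
from pathlib import Path


def completed_question_ids(results_dir):
    """Collect the question IDs already recorded in `summary.jsonl`."""
    summary_path = Path(results_dir) / "summary.jsonl"
    if not summary_path.exists():
        return set()
    done = set()
    with summary_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: every record carries a "question_id" key.
            done.add(record["question_id"])
    return done


def pending(question_ids, results_dir):
    """Return the question IDs that still need to be evaluated, in order."""
    done = completed_question_ids(results_dir)
    return [qid for qid in question_ids if qid not in done]
```

For example, `pending(all_ids, "results/<run_dir>")` would give the IDs to send to the model on a resumed run, where `all_ids` and the directory name stand in for your own values.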
### Re-prompting System
- If initial response parsing fails, the system automatically re-prompts the model
- Uses the previous response to ask for properly formatted answers
- Shows only relevant format examples based on question type (MCQ single, MCQ multiple, or integer)
### Comprehensive Evaluation
- Tracks multiple metrics: correct answers, partial credit, skipped questions, API failures
- Section-wise breakdown by subject
- Color-coded progress indicators in terminal output
## Dataset Structure
* **`metadata.jsonl`**: Contains metadata for each question image with fields:
- `file_name`: Path to the question image (relative to repo root)
- `question_id`: Unique identifier (e.g., "N24T3001")
- `exam_name`: Exam type ("NEET", "JEE_MAIN", "JEE_ADVANCED")
- `exam_year`: Year of the exam (integer)
- `subject`: Subject name (e.g., "Physics", "Chemistry", "Mathematics")
- `question_type`: Question format ("MCQ_SINGLE_CORRECT", "MCQ_MULTIPLE_CORRECT", "INTEGER")
- `correct_answer`: JSON-serialized string of correct answers (e.g., `'["A"]'`, `'["B", "C"]'`, `'["42"]'`)
* **`images/`**: Contains subdirectories for each exam set:
- `images/NEET_2024_T3/`: NEET 2024 question images
- `images/NEET_2025_45/`: NEET 2025 question images
- `images/JEE_ADVANCED_2024/`: JEE Advanced 2024 question images
- `images/JEE_ADVANCED_2025/`: JEE Advanced 2025 question images
* **`src/`**: Python source code for the benchmark system:
- `benchmark_runner.py`: Main benchmark execution script
- `llm_interface.py`: OpenRouter API interface with retry logic
- `evaluation.py`: Scoring and evaluation functions
- `prompts.py`: LLM prompts for different question types
- `utils.py`: Utility functions for parsing and configuration
* **`configs/`**: Configuration files:
- `benchmark_config.yaml`: Model selection and API parameters
* **`results/`**: Directory where benchmark results are stored (timestamped subdirectories)
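After cloning, a quick sanity check is to count the questions per exam and year straight from `metadata.jsonl`. The sketch below relies only on the `exam_name` and `exam_year` fields listed above; the function name and the assumption that the file sits in the repository root are illustrative.

```python
import json
from collections import Counter


def count_by_exam(metadata_path):
    """Count metadata rows per (exam_name, exam_year) pair."""
    counts = Counter()
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                row = json.loads(line)
                counts[(row["exam_name"], row["exam_year"])] += 1
    return counts


# Example (path is an assumption about where the file lives):
# for (exam, year), n in sorted(count_by_exam("metadata.jsonl").items()):
#     print(f"{exam} {year}: {n} questions")
```

The per-exam counts should add up to the 578 questions stated in the dataset description.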
## Data Fields
The dataset contains the following fields (accessible via `datasets`):
* `image`: The question image (`datasets.Image`)
* `question_id`: Unique identifier for the question (string)
* `exam_name`: Name of the exam (e.g., "NEET", "JEE_ADVANCED") (string)
* `exam_year`: Year of the exam (int)
* `subject`: Subject (e.g., "Physics", "Chemistry", "Mathematics") (string)
* `question_type`: Type of question (e.g., "MCQ_SINGLE_CORRECT", "INTEGER") (string)
* `correct_answer`: JSON-serialized string containing the correct answer(s). Use `json.loads()` to parse.
- For MCQs, these are option identifiers (e.g., `'["1"]'`, `'["A"]'`, `'["B", "C"]'`). The LLM should output the identifier as it appears in the question.
- For INTEGER type, this is the numerical answer as a string (e.g., `'["42"]'`, `'["12.75"]'`). The LLM should output the number.
- Some `MCQ_SINGLE_CORRECT` questions list more than one answer; a prediction that matches any one of them is scored as correct.
## LLM Answer Format
The LLM is expected to return its answer enclosed in `<answer>` tags. For example:
- MCQ Single Correct (Option A): `<answer>A</answer>`
- MCQ Single Correct (Option 2): `<answer>2</answer>`
- MCQ Multiple Correct (Options B and D): `<answer>B,D</answer>`
- Integer Answer: `<answer>42</answer>`
- Decimal Answer: `<answer>12.75</answer>`
- Skipped Question: `<answer>SKIP</answer>`
The system parses these formats. Prompts are designed to guide the LLM accordingly.
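Extracting an answer in this format can be done with a short regular expression. The following is a hedged sketch and not the parser shipped in `src/utils.py`: `parse_answer`, its return conventions, and the choice to use the last `<answer>` tag when several appear are assumptions.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.IGNORECASE | re.DOTALL)


def parse_answer(response_text):
    """Extract the answer from a model response.

    Returns a list of answer strings, the string "SKIP", or None when no
    usable <answer> tag is found.
    """
    found = ANSWER_RE.findall(response_text)
    if not found:
        return None
    content = found[-1].strip()  # assumption: the last tag is the final answer
    if content.upper() == "SKIP":
        return "SKIP"
    # Multiple-correct answers are comma-separated, e.g. "B,D".
    parts = [part.strip() for part in content.split(",") if part.strip()]
    return parts or None


print(parse_answer("The options B and D hold. <answer>B,D</answer>"))
```

A `None` result corresponds to the parsing-failure case, which is where a re-prompt would be triggered.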
## Troubleshooting
### Common Issues
**API Key Issues:**
- Ensure your `.env` file is in the root directory
- Verify your OpenRouter API key is valid and has sufficient credits
- Check that the key has access to vision-capable models
**Model Not Found:**
- Verify the model identifier exists on OpenRouter
- Ensure the model supports vision input
- Check your OpenRouter account has access to the specific model
**Memory Issues:**
- Reduce `max_tokens` in the config file
- Process smaller subsets using `--question_ids` filter
- Use models with smaller context windows
**Parsing Failures:**
- The system automatically attempts re-prompting for parsing failures
- Check the raw responses in `predictions.jsonl` to debug prompt issues
- Consider adjusting prompts in `src/prompts.py` for specific models
## Limitations & Data Contamination
### Contamination Risk
This benchmark uses questions from publicly administered exams (JEE Advanced and NEET). These questions are widely published online after each exam and may appear in the training data of evaluated models, particularly for older exam years (e.g., 2024). High scores on this benchmark may therefore partially reflect memorization rather than genuine reasoning ability.
To help assess contamination effects:
- **Compare across years**: Models may score higher on older exams (2024) whose questions had more time to enter training data, compared to newer exams (2025).
- **Cross-reference with novel benchmarks**: Compare performance on this benchmark with contamination-resistant benchmarks like GPQA or Humanity's Last Exam.
This benchmark is best understood as an evaluation on **publicly available exam questions** rather than a contamination-free assessment of reasoning capability.
### JEE Main Support
The benchmark framework fully supports JEE Main scoring rules in code, but the current dataset does not include JEE Main questions. JEE Main support is available for users who wish to add their own JEE Main question sets.
### Other Limitations
- **Single prompt template**: Results may vary with different prompt formulations. The benchmark currently uses one prompt template per question type.
- **No multi-run variance**: Each model is evaluated once per exam. Results may vary slightly across runs due to non-deterministic model behavior.
- **Image quality dependence**: Performance may be affected by image resolution, scan quality, or the presence of artifacts in question images.
- **Language Support**: Currently only supports English questions.
- **Model Dependencies**: Requires models with vision capabilities available through OpenRouter.
## Citation
If you use this dataset or benchmark code, please cite:
```bibtex
@misc{rejaullah_2025_jeeneetbenchmark,
title={JEE/NEET LLM Benchmark},
author={Md Rejaullah},
year={2025},
howpublished={\url{https://huggingface.co/datasets/Reja1/jee-neet-benchmark}},
}
```
## Contact
For questions, suggestions, or collaboration, feel free to reach out:
* **X (Twitter):** [https://x.com/RejaullahmdMd](https://x.com/RejaullahmdMd)
## License
This dataset and associated code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
Provided by:
RJTR001



