ibm-research/BPO-Bench
Hugging Face | Updated 2026-03-17 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/ibm-research/BPO-Bench
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
language:
- en
tags:
- agent-benchmark
- tool-use
- recruiting
- bpo
size_categories:
- 10K<n<100K
---
# BPO Benchmark Dataset
Evaluation dataset for AI agents using recruiting analytics APIs. This benchmark tests an agent's ability to use tool APIs to answer questions about BPO (Business Process Outsourcing) recruiting data.
## Dataset Structure
### Files
- **candidate_data.parquet** (1.8 MB): 64k synthetic candidate records with recruiting funnel data
- **candidate_data.csv** (13.5 MB): Same data in CSV format for human inspection
- **tasks.json** (26 KB): 26 core evaluation tasks with ground truth
- **tasks_type_mismatch.json**: 3 tasks testing agent handling of unexpected data types
- **tasks_http_errors.json**: 4 tasks testing agent handling of HTTP error codes
- **tasks_schema_violations.json**: 4 tasks testing agent handling of schema violations
- **tasks_edge_cases.json**: 5 tasks testing agent handling of edge cases (large payloads, Unicode, deep nesting)
- **tasks_undocumented.json**: 3 tasks testing agent handling of undocumented API behaviors
- **large_response_fixture.json**: Fixture data for the oversized-payload edge case test
### Candidate Data Schema
| Column | Type | Description |
|--------|------|-------------|
| `candidate_id` | string | Unique candidate identifier |
| `requisition_id` | string | Job requisition ID (e.g., "05958BR") |
| `requisition_template_id` | string | Template for similar requisitions |
| `source_name` | string | Sourcing channel (LinkedIn, Dice, GitHub, etc.) |
| `applied_at` | datetime | Application timestamp |
| `reviewed` | bool | Whether candidate was reviewed |
| `sla_met` | bool | Whether SLA was met for review |
| `interviewed` | bool | Whether candidate was interviewed |
| `offer_extended` | bool | Whether offer was extended |
| `offer_accepted` | bool | Whether offer was accepted |
| `hired` | bool | Whether candidate was hired |
| `hire_date` | datetime | Date of hire (if hired) |
| `skills` | list | Candidate skills list |
| `department` | string | Job department |
| `seniority_level` | string | Job seniority level |
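As a rough illustration of how the boolean funnel flags in this schema can be aggregated, the sketch below computes per-source stage rates with pandas. The four rows are made-up sample data, not taken from the dataset, and `funnel_rates_by_source` is a hypothetical helper, not one of the benchmark's API endpoints.

```python
import pandas as pd

# Made-up rows that reuse column names from the schema above.
sample = pd.DataFrame(
    {
        "candidate_id": ["c1", "c2", "c3", "c4"],
        "source_name": ["LinkedIn", "LinkedIn", "Dice", "Dice"],
        "reviewed": [True, True, True, False],
        "interviewed": [True, False, True, False],
        "hired": [True, False, False, False],
    }
)


def funnel_rates_by_source(df: pd.DataFrame) -> pd.DataFrame:
    """Share of candidates per source reaching each stage (hypothetical helper)."""
    stages = ["reviewed", "interviewed", "hired"]
    # The mean of a boolean column is the fraction of True values.
    return df.groupby("source_name")[stages].mean()


rates = funnel_rates_by_source(sample)
print(rates)
```

The same call should work on the full frame loaded from `candidate_data.parquet`, assuming the flag columns are stored as booleans as the table states.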
### Task Format
Each task in `tasks.json` has:
```json
{
  "name": "task_1",
  "description": "Task description and explanation",
  "intent": "The question to answer",
  "difficulty": "easy|medium|hard",
  "expected_output": {
    "response": "Expected answer text",
    "keywords": ["keyword1", "keyword2|alternative"],
    "tool_calls": [{"name": "api_endpoint", "args": {}}]
  }
}
```
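Each entry in `keywords` supports OR matching: alternatives are separated by `|`, so `"keyword2|alternative"` matches either `keyword2` or `alternative`.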
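One way to read the `|` convention is sketched below. The card does not say how the evaluation Space scores keywords, so two things here are assumptions: matching is a case-insensitive substring test, and every entry in the list must be satisfied. `keywords_match` is a hypothetical name.

```python
def keywords_match(response: str, keywords: list[str]) -> bool:
    """Return True if every keyword entry is satisfied by the response.

    Assumptions (not confirmed by the dataset card): matching is a
    case-insensitive substring test, and all entries must be satisfied.
    """
    text = response.lower()
    for entry in keywords:
        # "a|b" is satisfied when at least one alternative appears.
        alternatives = [alt.strip().lower() for alt in entry.split("|")]
        if not any(alt and alt in text for alt in alternatives):
            return False
    return True


answer = "LinkedIn had the most hires"
print(keywords_match(answer, ["linkedin", "hires|hired"]))
print(keywords_match(answer, ["dice|indeed"]))
```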
## API Endpoints
The benchmark includes 32 API endpoints (13 core + 19 error-prone).
### Core Endpoints (13)
#### Candidate Source APIs (7)
1. `candidate_source_sla_per_source` - SLA performance by source
2. `candidate_source_total_hires_by_source` - Hire counts by source
3. `candidate_source_candidate_volume_by_source` - Candidate volume metrics
4. `candidate_source_funnel_conversion_by_source` - Funnel conversion rates
5. `candidate_source_metadata_and_timeframe` - Data timeframe and metadata
6. `candidate_source_definitions_and_methodology` - Metric definitions
7. `candidate_source_source_recommendation_summary` - Source recommendations
#### Skills APIs (6)
8. `skills_skill_analysis` - Skill statistics and correlations
9. `skills_skill_impact_fill_rate` - Skill impact on fill rate
10. `skills_skill_impact_sla` - Skill impact on SLA
11. `skills_skill_relevance_justification` - Skill relevance explanation
12. `skills_successful_posting_criteria` - Success criteria thresholds
13. `skills_data_sources_used` - Data sources and models used
### Error-Prone Endpoints (19)
These endpoints intentionally exhibit problematic behaviors to test agent resilience and error handling.
#### Type Mismatch (3)
14. `skills_skill_summary` - Returns plain string instead of JSON
15. `candidate_source_source_sla_score` - Returns numeric float instead of structured response
16. `candidate_source_inactive_sources` - Returns boolean or list depending on data state
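An agent harness might guard against these shapes by normalizing every response before reasoning over it. The following is a minimal sketch under that assumption; `normalize_response` and its `kind` labels are hypothetical and not part of the benchmark.

```python
import json
from typing import Any


def normalize_response(raw: Any) -> dict:
    """Wrap an arbitrary tool response in a dict with a predictable shape."""
    if isinstance(raw, dict):
        return {"kind": "object", "value": raw}
    # bool must be tested before int/float because bool is a subclass of int.
    if isinstance(raw, bool):
        return {"kind": "boolean", "value": raw}
    if isinstance(raw, (int, float)):
        return {"kind": "number", "value": raw}
    if isinstance(raw, list):
        return {"kind": "list", "value": raw}
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {"kind": "text", "value": raw}
        # A string that holds JSON is parsed once and normalized again.
        return normalize_response(parsed)
    return {"kind": "unknown", "value": raw}


for raw in ["Top skills: Python, SQL", 0.87, False, ["Dice"], '{"score": 1}']:
    print(normalize_response(raw))
```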
#### HTTP Errors (4)
17. `candidate_source_candidate_pipeline_status` - Intermittently returns 404
18. `candidate_source_source_sla_check` - Returns 500 Internal Server Error
19. `candidate_source_funnel_status` - Returns 503 Service Unavailable
20. `candidate_source_bulk_source_data` - Returns 429 Too Many Requests
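A common way to cope with such status codes is bounded retry with exponential backoff. The sketch below assumes a caller-supplied function that returns a `(status, body)` pair; that interface, the `RETRYABLE` set, and the delay values are illustrative choices, and the card does not state whether any of these endpoints eventually succeeds. An intermittent 404 could be added to the set if retrying it is acceptable.

```python
import time
from typing import Callable, Tuple

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE = {429, 500, 503}


def call_with_retry(
    call: Callable[[], Tuple[int, object]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Call `call` until it returns a non-retryable status or attempts run out."""
    status, body = call()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        # Delays double on each retry: base_delay, 2 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = call()
    return status, body


# Fake endpoint that fails twice before succeeding; delays are recorded, not slept.
responses = iter([(503, None), (429, None), (200, {"ok": True})])
delays: list = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

If the final status is still an error after the last attempt, the caller receives it unchanged and can report the failure instead of inventing an answer.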
#### Schema Violations (4)
21. `skills_model_registry` - No output schema; returns untyped dict
22. `skills_skill_lookup` - Returns extra undeclared fields
23. `candidate_source_source_metrics_lite` - Randomly omits required fields
24. `candidate_source_volume_report` - Returns wrong field types (strings for numbers)
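Missing fields, undeclared fields, and numbers sent as strings can all be caught with a small validation pass. The sketch below is one possible approach; the field names in `EXPECTED` are invented for illustration and do not come from the benchmark's real schemas, and only string-to-number coercion is attempted.

```python
from typing import Any

# Hypothetical schema: field name -> expected numeric type.
EXPECTED = {"total_candidates": int, "sla_rate": float}


def validate_and_coerce(payload: dict[str, Any], expected: dict[str, type]):
    """Return (clean, problems): coerced known fields plus a list of issues."""
    clean: dict[str, Any] = {}
    problems: list[str] = []
    for field, typ in expected.items():
        if field not in payload:
            problems.append(f"missing: {field}")
            continue
        value = payload[field]
        if isinstance(value, str):
            try:
                value = typ(value)  # e.g. "120" -> 120, "0.9" -> 0.9
            except ValueError:
                problems.append(f"bad type: {field}")
                continue
        clean[field] = value
    # Anything not declared in the schema is reported, not silently dropped.
    extras = sorted(set(payload) - set(expected))
    problems.extend(f"undeclared: {name}" for name in extras)
    return clean, problems


clean, problems = validate_and_coerce(
    {"total_candidates": "120", "debug_id": "x1"}, EXPECTED
)
print(clean, problems)
```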
#### Edge Cases (5)
25. `candidate_source_full_candidate_details` - Oversized payload (~1 MB)
26. `candidate_source_source_directory` - Unicode and special characters
27. `skills_skill_deep_analysis` - Deeply nested JSON (5+ levels)
28. `candidate_source_sla_extended` - Unexpected extra fields
29. `skills_analyze_skill_match` - Mismatched schema vs documentation
#### Undocumented Behaviors (3)
30. `candidate_source_requisition_details` - Non-standard error format
31. `candidate_source_list_all_sources` - Undocumented pagination
32. `candidate_source_batch_metrics` - Undocumented rate limiting headers
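For the pagination case, a defensive client keeps requesting pages until the response stops advertising another one. The card does not document the real pagination fields, so the `items` and `next_cursor` keys below are assumptions, as is the `fetch_page` callable; the sketch only shows the loop shape.

```python
from typing import Callable, Iterator, Optional


def iter_all_pages(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield items from successive pages until no cursor is returned.

    The "items" and "next_cursor" keys are assumptions for illustration;
    the dataset card does not document the real pagination fields.
    """
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap guards against a cursor loop
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Fake two-page endpoint used only to exercise the helper.
PAGES = {
    None: {"items": [{"source": "LinkedIn"}], "next_cursor": "p2"},
    "p2": {"items": [{"source": "Dice"}]},
}
sources = [item["source"] for item in iter_all_pages(lambda c: PAGES[c])]
print(sources)
```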
## Usage
### With the Evaluation Space
The easiest way to use this dataset is through the evaluation Space:
[ibm-research/bpo-benchmark-eval](https://huggingface.co/spaces/ibm-research/bpo-benchmark-eval)
### Programmatic Access
```python
from huggingface_hub import hf_hub_download
import pandas as pd
import json
# Dataset ID as given in this card's title and download link
repo = "ibm-research/BPO-Bench"
# Download candidate data
parquet_path = hf_hub_download(repo, "candidate_data.parquet", repo_type="dataset")
# Download all task suites
task_files = [
    "tasks.json",
    "tasks_type_mismatch.json",
    "tasks_http_errors.json",
    "tasks_schema_violations.json",
    "tasks_edge_cases.json",
    "tasks_undocumented.json",
]
task_paths = {
    f: hf_hub_download(repo, f, repo_type="dataset")
    for f in task_files
}
# Load data
df = pd.read_parquet(parquet_path)
with open(task_paths["tasks.json"]) as f:
    core_tasks = json.load(f)
print(f"Loaded {len(df)} candidates")
print(f"Loaded {len(core_tasks[0]['test_cases'])} core tasks")
print(f"Task suites: {list(task_paths.keys())}")
```
## Statistics
- **Candidates**: 64,000 records
- **Requisitions**: 1,047 unique
- **Sourcing Channels**: 7 (LinkedIn, Dice, GitHub, Indeed, Referral, CyberSec Jobs, Company Website)
- **Total API Endpoints**: 32 (13 core + 19 error-prone)
- **Core Evaluation Tasks**: 26 (Easy: 20, Medium: 3, Hard: 3)
- **Error-Prone Tasks**: 19 (Type Mismatch: 3, HTTP Errors: 4, Schema Violations: 4, Edge Cases: 5, Undocumented: 3)
- **Total Tasks**: 45
- **Time Range**: Oct 2023 - Mar 2025
# License
Apache 2.0
# Paper
From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production
https://arxiv.org/abs/2510.23856
# Citation
```bibtex
@inproceedings{shlomov2025benchmarks,
  title={From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production},
  author={Shlomov, Segev and Oved, Alon and Marreed, Sami and Levy, Ido and Akrabi, Offer and Yaeli, Avi and Str{\k{a}}k, {\L}ukasz and Koumpan, Elizabeth and Goldshtein, Yinon and Shapira, Eilam and others},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}
```
Provider:
ibm-research



