ibm-research/BPO-Bench

Name: ibm-research/BPO-Bench
Creator: ibm-research
Published: 2026-03-17 09:38:13
License: Apache 2.0

Hugging Face · Updated 2026-03-17 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/ibm-research/BPO-Bench

Description:

---
license: apache-2.0
task_categories:
- question-answering
language:
- en
tags:
- agent-benchmark
- tool-use
- recruiting
- bpo
size_categories:
- 10K<n<100K
---

# BPO Benchmark Dataset

Evaluation dataset for AI agents using recruiting analytics APIs. This benchmark tests an agent's ability to use tool APIs to answer questions about BPO (Business Process Outsourcing) recruiting data.

## Dataset Structure

### Files

- **candidate_data.parquet** (1.8 MB): 64k synthetic candidate records with recruiting funnel data
- **candidate_data.csv** (13.5 MB): Same data in CSV format for human inspection
- **tasks.json** (26 KB): 26 core evaluation tasks with ground truth
- **tasks_type_mismatch.json**: 3 tasks testing agent handling of unexpected data types
- **tasks_http_errors.json**: 4 tasks testing agent handling of HTTP error codes
- **tasks_schema_violations.json**: 4 tasks testing agent handling of schema violations
- **tasks_edge_cases.json**: 5 tasks testing agent handling of edge cases (large payloads, Unicode, deep nesting)
- **tasks_undocumented.json**: 3 tasks testing agent handling of undocumented API behaviors
- **large_response_fixture.json**: Fixture data for the oversized-payload edge case test

### Candidate Data Schema

| Column | Type | Description |
|--------|------|-------------|
| `candidate_id` | string | Unique candidate identifier |
| `requisition_id` | string | Job requisition ID (e.g., "05958BR") |
| `requisition_template_id` | string | Template for similar requisitions |
| `source_name` | string | Sourcing channel (LinkedIn, Dice, GitHub, etc.) |
| `applied_at` | datetime | Application timestamp |
| `reviewed` | bool | Whether candidate was reviewed |
| `sla_met` | bool | Whether SLA was met for review |
| `interviewed` | bool | Whether candidate was interviewed |
| `offer_extended` | bool | Whether offer was extended |
| `offer_accepted` | bool | Whether offer was accepted |
| `hired` | bool | Whether candidate was hired |
| `hire_date` | datetime | Date of hire (if hired) |
| `skills` | list | Candidate skills list |
| `department` | string | Job department |
| `seniority_level` | string | Job seniority level |

### Task Format

Each task in `tasks.json` has:

```json
{
  "name": "task_1",
  "description": "Task description and explanation",
  "intent": "The question to answer",
  "difficulty": "easy|medium|hard",
  "expected_output": {
    "response": "Expected answer text",
    "keywords": ["keyword1", "keyword2|alternative"],
    "tool_calls": [{"name": "api_endpoint", "args": {}}]
  }
}
```

Keywords support OR matching with the `|` separator.

## API Endpoints

The benchmark includes 32 API endpoints (13 core + 19 error-prone).

### Core Endpoints (13)

#### Candidate Source APIs (7)

1. `candidate_source_sla_per_source` - SLA performance by source
2. `candidate_source_total_hires_by_source` - Hire counts by source
3. `candidate_source_candidate_volume_by_source` - Candidate volume metrics
4. `candidate_source_funnel_conversion_by_source` - Funnel conversion rates
5. `candidate_source_metadata_and_timeframe` - Data timeframe and metadata
6. `candidate_source_definitions_and_methodology` - Metric definitions
7. `candidate_source_source_recommendation_summary` - Source recommendations

#### Skills APIs (6)

8. `skills_skill_analysis` - Skill statistics and correlations
9. `skills_skill_impact_fill_rate` - Skill impact on fill rate
10. `skills_skill_impact_sla` - Skill impact on SLA
11. `skills_skill_relevance_justification` - Skill relevance explanation
12. `skills_successful_posting_criteria` - Success criteria thresholds
13. `skills_data_sources_used` - Data sources and models used

### Error-Prone Endpoints (19)

These endpoints intentionally exhibit problematic behaviors to test agent resilience and error handling.

#### Type Mismatch (3)

14. `skills_skill_summary` - Returns plain string instead of JSON
15. `candidate_source_source_sla_score` - Returns numeric float instead of structured response
16. `candidate_source_inactive_sources` - Returns boolean or list depending on data state

#### HTTP Errors (4)

17. `candidate_source_candidate_pipeline_status` - Intermittently returns 404
18. `candidate_source_source_sla_check` - Returns 500 Internal Server Error
19. `candidate_source_funnel_status` - Returns 503 Service Unavailable
20. `candidate_source_bulk_source_data` - Returns 429 Too Many Requests

#### Schema Violations (4)

21. `skills_model_registry` - No output schema; returns untyped dict
22. `skills_skill_lookup` - Returns extra undeclared fields
23. `candidate_source_source_metrics_lite` - Randomly omits required fields
24. `candidate_source_volume_report` - Returns wrong field types (strings for numbers)

#### Edge Cases (5)

25. `candidate_source_full_candidate_details` - Oversized payload (~1MB)
26. `candidate_source_source_directory` - Unicode and special characters
27. `skills_skill_deep_analysis` - Deeply nested JSON (5+ levels)
28. `candidate_source_sla_extended` - Unexpected extra fields
29. `skills_analyze_skill_match` - Mismatched schema vs documentation

#### Undocumented Behaviors (3)

30. `candidate_source_requisition_details` - Non-standard error format
31. `candidate_source_list_all_sources` - Undocumented pagination
32. `candidate_source_batch_metrics` - Undocumented rate limiting headers

## Usage

### With the Evaluation Space

The easiest way to use this dataset is through the evaluation Space: [ibm-research/bpo-benchmark-eval](https://huggingface.co/spaces/ibm-research/bpo-benchmark-eval)

### Programmatic Access

```python
from huggingface_hub import hf_hub_download
import pandas as pd
import json

# Dataset repository ID, matching the name this dataset is published under
repo = "ibm-research/BPO-Bench"

# Download candidate data
parquet_path = hf_hub_download(repo, "candidate_data.parquet", repo_type="dataset")

# Download all task suites
task_files = [
    "tasks.json",
    "tasks_type_mismatch.json",
    "tasks_http_errors.json",
    "tasks_schema_violations.json",
    "tasks_edge_cases.json",
    "tasks_undocumented.json",
]
task_paths = {
    f: hf_hub_download(repo, f, repo_type="dataset")
    for f in task_files
}

# Load data
df = pd.read_parquet(parquet_path)
with open(task_paths["tasks.json"]) as f:
    core_tasks = json.load(f)

print(f"Loaded {len(df)} candidates")
print(f"Loaded {len(core_tasks[0]['test_cases'])} core tasks")
print(f"Task suites: {list(task_paths.keys())}")
```

## Statistics

- **Candidates**: 64,000 records
- **Requisitions**: 1,047 unique
- **Sourcing Channels**: 7 (LinkedIn, Dice, GitHub, Indeed, Referral, CyberSec Jobs, Company Website)
- **Total API Endpoints**: 32 (13 core + 19 error-prone)
- **Core Evaluation Tasks**: 26 (Easy: 20, Medium: 3, Hard: 3)
- **Error-Prone Tasks**: 19 (Type Mismatch: 3, HTTP Errors: 4, Schema Violations: 4, Edge Cases: 5, Undocumented: 3)
- **Total Tasks**: 45
- **Time Range**: Oct 2023 - Mar 2025

# License

Apache 2.0

# Paper

From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production

https://arxiv.org/abs/2510.23856

# Citation

```bibtex
@inproceedings{shlomov2025benchmarks,
  title={From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production},
  author={Shlomov, Segev and Oved, Alon and Marreed, Sami and Levy, Ido and Akrabi, Offer and Yaeli, Avi and Str{\k{a}}k, {\L}ukasz and Koumpan, Elizabeth and Goldshtein, Yinon and Shapira, Eilam and others},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}
```
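
As a note on the task format: the card states only that entries in `keywords` support OR matching with the `|` separator. The sketch below shows one way such a check could work. The case-insensitive substring comparison, the fractional score, and the helper names `keyword_matches` and `keyword_score` are assumptions made for illustration, not the benchmark's actual scoring code.

```python
def keyword_matches(response: str, keyword: str) -> bool:
    """Return True if any `|`-separated alternative occurs in the response.

    Assumption: matching is a case-insensitive substring test.
    """
    alternatives = [alt.strip().lower() for alt in keyword.split("|") if alt.strip()]
    text = response.lower()
    return any(alt in text for alt in alternatives)


def keyword_score(response: str, keywords: list[str]) -> float:
    """Fraction of keyword entries satisfied by the response (hypothetical metric)."""
    if not keywords:
        return 1.0
    hits = sum(keyword_matches(response, kw) for kw in keywords)
    return hits / len(keywords)


# Hypothetical task fragment following the documented `expected_output` layout.
task = {"expected_output": {"keywords": ["LinkedIn", "highest|best|top"]}}
answer = "LinkedIn has the top SLA performance."

print(keyword_score(answer, task["expected_output"]["keywords"]))  # prints 1.0
```

Each keyword entry counts once no matter how many of its alternatives appear, so `"highest|best|top"` is satisfied by any one of the three words.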
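
Several of the error-prone endpoints listed earlier return payloads whose types differ from what a caller would expect, for example a plain string instead of JSON (`skills_skill_summary`) or strings in place of numbers (`candidate_source_volume_report`). A minimal sketch of defensive normalization on the agent side follows. The function `normalize_response` and the sample payload are hypothetical, and the real endpoints may need different handling.

```python
import json
from typing import Any


def normalize_response(raw: Any) -> Any:
    """Best-effort cleanup of a tool response (illustrative only).

    - Strings that parse as JSON are replaced by the parsed value, so "120"
      becomes 120 and '{"a": "1"}' becomes {"a": 1}.
    - Strings that are not valid JSON (free text, IDs such as "05958BR")
      are returned unchanged.
    - Dicts and lists are normalized recursively.
    """
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return raw
        # A JSON-encoded string decodes to a plain string; stop there.
        return parsed if isinstance(parsed, str) else normalize_response(parsed)
    if isinstance(raw, dict):
        return {key: normalize_response(value) for key, value in raw.items()}
    if isinstance(raw, list):
        return [normalize_response(item) for item in raw]
    return raw


# Hypothetical payload with numbers delivered as strings.
payload = {
    "source": "LinkedIn",
    "volume": "120",
    "rate": "0.75",
    "requisition_id": "05958BR",
}
print(normalize_response(payload))
```

Parsing with `json.loads` rather than calling `int()` or `float()` directly keeps identifiers with leading zeros as strings, because JSON does not accept them as numbers. An identifier made up only of digits without a leading zero would still be converted, so a real agent would likely want a per-field allowlist.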

Provider:

ibm-research
