
trillionlabs/rBridge

Source: Hugging Face · Updated 2026-02-26 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/trillionlabs/rBridge
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- logprobs
- token-probability
- rbridge
- proxy-model
- scaling-laws
pretty_name: "rBridge Paper's Reasoning Traces & Token Logprobs"
size_categories:
- 1K<n<10K
configs:
- config_name: arc_challenge
  data_files:
  - split: test
    path: traces/arc_challenge/gpt4o_s1_new.json
- config_name: cqa
  data_files:
  - split: test
    path: traces/cqa/gpt4o_s1.json
- config_name: gsm8k
  data_files:
  - split: test
    path: traces/gsm8k/gpt4o_s1_new.json
- config_name: humaneval
  data_files:
  - split: test
    path: traces/humaneval/gpt4o_s1.json
- config_name: math500
  data_files:
  - split: test
    path: traces/math500/gpt4o_s1_merged.json
- config_name: mmlu_pro
  data_files:
  - split: test
    path: traces/mmlu_pro/gpt4o_s1_stem_2_new.json
---

# 🌉 rBridge Paper's Reasoning Traces & Token Logprobs

This dataset contains GPT-4o reasoning traces and token-level logprobs for six reasoning benchmarks, released as part of the [rBridge](https://github.com/trillion-labs/rBridge) project ([paper](https://arxiv.org/abs/2509.21013)).

rBridge uses these traces as gold-label reasoning references. By computing a **weighted negative log-likelihood** over these traces, with each token weighted by the frontier model's confidence, small proxy models (≤1B) can reliably predict the reasoning performance of much larger LLMs (7B–32B+).

## 📊 Benchmarks

| Config | Benchmark | Traces | Logprob Tokens | Logprob Files |
|---|---|---:|---:|---:|
| `arc_challenge` | ARC-Challenge | 1,172 | 143,492 | 3 parts |
| `cqa` | CommonsenseQA | 1,221 | 149,350 | 3 parts |
| `gsm8k` | GSM8K | 1,319 | 178,309 | 3 parts |
| `humaneval` | HumanEval | 164 | 42,290 | 1 file |
| `math500` | MATH-500 | 495 | 174,636 | 3 parts |
| `mmlu_pro` | MMLU-Pro (STEM) | 5,791 | 1,627,990 | 31 parts |
| | **Total** | **10,162** | **2,316,067** | |

## 📁 File Structure

Each benchmark directory under `traces/` contains:

1. **Reasoning traces** (`.json`): A JSON array of GPT-4o completions with reasoning and final answers.
2. **Token logprobs** (`.jsonl`): One row per token from the GPT-4o completion, with full top-k logprobs. Split into multiple parts for large benchmarks.

```
traces/
├── arc_challenge/
│   ├── gpt4o_s1_new.json                          # reasoning traces
│   ├── gpt4o_s1_new_logprobs_part01_of_03.jsonl   # token logprobs
│   ├── gpt4o_s1_new_logprobs_part02_of_03.jsonl
│   └── gpt4o_s1_new_logprobs_part03_of_03.jsonl
├── cqa/
│   ├── gpt4o_s1.json
│   └── gpt4o_s1_logprobs_part{01..03}_of_03.jsonl
├── gsm8k/
│   ├── gpt4o_s1_new.json
│   └── gpt4o_s1_new_logprobs_part{01..03}_of_03.jsonl
├── humaneval/
│   ├── gpt4o_s1.json
│   └── gpt4o_s1_logprobs.jsonl
├── math500/
│   ├── gpt4o_s1_merged.json
│   └── gpt4o_s1_logprobs_part{01..03}_of_03_merged.jsonl
└── mmlu_pro/
    ├── gpt4o_s1_stem_2_new.json
    └── gpt4o_s1_stem_2_new_logprobs_part{01..31}_of_31.jsonl
```

## 🔖 Schema

### Reasoning Traces (`.json`)

Each entry in the JSON array contains:

| Field | Type | Description |
|---|---|---|
| `doc_id` | int | Document index |
| `sample_id` | int | Sample index |
| `dataset` | string | Benchmark name |
| `original_question` | string | Input question / prompt |
| `expected_answer` | string | Reference answer text |
| `ground_truth_final_answer` | string | Ground-truth label (e.g., `"C"`) |
| `gpt4o_reasoning` | string | GPT-4o chain-of-thought reasoning |
| `gpt4o_final_answer` | string | GPT-4o predicted answer |
| `model` | string | Model identifier (`openai/gpt-4o`) |
| `usage` | object | Token usage (prompt, completion, total) |
| `subject` | string | Subject / category (where applicable) |
| `level` | string | Difficulty level (where applicable) |

Additional benchmark-specific fields (e.g., `choices`, `task_id`, `question_id`) vary by dataset.

### Token Logprobs (`.jsonl`)

Each line represents one token from the GPT-4o completion:

| Field | Type | Description |
|---|---|---|
| `doc_id` | int | Document index (links to trace entry) |
| `sample_id` | int | Sample index (links to trace entry) |
| `position` | int | Token position in the completion |
| `token` | string | The token string |
| `logprob` | float | Log-probability assigned by GPT-4o |
| `prob` | float | Probability (exp of logprob) |
| `top_logprobs` | list | Top-k alternative tokens with their logprobs and probs |

## 🚀 Usage

### Load reasoning traces by benchmark

```python
from datasets import load_dataset

# Load a specific benchmark
ds = load_dataset("trillionlabs/rBridge", "arc_challenge", split="test")
print(ds[0]["gpt4o_reasoning"])

# Load another benchmark
ds = load_dataset("trillionlabs/rBridge", "mmlu_pro", split="test")
```

### Download logprobs files

```python
from huggingface_hub import hf_hub_download

# Download a single logprobs file
path = hf_hub_download(
    repo_id="trillionlabs/rBridge",
    filename="traces/arc_challenge/gpt4o_s1_new_logprobs_part01_of_03.jsonl",
    repo_type="dataset",
)

# Read it
import json

with open(path) as f:
    for line in f:
        token_data = json.loads(line)
        print(token_data["token"], token_data["prob"])
        break
```

### Download everything

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="trillionlabs/rBridge", repo_type="dataset")
```

## 🔗 Related Resources

- **Paper**: [Predicting LLM Reasoning Performance with Small Proxy Model](https://arxiv.org/abs/2509.21013)

## 📝 Citation

```bibtex
@inproceedings{
  koh2026predicting,
  title={Predicting {LLM} Reasoning Performance with Small Proxy Model},
  author={Woosung Koh and Juyoung Suk and Sungjun Han and Se-Young Yun and Jay Shin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=JSE40ljyKm}
}
```
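
## 🧪 Example: Weighted NLL Sketch

The README describes a weighted negative log-likelihood in which each trace token is weighted by the frontier model's confidence. The following is a minimal sketch of that idea using the `doc_id`, `sample_id`, `position`, and `prob` fields from the token logprobs schema. It is not the rBridge implementation: the function names `group_rows` and `weighted_nll` are hypothetical, normalizing by the total weight is an assumption, and `proxy_logprobs` is assumed to hold a proxy model's log-probabilities already aligned one-to-one with the GPT-4o tokens (tokenizer alignment is out of scope here). See the paper for the exact formulation.

```python
import json
from collections import defaultdict


def group_rows(lines):
    """Group JSONL token rows by (doc_id, sample_id), sorted by position.

    `lines` is any iterable of JSON strings, e.g. an open logprobs file.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        groups[(row["doc_id"], row["sample_id"])].append(row)
    for rows in groups.values():
        rows.sort(key=lambda r: r["position"])
    return dict(groups)


def weighted_nll(frontier_probs, proxy_logprobs):
    """Confidence-weighted NLL for one trace.

    frontier_probs: GPT-4o probability per token (the `prob` field).
    proxy_logprobs: hypothetical proxy-model log-probability per token,
        assumed to be aligned with the GPT-4o tokens.
    Normalizing by the sum of weights is an assumption of this sketch.
    """
    if len(frontier_probs) != len(proxy_logprobs):
        raise ValueError("frontier_probs and proxy_logprobs must align")
    total_weight = sum(frontier_probs)
    if total_weight == 0:
        raise ValueError("all token weights are zero")
    weighted_sum = sum(w * -lp for w, lp in zip(frontier_probs, proxy_logprobs))
    return weighted_sum / total_weight
```

To use it, pass an open logprobs file to `group_rows`, take `[r["prob"] for r in rows]` for one `(doc_id, sample_id)` group as `frontier_probs`, and supply the proxy model's per-token log-probabilities from your own scoring code.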
Provider:
trillionlabs