aldea-ai/mrcr-binned
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/aldea-ai/mrcr-binned
Resource description:
# MRCR Binned Dataset
Re-binned version of [openai/mrcr](https://huggingface.co/datasets/openai/mrcr) with accurate token counts using the `o200k_base` tokenizer (tiktoken).
## Bin Boundaries
Bins are determined by the total number of `o200k_base` tokens in prompt + answer, matching the official MRCR methodology.
| Bin | Token Range | Label |
|-----|-------------|-------|
| 0 | (0, 4,096] | 4k |
| 1 | (4,096, 8,192] | 8k |
| 2 | (8,192, 16,384] | 16k |
| 3 | (16,384, 32,768] | 32k |
| 4 | (32,768, 65,536] | 64k |
| 5 | (65,536, 131,072] | 128k |
| 6 | (131,072, 262,144] | 256k |
| 7 | (262,144, 524,288] | 512k |
| 8 | (524,288, 1,048,576] | 1M |
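The boundaries in the table are successive doublings of 4,096, each bin open on the left and closed on the right. A minimal sketch of that mapping follows; the names `BIN_EDGES`, `BIN_LABELS`, `token_bin`, and `bin_label` are illustrative and not taken from the dataset's own tooling.

```python
import bisect

# Upper (inclusive) edges of the nine bins in the table: 4,096 * 2**i for i = 0..8.
BIN_EDGES = [4096 * 2**i for i in range(9)]
BIN_LABELS = ["4k", "8k", "16k", "32k", "64k", "128k", "256k", "512k", "1M"]


def token_bin(n_tokens: int) -> int:
    """Return the bin index for a token count, using (lower, upper] intervals."""
    if n_tokens <= 0 or n_tokens > BIN_EDGES[-1]:
        raise ValueError(f"token count {n_tokens} is outside (0, {BIN_EDGES[-1]}]")
    # bisect_left finds the first edge >= n_tokens, so a count that sits
    # exactly on an upper edge stays in the lower bin.
    return bisect.bisect_left(BIN_EDGES, n_tokens)


def bin_label(bin_index: int) -> str:
    """Return the human-readable label for a bin index."""
    return BIN_LABELS[bin_index]
```

For example, a count of exactly 4,096 falls in bin 0, while 4,097 falls in bin 1.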
## Columns
All original columns from `openai/mrcr` are preserved, plus:
- **`o200k_tokens`**: Exact token count (prompt + answer) using `tiktoken.get_encoding("o200k_base")`
- **`o200k_bin`**: Integer bin index (0-8) based on the boundaries above
- **`o200k_bin_label`**: Human-readable bin label
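The `o200k_tokens` column could be reproduced roughly as sketched below. This is an assumption-laden sketch, not the pipeline used to build the dataset: it assumes the `prompt` column is a JSON-encoded list of chat messages with a `content` field (as in the upstream `openai/mrcr` card) and that only message contents and the answer text are counted. The helper names `get_o200k_encoder` and `count_tokens` are hypothetical.

```python
import json
from typing import Callable, List


def get_o200k_encoder() -> Callable[[str], List[int]]:
    """Return the o200k_base encode function (requires the tiktoken package)."""
    import tiktoken  # imported lazily so the counting helper works without it

    return tiktoken.get_encoding("o200k_base").encode


def count_tokens(prompt_json: str, answer: str,
                 encode: Callable[[str], List[int]]) -> int:
    """Count tokens in prompt + answer.

    Assumption: prompt_json is a JSON list of {"role": ..., "content": ...}
    messages, and only the "content" strings are tokenized.
    """
    messages = json.loads(prompt_json)
    prompt_tokens = sum(len(encode(m["content"])) for m in messages)
    return prompt_tokens + len(encode(answer))
```

With `tiktoken` installed, `count_tokens(row["prompt"], row["answer"], get_o200k_encoder())` would give a count to compare against the stored `o200k_tokens` value. Small differences are possible if the original count handled roles or formatting differently.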
## Needle Counts
- 2-needle: 800 samples
- 4-needle: 800 samples
- 8-needle: 800 samples
100 samples per bin per needle count.
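The stated distribution can be checked by tallying rows per needle count and bin. Note that 800 samples at 100 per bin implies eight populated bins per needle count, while the table lists nine, so a tally like the sketch below would show which bins actually contain samples. The column name `n_needles` is assumed from the upstream `openai/mrcr` schema, and `tally_by_needles_and_bin` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Mapping


def tally_by_needles_and_bin(rows: Iterable[Mapping]) -> Counter:
    """Count rows per (needle count, bin index) pair.

    Assumption: each row has an "n_needles" field (from openai/mrcr)
    and the "o200k_bin" field added by this dataset.
    """
    return Counter((row["n_needles"], row["o200k_bin"]) for row in rows)


# Possible usage with the Hugging Face `datasets` library (the split name
# "train" is an assumption):
#
#   from datasets import load_dataset
#   ds = load_dataset("aldea-ai/mrcr-binned", split="train")
#   for key, count in sorted(tally_by_needles_and_bin(ds).items()):
#       print(key, count)
```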
## Source
Derived from [openai/mrcr](https://huggingface.co/datasets/openai/mrcr). See that dataset for full details on the MRCR benchmark.
Provider:
aldea-ai



