
aldea-ai/mrcr-binned

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/aldea-ai/mrcr-binned
Resource description:
# MRCR Binned Dataset

Re-binned version of [openai/mrcr](https://huggingface.co/datasets/openai/mrcr) with accurate token counts using the `o200k_base` tokenizer (tiktoken).

## Bin Boundaries

Bins are determined by the total number of `o200k_base` tokens in prompt + answer, matching the official MRCR methodology.

| Bin | Token Range | Label |
|-----|-------------|-------|
| 0 | (0, 4,096] | 4k |
| 1 | (4,096, 8,192] | 8k |
| 2 | (8,192, 16,384] | 16k |
| 3 | (16,384, 32,768] | 32k |
| 4 | (32,768, 65,536] | 64k |
| 5 | (65,536, 131,072] | 128k |
| 6 | (131,072, 262,144] | 256k |
| 7 | (262,144, 524,288] | 512k |
| 8 | (524,288, 1,048,576] | 1M |

## Columns

All original columns from `openai/mrcr` are preserved, plus:

- **`o200k_tokens`**: Exact token count (prompt + answer) using `tiktoken.get_encoding("o200k_base")`
- **`o200k_bin`**: Integer bin index (0-8) based on the boundaries above
- **`o200k_bin_label`**: Human-readable bin label

## Needle Counts

- 2-needle: 800 samples
- 4-needle: 800 samples
- 8-needle: 800 samples

100 samples per bin per needle count.

## Source

Derived from [openai/mrcr](https://huggingface.co/datasets/openai/mrcr). See that dataset for full details on the MRCR benchmark.
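
The bin boundaries in the table can be expressed as a small lookup. The following is a minimal sketch, not the dataset authors' code: the function name `assign_bin` and the constants are hypothetical, and it assumes the token count has already been computed (for example with `len(tiktoken.get_encoding("o200k_base").encode(text))` over the prompt and answer text, whose exact serialization is not specified here).

```python
from bisect import bisect_left

# Inclusive upper bounds of bins 0..8, taken from the table: 4,096 ... 1,048,576.
UPPER_BOUNDS = [4096 * 2**i for i in range(9)]
LABELS = ["4k", "8k", "16k", "32k", "64k", "128k", "256k", "512k", "1M"]


def assign_bin(token_count: int) -> tuple[int, str]:
    """Map a token count to (bin index, bin label).

    Hypothetical helper: each bin is a half-open interval (lower, upper],
    so a count exactly equal to an upper bound stays in that bin.
    """
    if token_count <= 0 or token_count > UPPER_BOUNDS[-1]:
        raise ValueError(f"token count {token_count} is outside (0, 1,048,576]")
    # bisect_left returns the index of the first upper bound >= token_count.
    idx = bisect_left(UPPER_BOUNDS, token_count)
    return idx, LABELS[idx]


if __name__ == "__main__":
    for n in (4096, 4097, 100_000, 1_048_576):
        print(n, assign_bin(n))
```

Using `bisect_left` rather than `bisect_right` is what makes the upper bound inclusive: a count of exactly 4,096 falls in bin 0, while 4,097 falls in bin 1.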
Provider:
aldea-ai