KevinDavidHayes/long-context-attention-labels-64k-128k
Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KevinDavidHayes/long-context-attention-labels-64k-128k
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- long-context
- attention-labeling
- lookback-distance
size_categories:
- 10K<n<100K
---
# Long-Context Attention Labels (64K & 128K)
Attention-based document labels for long-context training data selection.
## Overview
Each document is labeled with attention lookback metrics, computed by running it through a model and measuring how far back each token attends. Distances are taken over each token's top-k (k=10) attended positions, ranked by head-averaged attention probability.
| Run | Model | Context | Records |
|-----|-------|---------|---------|
| olmo3_64k | OLMo-3-1025-7B (stage2) | 65,536 tokens | ~5,000 |
| olmo3_128k | OLMo-3-1025-7B (stage2) | 131,072 tokens | ~4,600 |
| qwen25_64k | Qwen2.5-7B-Instruct | 65,536 tokens | ~1,900 |
| qwen25_128k | Qwen2.5-7B-Instruct | 131,072 tokens | ~4,800 |
## Fields
| Field | Description |
|-------|-------------|
| `domain` | Source domain (code, books, arxiv, web, govreport) |
| `seq_len` | Sequence length in tokens |
| `sequence_label` | Binary label (1=long-range attention, 0=short-range) using fixed 2048-token threshold |
| `sequence_avg_max_lookback` | Mean of per-token max lookback distance across top-10 attended positions |
| `sequence_avg_median_lookback` | Mean of per-token median lookback distance (primary metric) |
| `sequence_distance_variance` | Variance of attention distances |
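
As a rough illustration of how these fields relate to one another, the sketch below derives them from a single head-averaged, causal attention matrix. It is not the pipeline's actual code. The function name `lookback_metrics`, skipping queries inside the sink prefix, pooling all top-k distances for the variance, and applying the 2048-token threshold to the average median lookback are all assumptions made for the example. The defaults mirror the values stated in this card (top-k of 10, first 32 tokens ignored). It uses only the standard library and is meant for small inputs.

```python
from statistics import fmean, median, pvariance

def lookback_metrics(attn, top_k=10, sink_tokens=32, threshold=2048):
    """attn[q][j]: head-averaged attention probability from query q to key j (causal)."""
    if len(attn) <= sink_tokens:
        raise ValueError("sequence is shorter than the sink prefix")
    max_lb, med_lb, all_dist = [], [], []
    # Assumption: queries inside the sink prefix are skipped entirely.
    for q in range(sink_tokens, len(attn)):
        # Candidate keys are causal positions outside the sink prefix.
        keys = range(sink_tokens, q + 1)
        top = sorted(keys, key=lambda j: attn[q][j], reverse=True)[:top_k]
        dist = [q - j for j in top]  # lookback distance in tokens
        max_lb.append(max(dist))
        med_lb.append(median(dist))
        all_dist.extend(dist)
    avg_median = fmean(med_lb)
    return {
        "sequence_avg_max_lookback": fmean(max_lb),
        "sequence_avg_median_lookback": avg_median,
        # Assumption: variance is taken over all pooled top-k distances.
        "sequence_distance_variance": pvariance(all_dist),
        # Assumption: the fixed threshold is applied to the average median lookback.
        "sequence_label": int(avg_median >= threshold),
    }
```

In practice the attention matrix would come from the selected layers of the model; how the per-layer results are combined is not specified here, so the sketch handles one matrix at a time.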
## Labeling Details
- **Layers:** OLMo-3 [8, 16, 24], Qwen2.5 [7, 14, 21] (spread-3 across model depth)
- **Top-k:** 10 (top 10 attended positions by head-averaged attention probability)
- **Attention sink mitigation:** First 32 tokens ignored
- **Attention implementation:** SDPA with FORCE_CAUSAL_SDPA=1
- **OLMo-3 context extension:** Linear RoPE scaling (8x for 64K, 16x for 128K)
- **Source data:** LongMINO-prefiltered pool (code, books, arxiv, web, govreport)
- **Platform:** OLCF Frontier (AMD MI250X)
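
For readers unfamiliar with linear RoPE scaling, the sketch below shows the general idea: positions are divided by the scaling factor before the rotary angles are computed, so a longer sequence is squeezed into the position range the model already knows. Both factors listed above map onto the same range, since 65,536 / 8 = 131,072 / 16 = 8,192. The function name `rope_angles` and the `base` value of 10000 are placeholders for illustration, not the OLMo-3 configuration.

```python
def rope_angles(position, head_dim, factor=1.0, base=10000.0):
    """Rotary angles for one position under linear scaling (position interpolation)."""
    scaled = position / factor  # linear scaling: compress the position index
    # One angle per pair of dimensions, with geometrically decreasing frequency.
    return [scaled / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# With an 8x factor, position 65,535 gets the same angles as
# position 65,535 / 8 would get without scaling.
scaled_angles = rope_angles(65535, head_dim=8, factor=8.0)
unscaled_angles = rope_angles(65535 / 8, head_dim=8, factor=1.0)
```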
## Usage
The `sequence_label` field uses a fixed 2048-token threshold, which is not ideal for every model×pool combination. We recommend percentile-based thresholding on `sequence_avg_median_lookback` instead, e.g. treating the top 15% as positive.
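
A minimal sketch of percentile-based relabeling, assuming the `sequence_avg_median_lookback` values for one run have already been collected into a list. The helper name `percentile_labels` and the tie handling (values equal to the cutoff count as positive, so ties can push the positive share slightly above the target) are choices made for this example.

```python
def percentile_labels(scores, top_fraction=0.15):
    """Label the top `top_fraction` of scores as 1 and the rest as 0."""
    if not scores:
        return []
    # Number of positives to aim for, with at least one.
    n_pos = max(1, round(len(scores) * top_fraction))
    # The n_pos-th largest score becomes the cutoff.
    cutoff = sorted(scores, reverse=True)[n_pos - 1]
    return [int(s >= cutoff) for s in scores]
```

Because lookback distributions differ between models and source pools, the cutoff is best computed separately for each run, for example with `percentile_labels([r["sequence_avg_median_lookback"] for r in records])` where `records` holds one run's rows.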
## Note on record counts
Qwen2.5 64K has fewer records (~1,900) because the labeling pipeline applied a 2:1 neg:pos ratio cap, which dropped excess negatives. The raw attention metrics were computed for all 10,000 documents, but most were not written out. A re-run with the cap disabled is planned.
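
To make the effect of such a cap concrete, here is a hypothetical sketch of a neg:pos ratio cap. How the actual pipeline chose which negatives to keep is not documented here, so the seeded random sampling and the name `cap_negatives` are assumptions.

```python
import random

def cap_negatives(records, max_neg_per_pos=2, seed=0):
    """Keep all positives and at most `max_neg_per_pos` negatives per positive."""
    pos = [r for r in records if r["sequence_label"] == 1]
    neg = [r for r in records if r["sequence_label"] == 0]
    limit = max_neg_per_pos * len(pos)
    if len(neg) > limit:
        # Assumption: excess negatives are dropped uniformly at random.
        neg = random.Random(seed).sample(neg, limit)
    return pos + neg
```

Under a 2:1 cap, a run with P positives keeps at most 3P records, so a pool with few positives shrinks sharply regardless of how many documents were scored.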
Provider:
KevinDavidHayes



