KevinDavidHayes/long-context-attention-labels-64k-128k

Name: KevinDavidHayes/long-context-attention-labels-64k-128k
Creator: KevinDavidHayes
Published: 2026-04-20 17:24:12
License: No description provided

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/KevinDavidHayes/long-context-attention-labels-64k-128k

Description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- long-context
- attention-labeling
- lookback-distance
size_categories:
- 10K<n<100K
---

# Long-Context Attention Labels (64K & 128K)

Attention-based document labels for long-context training data selection.

## Overview

Each document is labeled with attention lookback metrics computed by running it through a model and measuring how far back each token attends (via top-k=10 head-averaged attention distances).

| Run | Model | Context | Records |
|-----|-------|---------|---------|
| olmo3_64k | OLMo-3-1025-7B (stage2) | 65,536 tokens | ~5,000 |
| olmo3_128k | OLMo-3-1025-7B (stage2) | 131,072 tokens | ~4,600 |
| qwen25_64k | Qwen2.5-7B-Instruct | 65,536 tokens | ~1,900 |
| qwen25_128k | Qwen2.5-7B-Instruct | 131,072 tokens | ~4,800 |

## Fields

| Field | Description |
|-------|-------------|
| `domain` | Source domain (code, books, arxiv, web, govreport) |
| `seq_len` | Sequence length in tokens |
| `sequence_label` | Binary label (1=long-range attention, 0=short-range) using fixed 2048-token threshold |
| `sequence_avg_max_lookback` | Mean of per-token max lookback distance across top-10 attended positions |
| `sequence_avg_median_lookback` | Mean of per-token median lookback distance (primary metric) |
| `sequence_distance_variance` | Variance of attention distances |

## Labeling Details

- **Layers:** OLMo-3 [8, 16, 24], Qwen2.5 [7, 14, 21] (spread-3 across model depth)
- **Top-k:** 10 (top 10 attended positions by head-averaged attention probability)
- **Attention sink mitigation:** First 32 tokens ignored
- **Attention implementation:** SDPA with FORCE_CAUSAL_SDPA=1
- **OLMo-3 context extension:** Linear RoPE scaling (8x for 64K, 16x for 128K)
- **Source data:** LongMINO-prefiltered pool (code, books, arxiv, web, govreport)
- **Platform:** OLCF Frontier (AMD MI250X)

## Usage

The `sequence_label` field uses a fixed threshold (2048 tokens), which is not ideal for all model×pool combinations. We recommend percentile-based thresholding on `sequence_avg_median_lookback` instead, e.g., top 15% as positive.

## Note on record counts

Qwen2.5 64K has fewer records (~1,900) because the labeling pipeline applied a neg:pos ratio cap (2:1), which dropped excess negatives. The raw attention metrics were computed for all 10,000 documents, but most were not written out. A re-run with the cap disabled is planned.
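
## Example: percentile-based labels

The percentile-based thresholding recommended under Usage can be sketched as follows. This is a minimal illustration, not the pipeline that produced the dataset: the helper name `percentile_labels`, the rounding rule for the number of positives, the tie handling, and the sample values are all assumptions made for the example. Only the field name `sequence_avg_median_lookback` and the 15% figure come from the card.

```python
def percentile_labels(median_lookbacks, top_fraction=0.15):
    """Mark the top `top_fraction` of documents by lookback as positive (1).

    Hypothetical helper: returns (labels, threshold), where `threshold` is
    the smallest lookback value that still counts as positive.
    """
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must be strictly between 0 and 1")

    values = [float(v) for v in median_lookbacks]
    if not values:
        return [], None

    # Assumption: round to the nearest document count, with at least one positive.
    n_positive = max(1, round(len(values) * top_fraction))

    # The n_positive-th largest value becomes the cutoff.
    threshold = sorted(values, reverse=True)[n_positive - 1]

    # Ties at the cutoff are all labeled positive, so the positive share can
    # slightly exceed top_fraction when values repeat.
    labels = [1 if v >= threshold else 0 for v in values]
    return labels, threshold


if __name__ == "__main__":
    # Made-up records from a single run; real values come from the dataset.
    records = [
        {"domain": "code", "sequence_avg_median_lookback": 512.0},
        {"domain": "books", "sequence_avg_median_lookback": 9100.5},
        {"domain": "arxiv", "sequence_avg_median_lookback": 3020.0},
        {"domain": "web", "sequence_avg_median_lookback": 140.2},
    ]
    scores = [r["sequence_avg_median_lookback"] for r in records]
    labels, threshold = percentile_labels(scores, top_fraction=0.15)
    for record, label in zip(records, labels):
        print(record["domain"], label)
```

Because the fixed 2048-token threshold is described as a poor fit for some model×pool combinations, it is probably sensible to compute the cutoff separately for each run (for example `olmo3_64k` and `qwen25_128k`) rather than over all records pooled together.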

Provider:

KevinDavidHayes
