
KevinDavidHayes/long-context-attention-labels-64k-128k

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KevinDavidHayes/long-context-attention-labels-64k-128k
Description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- long-context
- attention-labeling
- lookback-distance
size_categories:
- 10K<n<100K
---

# Long-Context Attention Labels (64K & 128K)

Attention-based document labels for long-context training data selection.

## Overview

Each document is labeled with attention lookback metrics computed by running it through a model and measuring how far back each token attends (via top-k=10 head-averaged attention distances).

| Run | Model | Context | Records |
|-----|-------|---------|---------|
| olmo3_64k | OLMo-3-1025-7B (stage2) | 65,536 tokens | ~5,000 |
| olmo3_128k | OLMo-3-1025-7B (stage2) | 131,072 tokens | ~4,600 |
| qwen25_64k | Qwen2.5-7B-Instruct | 65,536 tokens | ~1,900 |
| qwen25_128k | Qwen2.5-7B-Instruct | 131,072 tokens | ~4,800 |

## Fields

| Field | Description |
|-------|-------------|
| `domain` | Source domain (code, books, arxiv, web, govreport) |
| `seq_len` | Sequence length in tokens |
| `sequence_label` | Binary label (1=long-range attention, 0=short-range) using fixed 2048-token threshold |
| `sequence_avg_max_lookback` | Mean of per-token max lookback distance across top-10 attended positions |
| `sequence_avg_median_lookback` | Mean of per-token median lookback distance (primary metric) |
| `sequence_distance_variance` | Variance of attention distances |

## Labeling Details

- **Layers:** OLMo-3 [8, 16, 24], Qwen2.5 [7, 14, 21] (spread-3 across model depth)
- **Top-k:** 10 (top 10 attended positions by head-averaged attention probability)
- **Attention sink mitigation:** First 32 tokens ignored
- **Attention implementation:** SDPA with FORCE_CAUSAL_SDPA=1
- **OLMo-3 context extension:** Linear RoPE scaling (8x for 64K, 16x for 128K)
- **Source data:** LongMINO-prefiltered pool (code, books, arxiv, web, govreport)
- **Platform:** OLCF Frontier (AMD MI250X)

## Usage

The `sequence_label` field uses a fixed threshold (2048 tokens), which is not ideal for all model×pool combinations. We recommend percentile-based thresholding on `sequence_avg_median_lookback` instead (e.g., top 15% as positive).

## Note on record counts

Qwen2.5 64K has fewer records (~1,900) because the labeling pipeline applied a neg:pos ratio cap (2:1) which dropped excess negatives. The raw attention metrics were computed for all 10,000 documents but most were not written out. A re-run with the cap disabled is planned.
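
The percentile-based thresholding recommended under Usage can be sketched as follows. This is a minimal illustration, not the dataset author's pipeline: the helper name `percentile_labels`, the `top_fraction` parameter, and the toy records are all hypothetical, and it assumes the rows of a single run have already been loaded as a list of dictionaries that carry the `sequence_avg_median_lookback` field.

```python
from typing import Dict, List


def percentile_labels(
    records: List[Dict],
    top_fraction: float = 0.15,
    key: str = "sequence_avg_median_lookback",
) -> List[int]:
    """Label the top `top_fraction` of records (ranked by `key`) as 1, the rest as 0.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    if not records:
        return []
    # Number of positives; keep at least one so tiny pools still get a positive.
    n_pos = max(1, round(len(records) * top_fraction))
    # Rank record indices by descending lookback; equal values keep input order.
    ranked = sorted(range(len(records)), key=lambda i: records[i][key], reverse=True)
    positives = set(ranked[:n_pos])
    return [1 if i in positives else 0 for i in range(len(records))]


# Toy records standing in for rows of one run (values are made up).
records = [
    {"domain": "code", "sequence_avg_median_lookback": float(100 * i)}
    for i in range(20)
]
labels = percentile_labels(records)  # top 15% of 20 records -> 3 positives
```

Ranking within one run at a time (an assumption about how the recommendation would be applied) keeps the positive rate fixed per model and pool, which is the property the fixed 2048-token threshold lacks.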
Provider:
KevinDavidHayes