KevinDavidHayes/labeled-16k-attention
Source: Hugging Face · Updated: 2026-04-07 · Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/KevinDavidHayes/labeled-16k-attention
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
size_categories:
- 1K<n<10K
tags:
- long-context
- attention-analysis
- data-filtering
---
# Attention-Labeled 16K Documents
Documents labeled by **attention lookback distance** at 16K context length, intended for attention-based data filtering when training long-context language models.
## Overview
Each document was processed by Qwen2.5-Coder-7B-Instruct at a context length of 16,384 tokens, and the attention pattern at layer 27 was analyzed to measure how far back the model looks when processing the text. Documents where the model attends to distant tokens (high lookback) are labeled **positive/filtered**; these are hypothesized to be better training data for teaching long-context capabilities.
## Files
| File | Rows | Description |
|------|------|-------------|
| `filtered_128.parquet` | 1,164 | **Positive** — high attention lookback (threshold ≥ 128 tokens), diverse sources |
| `filtered_300.parquet` | 168 | **Positive** — stricter threshold (≥ 300 tokens) |
| `negative_128.parquet` | 1,085 | **Negative** — low attention lookback (below 128-token threshold) |
| `negative_300.parquet` | 168 | **Negative** — matched count for the 300-token threshold |
| `random_match128.parquet` | 1,164 | **Random baseline** — count-matched to filtered_128 |
| `random_match300.parquet` | 168 | **Random baseline** — count-matched to filtered_300 |
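Because the card tags the task as text classification, one way to use a positive file together with its negative counterpart is to concatenate them under a binary label. The following is a minimal sketch rather than the authors' pipeline: the `label` column and the `build_labeled_frame` helper are hypothetical names introduced here, and the toy frames stand in for the parquet files.

```python
import pandas as pd

def build_labeled_frame(positive: pd.DataFrame, negative: pd.DataFrame) -> pd.DataFrame:
    """Concatenate positive and negative documents with a binary label.

    Only the `text` column is kept, because the card lists it as the one
    field present in every file. The `label` column is an assumption of
    this sketch (1 = high lookback, 0 = low lookback).
    """
    pos = positive[["text"]].assign(label=1)
    neg = negative[["text"]].assign(label=0)
    return pd.concat([pos, neg], ignore_index=True)

# Toy stand-ins for filtered_128.parquet and negative_128.parquet.
toy_positive = pd.DataFrame({"text": ["doc a", "doc b"], "source": ["code", "arxiv"]})
toy_negative = pd.DataFrame({"text": ["doc c"]})
labeled = build_labeled_frame(toy_positive, toy_negative)
```

With the real files, the toy frames would be replaced by `pd.read_parquet(...)` on locally downloaded copies. Note that, per the table above, the 128-token pair is not count-balanced (1,164 positive vs. 1,085 negative rows), whereas the 300-token pair is (168 each).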
## Fields
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full document text (all files) |
| `source` | string | Data source domain (filtered/random files only) |
| `sequence_avg_median_lookback` | float | Median attention lookback distance averaged across sequence positions (filtered/random files only) |
| `sequence_avg_max_lookback` | float | Max attention lookback distance averaged across sequence positions (filtered/random files only) |
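The lookback fields make it possible to re-threshold or summarize the filtered and random files. The sketch below assumes the threshold is compared against `sequence_avg_median_lookback`, which the card names as the labeling metric; the `summarize_by_source` helper and the toy values are hypothetical and only illustrate the column layout.

```python
import pandas as pd

def summarize_by_source(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Count documents per source at or above a median-lookback threshold."""
    kept = df[df["sequence_avg_median_lookback"] >= threshold]
    return (
        kept.groupby("source")["sequence_avg_median_lookback"]
        .agg(count="count", mean_lookback="mean")
        .reset_index()
    )

# Toy rows using the documented field names; the values are made up.
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "source": ["code", "code", "books", "web"],
        "sequence_avg_median_lookback": [150.0, 350.0, 400.0, 90.0],
        "sequence_avg_max_lookback": [900.0, 2000.0, 3000.0, 500.0],
    }
)
summary_128 = summarize_by_source(toy, threshold=128)
summary_300 = summarize_by_source(toy, threshold=300)
```

Since `source` and the lookback fields exist only in the filtered and random files, this summary cannot be applied to the negative files as described.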
## Source Domains
Documents were drawn from a pool of long-form documents (≥16K tokens) from multiple domains:
| Domain | Description |
|--------|-------------|
| `code` | Source code from The Stack |
| `arxiv` | Academic papers |
| `web` | FineWeb web text |
| `legal` | Legal documents (MultiLexSum) |
| `books` | Books (PG-19) |
| `encyclopedia` | Wikipedia |
| `government` | Government reports |
## Labeling Details
- **Model:** Qwen2.5-Coder-7B-Instruct
- **Context length:** 16,384 tokens
- **Attention layer:** 27 (final layer, zero-indexed)
- **Metric:** Sequence-level average median lookback distance
- **Block size:** 4096 query × 4096 key
- **Filler tokens:** 128 (added to probe long-range attention)
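The card does not spell out how the lookback metric is computed, so the following is only one plausible reading, sketched under stated assumptions: a single head's causal attention matrix is given, the per-query lookback is the attention-weighted median distance to the attended keys, and the sequence score is the mean of those medians. Head aggregation, the 4096 × 4096 blocking, and the filler tokens are all ignored here, and the function names are hypothetical.

```python
import numpy as np

def median_lookback_per_query(attn: np.ndarray) -> np.ndarray:
    """Attention-weighted median lookback distance for each query position.

    `attn` is a (T, T) causal attention matrix whose row i holds the weights
    query i places on keys 0..i (entries above the diagonal are ignored).
    For query i, the lookback of key j is i - j. The weighted median is the
    smallest distance d such that keys within distance d hold at least half
    of the row's attention mass.
    """
    T = attn.shape[0]
    out = np.zeros(T)
    for i in range(T):
        # Reverse so that index d corresponds to key i - d (distance d).
        weights = attn[i, : i + 1][::-1]
        cum = np.cumsum(weights) / weights.sum()
        out[i] = np.searchsorted(cum, 0.5)
    return out

def sequence_avg_median_lookback(attn: np.ndarray) -> float:
    """Average the per-query median lookback over all sequence positions."""
    return float(median_lookback_per_query(attn).mean())

# Toy 4-token causal attention matrix (each row sums to 1).
toy_attn = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
        [0.0, 0.2, 0.8, 0.0],
        [0.7, 0.1, 0.1, 0.1],
    ]
)
per_query = median_lookback_per_query(toy_attn)
score = sequence_avg_median_lookback(toy_attn)
```

In this toy case the last query places most of its mass on the first token, so its median lookback is 3, while queries that attend mostly to themselves score 0. A document would then be labeled positive when its sequence-level score meets the chosen threshold (128 or 300 tokens).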
## Usage
```python
from datasets import load_dataset
# Load all splits
ds = load_dataset("KevinDavidHayes/labeled-16k-attention")
# Or load specific files
filtered = load_dataset("KevinDavidHayes/labeled-16k-attention", data_files="data/filtered_128.parquet")
```
## Citation
If you use this dataset, please cite our work on attention-based data filtering for long-context training.
Provider:
KevinDavidHayes



