KevinDavidHayes/labeled-16k-attention

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/KevinDavidHayes/labeled-16k-attention
Description:
---
license: mit
task_categories:
- text-classification
language:
- en
size_categories:
- 1K<n<10K
tags:
- long-context
- attention-analysis
- data-filtering
---

# Attention-Labeled 16K Documents

Documents labeled by **attention lookback distance** at 16K context length, used for attention-based data filtering for long-context language model training.

## Overview

Each document was processed through Qwen2.5-Coder-7B-Instruct at 16,384 tokens, and the attention pattern at layer 27 was analyzed to measure how far back the model looks when processing the text. Documents where the model attends to distant tokens (high lookback) are labeled **positive/filtered**; these are hypothesized to be better training data for teaching long-context capabilities.

## Files

| File | Rows | Description |
|------|------|-------------|
| `filtered_128.parquet` | 1,164 | **Positive**: high attention lookback (threshold ≥ 128 tokens), diverse sources |
| `filtered_300.parquet` | 168 | **Positive**: stricter threshold (≥ 300 tokens) |
| `negative_128.parquet` | 1,085 | **Negative**: low attention lookback (below 128-token threshold) |
| `negative_300.parquet` | 168 | **Negative**: matched count for the 300-token threshold |
| `random_match128.parquet` | 1,164 | **Random baseline**: count-matched to filtered_128 |
| `random_match300.parquet` | 168 | **Random baseline**: count-matched to filtered_300 |

## Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full document text (all files) |
| `source` | string | Data source domain (filtered/random files only) |
| `sequence_avg_median_lookback` | float | Median attention lookback distance averaged across sequence positions (filtered/random files only) |
| `sequence_avg_max_lookback` | float | Max attention lookback distance averaged across sequence positions (filtered/random files only) |

## Source Domains

Documents were drawn from a pool of long-form documents (≥16K tokens) from multiple domains:

| Domain | Description |
|--------|-------------|
| `code` | Source code from The Stack |
| `arxiv` | Academic papers |
| `web` | FineWeb web text |
| `legal` | Legal documents (MultiLexSum) |
| `books` | Books (PG-19) |
| `encyclopedia` | Wikipedia |
| `government` | Government reports |

## Labeling Details

- **Model:** Qwen2.5-Coder-7B-Instruct
- **Context length:** 16,384 tokens
- **Attention layer:** 27 (final layer)
- **Metric:** Sequence-level average median lookback distance
- **Block size:** 4096 query × 4096 key
- **Filler tokens:** 128 (added to probe long-range attention)

## Usage

```python
from datasets import load_dataset

# Load all splits
ds = load_dataset("KevinDavidHayes/labeled-16k-attention")

# Or load specific files
filtered = load_dataset(
    "KevinDavidHayes/labeled-16k-attention",
    data_files="data/filtered_128.parquet",
)
```

## Citation

If you use this dataset, please cite our work on attention-based data filtering for long-context training.
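
## Example: sketching the lookback metric

The card names the metric ("sequence-level average median lookback distance") but does not spell out how it is computed. The block below is only a minimal sketch under an assumed definition: for each query position, the lookback distance to a key is the number of tokens between them, the per-query median is the attention-weighted median of those distances, and the sequence score is the mean over query positions. The function names are hypothetical and are not taken from the dataset authors' code; the blocking (4096 × 4096) and filler tokens mentioned under Labeling Details are not modeled here.

```python
import numpy as np


def median_lookback_per_query(attn: np.ndarray) -> np.ndarray:
    """Attention-weighted median lookback distance per query position.

    ASSUMPTION: `attn` is a (T, T) causal attention matrix for one head
    (or averaged over heads); row i holds the weights over keys 0..i.
    """
    num_tokens = attn.shape[0]
    medians = np.zeros(num_tokens)
    for i in range(num_tokens):
        weights = attn[i, : i + 1]
        weights = weights / weights.sum()  # renormalize over visible keys
        # Reverse so that index d corresponds to lookback distance d = i - j.
        cumulative = np.cumsum(weights[::-1])
        # Smallest distance at which half of the attention mass is covered.
        medians[i] = np.searchsorted(cumulative, 0.5)
    return medians


def sequence_avg_median_lookback(attn: np.ndarray) -> float:
    """Average the per-query median lookback over all query positions."""
    return float(median_lookback_per_query(attn).mean())
```

Under this assumed definition, a document whose tokens attend only to themselves scores 0, while one whose tokens all attend to the first token scores roughly half the sequence length. A score like this would then be compared against the 128- or 300-token threshold listed in the Files table.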
Provider: KevinDavidHayes