SII-LancelotXie/DRIFT_LFRP
Source: Hugging Face. Updated 2026-03-05; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/SII-LancelotXie/DRIFT_LFRP
Description:
---
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- en
tags:
- curriculum-learning
- wikipedia
- drift
pretty_name: DRIFT_LFRP
size_categories:
- 100K<n<1M
configs:
- config_name: "64_128"
  data_files:
  - split: train
    path: 64_128/train.parquet
  - split: validation
    path: 64_128/validation.parquet
  - split: test
    path: 64_128/test.parquet
- config_name: "128_256"
  data_files:
  - split: train
    path: 128_256/train.parquet
  - split: validation
    path: 128_256/validation.parquet
  - split: test
    path: 128_256/test.parquet
- config_name: "256_512"
  data_files:
  - split: train
    path: 256_512/train.parquet
  - split: validation
    path: 256_512/validation.parquet
  - split: test
    path: 256_512/test.parquet
- config_name: "512_1024"
  data_files:
  - split: train
    path: 512_1024/train.parquet
  - split: validation
    path: 512_1024/validation.parquet
  - split: test
    path: 512_1024/test.parquet
- config_name: "1024_2048"
  data_files:
  - split: train
    path: 1024_2048/train.parquet
  - split: validation
    path: 1024_2048/validation.parquet
  - split: test
    path: 1024_2048/test.parquet
- config_name: "2048_4096"
  data_files:
  - split: train
    path: 2048_4096/train.parquet
  - split: validation
    path: 2048_4096/validation.parquet
  - split: test
    path: 2048_4096/test.parquet
- config_name: "4096_8192"
  data_files:
  - split: train
    path: 4096_8192/train.parquet
  - split: validation
    path: 4096_8192/validation.parquet
  - split: test
    path: 4096_8192/test.parquet
---
# DRIFT_LFRP
DRIFT_LFRP is a curriculum-style dataset constructed from the **English Wikipedia snapshot dated November 1, 2023**, as released by Wikimedia on Hugging Face: https://huggingface.co/datasets/wikimedia/wikipedia
Each example corresponds to a cleaned **Wikipedia entry segment**.
The dataset is organized by **token-length intervals**, where token counts are computed using **Qwen2Tokenizer**.
## Associated Paper
This dataset is the official resource for the paper:
**[Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference](https://arxiv.org/abs/2602.10021)**.
## Dataset Structure
Each subset corresponds to a token-length interval:
- 64_128
- 128_256
- 256_512
- 512_1024
Each subset contains:
- train
- validation
- test
Note: The repository also includes three longer subsets (1024-2048, 2048-4096, and 4096-8192 tokens). They are currently used to construct the QAFT (Query-Aware Fine-Tuning) task and are not part of the LFRP (Latent Fact Reconstruction Pretraining) data. Because they share the same data format, they are included here for consistency and future extensions.
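The subset names encode their token-length bounds, so a token count can be mapped back to a subset with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling: the helper names `parse_bounds` and `config_for_token_count` are hypothetical, and the half-open interval `[low, high)` is an assumption, since this card does not state which endpoint is inclusive.

```python
# Subset (config) names as listed in this card, shortest interval first.
CONFIGS = [
    "64_128", "128_256", "256_512", "512_1024",   # used for LFRP
    "1024_2048", "2048_4096", "4096_8192",        # currently used for QAFT
]


def parse_bounds(config_name):
    """Split a config name such as "64_128" into integer bounds (64, 128)."""
    low, high = config_name.split("_")
    return int(low), int(high)


def config_for_token_count(token_count):
    """Return the config whose interval contains token_count, or None.

    Assumption: intervals are half-open, [low, high). The card does not
    say which endpoint is inclusive, so adjust the comparison if needed.
    """
    for name in CONFIGS:
        low, high = parse_bounds(name)
        if low <= token_count < high:
            return name
    return None
```

Under the half-open assumption, a segment of exactly 128 tokens falls in `128_256` rather than `64_128`.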
## Data Fields
| Field | Type | Description |
|------|------|-------------|
| context | string | Wikipedia text segment |
| token_count | int | Number of tokens computed with Qwen2Tokenizer |
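
As a sanity check on the two fields above, a record can be validated against its subset name. This is a rough sketch under stated assumptions: `check_record` is a hypothetical helper, and it accepts both interval endpoints because the card does not specify which one is inclusive.

```python
def check_record(record, config_name):
    """Return a list of problems found in one example; empty means it looks fine."""
    problems = []
    context = record.get("context")
    token_count = record.get("token_count")

    if not isinstance(context, str) or not context:
        problems.append("context must be a non-empty string")

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(token_count, bool) or not isinstance(token_count, int):
        problems.append("token_count must be an int")
    else:
        low, high = (int(part) for part in config_name.split("_"))
        # Assumption: both endpoints are accepted (inclusivity is not documented).
        if not low <= token_count <= high:
            problems.append(
                f"token_count {token_count} is outside {low}-{high}"
            )
    return problems
```

For example, `check_record({"context": "Some text", "token_count": 100}, "64_128")` returns an empty list.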
## Usage
```python
from datasets import load_dataset

# Load the subset for one token-length interval, e.g. 64-128 tokens
dataset = load_dataset("SII-LancelotXie/DRIFT_LFRP", "64_128")

# Inspect the first training example
print(dataset["train"][0])
```
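Because the subsets are ordered by length, one way to use them for curriculum-style training is to visit them from the shortest interval to the longest. The sketch below illustrates that idea only; the shortest-first ordering is an assumption rather than the schedule described in the paper, and `iter_curriculum` is a hypothetical helper. The loader is passed in as an argument so the function can be exercised without network access.

```python
REPO_ID = "SII-LancelotXie/DRIFT_LFRP"

# The four subsets this card lists as the LFRP data.
LFRP_CONFIGS = ("64_128", "128_256", "256_512", "512_1024")


def iter_curriculum(load_fn, configs=LFRP_CONFIGS, split="train"):
    """Yield (config_name, dataset) pairs from the shortest interval to the longest.

    load_fn is expected to behave like datasets.load_dataset, i.e. it is
    called as load_fn(repo_id, config_name, split=split).
    """
    # Sort by the lower bound encoded in the config name, e.g. "64_128" -> 64.
    ordered = sorted(configs, key=lambda name: int(name.split("_")[0]))
    for name in ordered:
        yield name, load_fn(REPO_ID, name, split=split)
```

With the `datasets` library installed, `for name, ds in iter_curriculum(load_dataset): ...` would iterate over the four LFRP training splits in increasing length order.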
## Citation
If you find this dataset or the DRIFT framework useful in your research, please cite our work:
```bibtex
@misc{xie2026decoupledreasoningimplicitfact,
title={Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference},
author={Wenxuan Xie and Yujia Wang and Xin Tan and Chaochao Lu and Xia Hu and Xuhong Wang},
year={2026},
eprint={2602.10021},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.10021},
}
```
Provider:
SII-LancelotXie



