SII-LancelotXie/DRIFT_QAFT
Source: Hugging Face. Updated 2026-03-05; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/SII-LancelotXie/DRIFT_QAFT
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
language:
- en
tags:
- wikipedia
- drift
- qaft
- curriculum-learning
pretty_name: DRIFT_QAFT
configs:
- config_name: "1024-2048"
  data_files:
  - split: train
    path: "1024-2048/train_qa.jsonl"
  - split: validation
    path: "1024-2048/val_qa.jsonl"
  - split: test
    path: "1024-2048/test_qa.jsonl"
- config_name: "2048-4096"
  data_files:
  - split: train
    path: "2048-4096/train_qa.jsonl"
  - split: validation
    path: "2048-4096/val_qa.jsonl"
  - split: test
    path: "2048-4096/test_qa.jsonl"
- config_name: "4096-8192"
  data_files:
  - split: train
    path: "4096-8192/train_qa.jsonl"
  - split: validation
    path: "4096-8192/val_qa.jsonl"
  - split: test
    path: "4096-8192/test_qa.jsonl"
---
# DRIFT_QAFT Dataset
DRIFT_QAFT is a Question-Answering dataset designed for the **DRIFT** project, which focuses on decoupling knowledge and reasoning in large language models.
### Dataset Summary
This dataset is derived from the Wikipedia dataset released by Wikimedia on Hugging Face:
https://huggingface.co/datasets/wikimedia/wikipedia
The original data comes from Wikipedia snapshots provided by Wikimedia.
Entries are bucketed into three long-context intervals by their **Qwen2Tokenizer** token count: 1024-2048, 2048-4096, and 4096-8192 tokens. Each interval is exposed as a separate config.
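As a rough illustration of this bucketing, the sketch below maps a token count to one of the three interval names. It is not the authors' preprocessing code: treating each interval as a half-open range `[low, high)` is an assumption, and the helper names `bucket_for` and `count_tokens` are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Interval bounds are taken from the config names. Treating each one as a
# half-open range [low, high) is an assumption; the card does not say how
# boundary values are handled.
BUCKETS = ((1024, 2048), (2048, 4096), (4096, 8192))


def bucket_for(n_tokens: int) -> Optional[str]:
    """Return the config name whose range contains n_tokens, or None."""
    for low, high in BUCKETS:
        if low <= n_tokens < high:
            return f"{low}-{high}"
    return None


def count_tokens(text: str, tokenize: Callable[[str], Sequence]) -> int:
    """Count tokens using any callable that turns text into a token sequence."""
    return len(tokenize(text))
```

In practice, the `encode` method of a Qwen2 tokenizer could be passed as `tokenize`; the card does not name the exact checkpoint, so that choice is left open here.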
## Associated Paper
This dataset is the official resource for the paper:
**[Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference](https://arxiv.org/abs/2602.10021)**.
### Features
Each entry contains:
- **Document**: The source Wikipedia entry segment.
- **Question**: An LLM-generated question based on the document.
- **Answer**: An LLM-generated answer to the question.
- **Evidence**: LLM-labeled segments from the document that support the answer.
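Because the evidence is described as segments taken from the document, one simple sanity check is to confirm that every evidence segment occurs verbatim in its document. The sketch below assumes lowercase key names (`document`, `question`, `answer`, `evidence`) and that evidence is either a string or a list of strings; neither is confirmed by the card, and the sample record is a made-up toy, not dataset content.

```python
from typing import Any, Dict, List

# Hypothetical key names: the actual JSONL keys and their casing may differ.
FIELDS = ("document", "question", "answer", "evidence")


def as_list(value: Any) -> List[str]:
    """Normalize evidence to a list, whether stored as a string or a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


def unsupported_evidence(record: Dict[str, Any]) -> List[str]:
    """Return the evidence segments that do not appear verbatim in the document."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise KeyError(f"record lacks fields: {missing}")
    document = record["document"]
    return [seg for seg in as_list(record["evidence"]) if seg not in document]


# Toy record for illustration only; not taken from the dataset.
sample = {
    "document": "Paris is the capital of France. It lies on the Seine.",
    "question": "Which river does Paris lie on?",
    "answer": "The Seine",
    "evidence": ["It lies on the Seine."],
}
```

An empty result from `unsupported_evidence(sample)` means every segment was found; LLM-labeled evidence may be paraphrased in places, so a non-empty result is not necessarily an error.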
### Usage
```python
from datasets import load_dataset
# Load a specific interval
dataset = load_dataset("SII-LancelotXie/DRIFT_QAFT", "1024-2048")
print(dataset["train"][0])
```
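To work with all three intervals at once, for example to order them from shortest to longest, the configs can be loaded in a loop. This is a minimal sketch: `load_all_intervals` is a hypothetical helper, and the loader is passed in as an argument so the function can be exercised without network access.

```python
from typing import Any, Callable, Dict

REPO_ID = "SII-LancelotXie/DRIFT_QAFT"
# Config names as declared in the dataset card, shortest interval first.
CONFIGS = ("1024-2048", "2048-4096", "4096-8192")


def load_all_intervals(loader: Callable[[str, str], Any]) -> Dict[str, Any]:
    """Call loader(repo_id, config_name) for each interval and collect the results."""
    return {name: loader(REPO_ID, name) for name in CONFIGS}


# Typical use, assuming the `datasets` library is installed:
#   from datasets import load_dataset
#   intervals = load_all_intervals(load_dataset)
#   print(intervals["2048-4096"]["validation"][0])
```

Python dictionaries preserve insertion order, so iterating over the result visits the intervals in increasing length.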
## Citation
If you find this dataset or the DRIFT framework useful in your research, please cite our work:
```bibtex
@misc{xie2026decoupledreasoningimplicitfact,
title={Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference},
author={Wenxuan Xie and Yujia Wang and Xin Tan and Chaochao Lu and Xia Hu and Xuhong Wang},
year={2026},
eprint={2602.10021},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.10021},
}
```
Provider:
SII-LancelotXie



