SII-LancelotXie/DRIFT_QAFT

Name: SII-LancelotXie/DRIFT_QAFT
Creator: SII-LancelotXie
Published: 2026-03-05 07:56:01
License: No description provided

Source: Hugging Face; updated 2026-03-05; indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/SII-LancelotXie/DRIFT_QAFT


Resource description:

---
license: cc-by-sa-3.0
task_categories:
- question-answering
language:
- en
tags:
- wikipedia
- drift
- qaft
- curriculum-learning
pretty_name: DRIFT_QAFT
configs:
- config_name: "1024-2048"
  data_files:
  - split: train
    path: "1024-2048/train_qa.jsonl"
  - split: validation
    path: "1024-2048/val_qa.jsonl"
  - split: test
    path: "1024-2048/test_qa.jsonl"
- config_name: "2048-4096"
  data_files:
  - split: train
    path: "2048-4096/train_qa.jsonl"
  - split: validation
    path: "2048-4096/val_qa.jsonl"
  - split: test
    path: "2048-4096/test_qa.jsonl"
- config_name: "4096-8192"
  data_files:
  - split: train
    path: "4096-8192/train_qa.jsonl"
  - split: validation
    path: "4096-8192/val_qa.jsonl"
  - split: test
    path: "4096-8192/test_qa.jsonl"
---

# DRIFT_QAFT Dataset

DRIFT_QAFT is a question-answering dataset designed for the **DRIFT** project, which focuses on decoupling knowledge and reasoning in large language models.

### Dataset Summary

This dataset is derived from the Wikipedia dataset released by Wikimedia on Hugging Face: https://huggingface.co/datasets/wikimedia/wikipedia

The original data comes from Wikipedia snapshots provided by Wikimedia. Entries are bucketed into long-context intervals based on the **Qwen2Tokenizer** token count.

## Associated Paper

This dataset is the official resource for the paper: **[Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference](https://arxiv.org/abs/2602.10021)**.

### Features

Each entry contains:

- **Document**: The source Wikipedia entry segment.
- **Question**: An LLM-generated question based on the document.
- **Answer**: An LLM-generated answer to the question.
- **Evidence**: LLM-labeled segments from the document that support the answer.
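The token-count bucketing mentioned in the Dataset Summary can be illustrated with a small helper. This is only a sketch, not the authors' preprocessing code: the interval bounds are read off the config names, while the half-open `[low, high)` convention and the names `bucket_for` and `bucket_for_text` are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Interval bounds taken from the config names on the dataset card.
BUCKETS: List[Tuple[int, int]] = [(1024, 2048), (2048, 4096), (4096, 8192)]


def bucket_for(n_tokens: int) -> Optional[str]:
    """Return the config name whose interval contains n_tokens, else None.

    Assumption: intervals are half-open, [low, high). The card does not
    state which endpoint is inclusive.
    """
    for low, high in BUCKETS:
        if low <= n_tokens < high:
            return f"{low}-{high}"
    return None


def bucket_for_text(text: str, count_tokens: Callable[[str], int]) -> Optional[str]:
    """Bucket a document given any token-counting function.

    count_tokens is supplied by the caller, for example a function that
    returns the length of a Qwen2 tokenizer's encoding of the text.
    """
    return bucket_for(count_tokens(text))
```

Passing the token counter as a parameter keeps the helper independent of any particular tokenizer checkpoint, which the card does not specify beyond naming **Qwen2Tokenizer**.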
### Usage

```python
from datasets import load_dataset

# Load a specific interval
dataset = load_dataset("SII-LancelotXie/DRIFT_QAFT", "1024-2048")
print(dataset["train"][0])
```

## Citation

If you find this dataset or the DRIFT framework useful in your research, please cite our work:

```bibtex
@misc{xie2026decoupledreasoningimplicitfact,
  title={Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference},
  author={Wenxuan Xie and Yujia Wang and Xin Tan and Chaochao Lu and Xia Hu and Xuhong Wang},
  year={2026},
  eprint={2602.10021},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10021},
}
```
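As a complement to the Usage snippet, the sketch below enumerates the three configs and their split files as declared in the card's front matter, and loads every interval in one pass. The helper names `data_files` and `load_all` are hypothetical, and `load_all` needs network access and the `datasets` package.

```python
from typing import Dict

REPO_ID = "SII-LancelotXie/DRIFT_QAFT"

# Config names and split file names as listed in the card's YAML front matter.
CONFIGS = ["1024-2048", "2048-4096", "4096-8192"]
SPLIT_FILES = {
    "train": "train_qa.jsonl",
    "validation": "val_qa.jsonl",
    "test": "test_qa.jsonl",
}


def data_files(config: str) -> Dict[str, str]:
    """Map each split to its repository-relative JSONL path for one config."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return {split: f"{config}/{name}" for split, name in SPLIT_FILES.items()}


def load_all():
    """Load every interval; returns a dict of config name -> DatasetDict."""
    # Imported lazily so the path helper above works without `datasets`.
    from datasets import load_dataset

    return {config: load_dataset(REPO_ID, config) for config in CONFIGS}
```

Calling `load_all()["2048-4096"]["validation"]` would then give the validation split of the middle interval, assuming the repository layout still matches the front matter.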

Provided by:

SII-LancelotXie
