
SII-LancelotXie/DRIFT_QAFT

Source: Hugging Face. Updated 2026-03-05. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/SII-LancelotXie/DRIFT_QAFT
Resource overview:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
language:
- en
tags:
- wikipedia
- drift
- qaft
- curriculum-learning
pretty_name: DRIFT_QAFT
configs:
- config_name: "1024-2048"
  data_files:
  - split: train
    path: "1024-2048/train_qa.jsonl"
  - split: validation
    path: "1024-2048/val_qa.jsonl"
  - split: test
    path: "1024-2048/test_qa.jsonl"
- config_name: "2048-4096"
  data_files:
  - split: train
    path: "2048-4096/train_qa.jsonl"
  - split: validation
    path: "2048-4096/val_qa.jsonl"
  - split: test
    path: "2048-4096/test_qa.jsonl"
- config_name: "4096-8192"
  data_files:
  - split: train
    path: "4096-8192/train_qa.jsonl"
  - split: validation
    path: "4096-8192/val_qa.jsonl"
  - split: test
    path: "4096-8192/test_qa.jsonl"
---

# DRIFT_QAFT Dataset

DRIFT_QAFT is a Question-Answering dataset designed for the **DRIFT** project, which focuses on decoupling knowledge and reasoning in large language models.

### Dataset Summary

This dataset is derived from the Wikipedia dataset released by Wikimedia on Hugging Face: https://huggingface.co/datasets/wikimedia/wikipedia

The original data comes from Wikipedia snapshots provided by Wikimedia. Entries are bucketed into long-context intervals based on the **Qwen2Tokenizer** token count.

## Associated Paper

This dataset is the official resource for the paper: **[Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference](https://arxiv.org/abs/2602.10021)**.

### Features

Each entry contains:

- **Document**: The source Wikipedia entry segment.
- **Question**: An LLM-generated question based on the document.
- **Answer**: An LLM-generated answer to the question.
- **Evidence**: LLM-labeled segments from the document that support the answer.

### Usage

```python
from datasets import load_dataset

# Load a specific interval
dataset = load_dataset("SII-LancelotXie/DRIFT_QAFT", "1024-2048")
print(dataset["train"][0])
```

## Citation

If you find this dataset or the DRIFT framework useful in your research, please cite our work:

```bibtex
@misc{xie2026decoupledreasoningimplicitfact,
  title={Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference},
  author={Wenxuan Xie and Yujia Wang and Xin Tan and Chaochao Lu and Xia Hu and Xuhong Wang},
  year={2026},
  eprint={2602.10021},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10021},
}
```
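
### Choosing a config from a token count

The Dataset Summary says entries are bucketed by token count, and the three config names look like interval bounds. The sketch below shows one way to map a token count to a config name. The helper names `parse_interval` and `bucket_for` are hypothetical, and the half-open interval convention is an assumption: the dataset card does not state which side of each boundary is inclusive.

```python
from typing import Optional, Tuple

# Config names as published in the dataset card; each appears to encode
# a token-count interval "low-high".
CONFIG_NAMES = ["1024-2048", "2048-4096", "4096-8192"]


def parse_interval(config_name: str) -> Tuple[int, int]:
    """Split a config name such as "1024-2048" into integer bounds."""
    low, high = config_name.split("-")
    return int(low), int(high)


def bucket_for(token_count: int) -> Optional[str]:
    """Return the config whose interval contains token_count, else None.

    Assumption: intervals are half-open, [low, high). The dataset card
    does not specify the boundary convention.
    """
    for name in CONFIG_NAMES:
        low, high = parse_interval(name)
        if low <= token_count < high:
            return name
    return None


if __name__ == "__main__":
    for count in (512, 1500, 2048, 8192):
        print(count, "->", bucket_for(count))
```

The token count itself would come from a Qwen2 tokenizer applied to the document text. The card does not say which tokenizer checkpoint was used, so counts computed locally may not match the published buckets exactly.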
Provider:
SII-LancelotXie