
SII-LancelotXie/DRIFT_LFRP

Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/SII-LancelotXie/DRIFT_LFRP
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- en
tags:
- curriculum-learning
- wikipedia
- drift
pretty_name: DRIFT_LFRP
size_categories:
- 100K<n<1M
configs:
- config_name: "64_128"
  data_files:
  - split: train
    path: 64_128/train.parquet
  - split: validation
    path: 64_128/validation.parquet
  - split: test
    path: 64_128/test.parquet
- config_name: "128_256"
  data_files:
  - split: train
    path: 128_256/train.parquet
  - split: validation
    path: 128_256/validation.parquet
  - split: test
    path: 128_256/test.parquet
- config_name: "256_512"
  data_files:
  - split: train
    path: 256_512/train.parquet
  - split: validation
    path: 256_512/validation.parquet
  - split: test
    path: 256_512/test.parquet
- config_name: "512_1024"
  data_files:
  - split: train
    path: 512_1024/train.parquet
  - split: validation
    path: 512_1024/validation.parquet
  - split: test
    path: 512_1024/test.parquet
- config_name: "1024_2048"
  data_files:
  - split: train
    path: 1024_2048/train.parquet
  - split: validation
    path: 1024_2048/validation.parquet
  - split: test
    path: 1024_2048/test.parquet
- config_name: "2048_4096"
  data_files:
  - split: train
    path: 2048_4096/train.parquet
  - split: validation
    path: 2048_4096/validation.parquet
  - split: test
    path: 2048_4096/test.parquet
- config_name: "4096_8192"
  data_files:
  - split: train
    path: 4096_8192/train.parquet
  - split: validation
    path: 4096_8192/validation.parquet
  - split: test
    path: 4096_8192/test.parquet
---

# DRIFT_LFRP

DRIFT_LFRP is a curriculum-style dataset constructed from the **English Wikipedia snapshot dated November 1, 2023**. The source snapshot is released by Wikimedia on Hugging Face: https://huggingface.co/datasets/wikimedia/wikipedia

Each example corresponds to a cleaned **Wikipedia entry segment**. The dataset is organized by **token-length intervals**, where token counts are computed using **Qwen2Tokenizer**.

## Associated Paper

This dataset is the official resource for the paper: **[Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference](https://arxiv.org/abs/2602.10021)**.

## Dataset Structure

Each subset corresponds to a token-length interval:

- 64_128
- 128_256
- 256_512
- 512_1024

Each subset contains:

- train
- validation
- test

Note: The additional subsets (1024-2048, 2048-4096, and 4096-8192 tokens) are currently used for constructing the QAFT (Query-Aware Fine-Tuning) task and were not used in the LFRP (Latent Fact Reconstruction Pretraining) data. Because they share the same data format, they are included here for consistency and future extensions.

## Data Fields

| Field | Type | Description |
|------|------|-------------|
| context | string | Wikipedia text segment |
| token_count | int | Number of tokens computed with Qwen2Tokenizer |

## Usage

```python
from datasets import load_dataset

# Load the subset for a specific token-length interval, e.g. 64-128
dataset = load_dataset("SII-LancelotXie/DRIFT_LFRP", "64_128")

# Inspect the first training example
print(dataset["train"][0])
```

## Citation

If you find this dataset or the DRIFT framework useful in your research, please cite our work:

```bibtex
@misc{xie2026decoupledreasoningimplicitfact,
  title={Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference},
  author={Wenxuan Xie and Yujia Wang and Xin Tan and Chaochao Lu and Xia Hu and Xuhong Wang},
  year={2026},
  eprint={2602.10021},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10021},
}
```
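Because the subsets are named by token-length interval, they can be walked from shortest to longest when a curriculum-style pass is wanted. The following is a minimal sketch of that idea, not the authors' training schedule: the helper names `parse_interval`, `curriculum_order`, and `load_curriculum` are hypothetical, the short-to-long ordering is an assumption, and the card does not state whether the interval bounds are inclusive, so the parsed numbers should be read as nominal.

```python
from typing import List, Tuple

# Config names as listed in the dataset card's front matter.
LFRP_CONFIGS = ["64_128", "128_256", "256_512", "512_1024"]
# Longer subsets that the card says are used for QAFT rather than LFRP.
EXTRA_CONFIGS = ["1024_2048", "2048_4096", "4096_8192"]


def parse_interval(config_name: str) -> Tuple[int, int]:
    """Split a config name such as "64_128" into integer (low, high) bounds."""
    low, high = config_name.split("_")
    return int(low), int(high)


def curriculum_order(config_names: List[str]) -> List[str]:
    """Sort config names from the shortest to the longest token interval."""
    # Sorting on the parsed integers avoids the string ordering trap
    # where "1024_2048" would come before "128_256".
    return sorted(config_names, key=parse_interval)


def load_curriculum(config_names: List[str], split: str = "train"):
    """Yield (config_name, dataset) pairs in short-to-long order.

    Needs the `datasets` package and network access (hypothetical helper).
    """
    # Imported lazily so the pure-Python helpers above work offline.
    from datasets import load_dataset

    for name in curriculum_order(config_names):
        yield name, load_dataset("SII-LancelotXie/DRIFT_LFRP", name, split=split)
```

Iterating over `load_curriculum(LFRP_CONFIGS)` would then give the four LFRP subsets one at a time, starting with `64_128`; passing `LFRP_CONFIGS + EXTRA_CONFIGS` extends the same pass to the longer subsets.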
Provided by:
SII-LancelotXie