
Yukang/LongAlpaca-12k

Source: Hugging Face · Updated: 2023-10-11 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/Yukang/LongAlpaca-12k
Resource overview:
# LongLoRA and LongAlpaca for Long-context LLMs

[![Huggingface Models](https://img.shields.io/badge/Models-Huggingface%20Models-bron)](https://huggingface.co/Yukang)
[![Github](https://img.shields.io/badge/Github-Repo-cyan)](https://github.com/dvlab-research/LongLoRA)
[![Data](https://img.shields.io/badge/Data-LongAlpaca%2012k-light)](https://huggingface.co/datasets/Yukang/LongAlpaca-12k)
[![Paper](https://img.shields.io/badge/Paper-Arvix-blue)](https://arxiv.org/abs/2309.12307)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](https://github.com/dvlab-research/LongLoRA/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-orange.svg)](https://github.com/dvlab-research/LongLoRA/blob/main/DATA_LICENSE)
[![Weight License](https://img.shields.io/badge/Weight%20License-CC%20By%20NC%204.0-red)](https://github.com/dvlab-research/LongLoRA/blob/main/WEIGHT_LICENSE)

For detailed usage and code, please visit the [GitHub project](https://github.com/dvlab-research/LongLoRA).

## TABLE OF CONTENTS
1. [News](#news)
2. [Examples](#examples)
3. [Highlights](#highlights)
4. [How to contribute](#how-to-contribute)
5. [Requirements](#usage-requirements)
6. [Installation and quick guide](#installation-and-quick-guide)
7. [LongAlpaca Data](#longalpaca-data)
8. [Models](#models)
9. [Training](#training)
10. [Evaluation](#evaluation)
11. [Demo](#demo)
12. [Data Generation via Pdf2Text](#data-generation-via-pdf2text)
13. [Citation](#citation)
14. [Acknowledgement](#acknowledgement)
15. [License](#license)

## News
- [x] [2023.10.8] **We release the long instruction-following dataset**, [LongAlpaca-12k](https://huggingface.co/datasets/Yukang/LongAlpaca-12k) and **the corresponding models**, [LongAlpaca-7B](https://huggingface.co/Yukang/LongAlpaca-7B), [LongAlpaca-13B](https://huggingface.co/Yukang/LongAlpaca-13B), and [LongAlpaca-70B](https://huggingface.co/Yukang/LongAlpaca-70B).
  - (*The previous sft models*, [Llama-2-13b-chat-longlora-32k-sft](https://huggingface.co/Yukang/Llama-2-13b-chat-longlora-32k-sft) and [Llama-2-70b-chat-longlora-32k-sft](https://huggingface.co/Yukang/Llama-2-70b-chat-longlora-32k-sft), *have been deprecated*.)
- [x] [2023.10.3] We added support for GPTNeoX models. Please refer to this [PR](https://github.com/dvlab-research/LongLoRA/pull/32) for usage. Thanks to @naubull2 for this contribution.
- [x] [2023.9.22] We release all our fine-tuned [models](https://huggingface.co/Yukang), including **70B-32k models**, [LLaMA2-LongLoRA-70B-32k](https://huggingface.co/Yukang/Llama-2-70b-longlora-32k), [LLaMA2-LongLoRA-7B-100k](https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft). Welcome to check them out!
- [x] [2023.9.22] We release the [Paper](http://arxiv.org/abs/2309.12307) and this GitHub repo, including training and evaluation code.

**LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models [[Paper](http://arxiv.org/abs/2309.12307)]** <br />
[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ&hl=en),
[Shengju Qian](https://scholar.google.com/citations?user=QNnWmasAAAAJ),
[Haotian Tang](https://scholar.google.com/citations?user=WxL13BAAAAAJ&hl),
[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl=zh-CN),
[Zhijian Liu](https://scholar.google.com/citations?user=3coYSTUAAAAJ&hl=en),
[Song Han](https://scholar.google.com/citations?user=E0iCaa4AAAAJ&hl=zh-CN),
[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)<br />

## Highlights
1. In the LongLoRA approach, the proposed shifted short attention is easy to implement, compatible with Flash-Attention, and not required during inference.
2. We released all our models, ranging from 7B to 70B with context lengths from 8k to 100k, including [LLaMA2-LongLoRA-7B-100k](https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft), [LLaMA2-LongLoRA-13B-64k](https://huggingface.co/Yukang/Llama-2-13b-longlora-64k), and [LLaMA2-LongLoRA-70B-32k](https://huggingface.co/Yukang/Llama-2-70b-longlora-32k).
3. We built a long-context instruction-following dataset, [LongAlpaca-12k](#longalpaca-data). We released the corresponding [LongAlpaca-7B](https://huggingface.co/Yukang/LongAlpaca-7B), [LongAlpaca-13B](https://huggingface.co/Yukang/LongAlpaca-13B) and [LongAlpaca-70B](https://huggingface.co/Yukang/LongAlpaca-70B) models. To the best of our knowledge, this is the first open-sourced long-context 70B model.

## How to Contribute
- Make sure to have git installed.
- Create your own [fork](https://github.com/dvlab-research/LongLoRA/fork) of the project.
- Clone the repository on your local machine, using git clone and pasting the url of this project.
- Read both the `Requirements` and `Installation and Quick Guide` sections below.
- Commit and push your changes.
- Make a pull request when finished modifying the project.

## Usage Requirements
To download and use the [pre-trained weights](#pre-trained-weights) you will need:
1. A Hugging Face (HF) account with a valid email. Note that the email used for HF must also be used for the license agreement.
2. To accept the Meta [license and acceptable use policy](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

## Installation and Quick Guide
To install and run the application:
1. [Fork this repo](https://github.com/dvlab-research/LongLoRA/fork) on GitHub.
2. Clone the repository on your local machine, using git clone and pasting the url of this project.
3. Run the following code:
   ```
   pip install -r requirements.txt
   pip install flash-attn --no-build-isolation
   ```
4. Use either a [released model](#models) or [fine-tune](#fine-tuning) a model to fit your preferences.
5. Test your model by chat.
6. Deploy your own demo.

## LongAlpaca Data

LongAlpaca-12k contains 9k long QA samples that we collected and 3k short QA samples drawn from the original [Alpaca data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json). The short samples are included so that the model does not degrade at short instruction following. The data we collected covers various types and amounts, as shown in the following figure.

| Data | Short QA | Long QA | Total | Download |
|:---------------|----------|----------|----------|----------|
| LongAlpaca-12k | 3k | 9k | 12k | [Link](https://huggingface.co/datasets/Yukang/LongAlpaca-12k) |

Following the original Alpaca format, our Long QA data uses the following prompts for fine-tuning:
- `instruction`: `str`, describes the task the model should perform, for example answering a question after reading a book section or paper. We vary the contents and questions to make the instructions diverse.
- `output`: `str`, the answer to the instruction.

For simplicity, we did not use the `input` field of the Alpaca format.
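
The record layout described above can be checked with a few lines of Python. This is a minimal sketch, not code from the repository: it assumes the downloaded file is a JSON list of objects, and the helper names `load_records`, `validate_records`, and `split_by_length`, as well as the character threshold, are hypothetical.

```python
import json
from typing import Dict, List, Tuple

Record = Dict[str, str]


def validate_records(records: List[Record]) -> List[Record]:
    """Check that every record carries string `instruction` and `output` fields."""
    for i, rec in enumerate(records):
        for key in ("instruction", "output"):
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i} is missing a string '{key}' field")
    return records


def load_records(path: str) -> List[Record]:
    """Load a file assumed to be a JSON list of instruction/output records."""
    with open(path, "r", encoding="utf-8") as f:
        return validate_records(json.load(f))


def split_by_length(records: List[Record], threshold_chars: int) -> Tuple[List[Record], List[Record]]:
    """Partition records into (short, long) by instruction length in characters.

    The threshold is an illustrative choice, not a value from the source.
    """
    short = [r for r in records if len(r["instruction"]) < threshold_chars]
    long_ = [r for r in records if len(r["instruction"]) >= threshold_chars]
    return short, long_
```

Calling `load_records("LongAlpaca-12k.json")` and then `split_by_length(records, threshold_chars=...)` gives a rough view of the short/long mix; the exact cut-off separating the 3k short from the 9k long samples is not stated in the source.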
## Models

### Models with supervised fine-tuning
| Model | Size | Context | Train | Link |
|:---------------|------|---------|---------|------|
| LongAlpaca-7B | 7B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/LongAlpaca-7B) |
| LongAlpaca-13B | 13B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/LongAlpaca-13B) |
| LongAlpaca-70B | 70B | 32768 | LoRA+ | [Model](https://huggingface.co/Yukang/LongAlpaca-70B) [(LoRA-weight)](https://huggingface.co/Yukang/LongAlpaca-70B-lora) |

### Models with context extension via fully fine-tuning
| Model | Size | Context | Train | Link |
|:----------------------------|------|---------|---------|------|
| Llama-2-7b-longlora-8k-ft | 7B | 8192 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-8k-ft) |
| Llama-2-7b-longlora-16k-ft | 7B | 16384 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-16k-ft) |
| Llama-2-7b-longlora-32k-ft | 7B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-32k-ft) |
| Llama-2-7b-longlora-100k-ft | 7B | 100000 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft) |
| Llama-2-13b-longlora-8k-ft | 13B | 8192 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-13b-longlora-8k-ft) |
| Llama-2-13b-longlora-16k-ft | 13B | 16384 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-13b-longlora-16k-ft) |
| Llama-2-13b-longlora-32k-ft | 13B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-13b-longlora-32k-ft) |

### Models with context extension via improved LoRA fine-tuning
| Model | Size | Context | Train | Link |
|:------------------------------|------|---------|-------|------|
| Llama-2-7b-longlora-8k | 7B | 8192 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-7b-longlora-8k) |
| Llama-2-7b-longlora-16k | 7B | 16384 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-7b-longlora-16k) |
| Llama-2-7b-longlora-32k | 7B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-7b-longlora-32k) |
| Llama-2-13b-longlora-8k | 13B | 8192 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-8k) |
| Llama-2-13b-longlora-16k | 13B | 16384 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-16k) |
| Llama-2-13b-longlora-32k | 13B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-32k) |
| Llama-2-13b-longlora-64k | 13B | 65536 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-64k) |
| Llama-2-70b-longlora-32k | 70B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-70b-longlora-32k) |
| Llama-2-70b-chat-longlora-32k | 70B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-70b-chat-longlora-32k) |

## Training

### Pre-trained weights
We use LLaMA2 models as the pre-trained weights and fine-tune them to long context window sizes. Download based on your choices.

| Pre-trained weights |
|:-------------------------------------------------------------------------------------|
| [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
| [Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf) |
| [Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) |
| [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
| [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) |
| [Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) |

This project also supports GPTNeoX models as the base model architecture.
Some candidate pre-trained weights include [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b), [Polyglot-ko-12.8B](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) and other variants.

### Fine-tuning
```
torchrun --nproc_per_node=8 fine-tune.py \
        --model_name_or_path path_to/Llama-2-7b-hf \
        --bf16 True \
        --output_dir path_to_saving_checkpoints \
        --cache_dir path_to_cache \
        --model_max_length 8192 \
        --use_flash_attn True \
        --low_rank_training False \
        --num_train_epochs 1 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 2 \
        --learning_rate 2e-5 \
        --weight_decay 0.0 \
        --warmup_steps 20 \
        --lr_scheduler_type "constant_with_warmup" \
        --logging_steps 1 \
        --deepspeed "ds_configs/stage2.json" \
        --tf32 True \
        --max_steps 1000
```

- Remember to change `path_to/Llama-2-7b-hf`, `path_to_saving_checkpoints`, and `path_to_cache` to your own directories.
- You can change `model_max_length` to other values.
- You can change `ds_configs/stage2.json` to `ds_configs/stage3.json` if you want.
- Set `use_flash_attn` to `False` if you use V100 machines or have not installed flash attention.
- Set `low_rank_training` to `False` if you want full fine-tuning. It costs more GPU memory and is slower, but the performance is a bit better.
- When training is finished, to get the full model weight:
```
cd path_to_saving_checkpoints && python zero_to_fp32.py . pytorch_model.bin
```

### Supervised Fine-tuning
```
torchrun --nproc_per_node=8 supervised-fine-tune.py \
        --model_name_or_path path_to_Llama2_chat_models \
        --bf16 True \
        --output_dir path_to_saving_checkpoints \
        --model_max_length 32768 \
        --use_flash_attn True \
        --data_path LongAlpaca-12k.json \
        --low_rank_training True \
        --num_train_epochs 3 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 1 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 2 \
        --learning_rate 2e-5 \
        --weight_decay 0.0 \
        --warmup_steps 20 \
        --lr_scheduler_type "constant_with_warmup" \
        --logging_steps 1 \
        --deepspeed "ds_configs/stage2.json" \
        --tf32 True
```

- There is no need to run supervised fine-tuning on top of the context-extended fine-tuned models. It is fine to use the Llama2-chat models directly as the base model, because the amount of long instruction-following data is enough for SFT.
- Our long instruction-following data can be found in [LongAlpaca-12k.json](https://huggingface.co/datasets/Yukang/LongAlpaca-12k).

### Get trainable weights in low-rank training
In low-rank training, we set embedding and normalization layers as trainable.
Please use the following line to extract the trainable weights `trainable_params.bin` from `pytorch_model.bin`:
```
python3 get_trainable_weights.py --checkpoint_path path_to_saving_checkpoints --trainable_params "embed,norm"
```

### Merge LoRA Weight
Merge the LoRA weights of `pytorch_model.bin` and the trainable parameters `trainable_params.bin`, and save the resulting model to your desired path in the Hugging Face format:
```
python3 merge_lora_weights_and_save_hf_model.py \
        --base_model path_to/Llama-2-7b-hf \
        --peft_model path_to_saving_checkpoints \
        --context_size 8192 \
        --save_path path_to_saving_merged_model
```
For example,
```
python3 merge_lora_weights_and_save_hf_model.py \
        --base_model /dataset/pretrained-models/Llama-2-7b-hf \
        --peft_model /dataset/yukangchen/hf_models/lora-models/Llama-2-7b-longlora-8k \
        --context_size 8192 \
        --save_path /dataset/yukangchen/models/Llama-2-7b-longlora-8k-merged
```

## Evaluation

### Perplexity Validation
To evaluate a model that is trained in the low-rank setting, please set both `base_model` and `peft_model`. `base_model` is the pre-trained weight. `peft_model` is the path to the saved checkpoint, which should contain `trainable_params.bin`, `adapter_model.bin` and `adapter_config.json`. For example,
```
python3 eval.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to/Llama-2-7b-hf --peft_model path_to_saving_checkpoints --data_path pg19/test.bin
```

To evaluate a model that is fully fine-tuned, you only need to set `base_model` as the path to the saved checkpoint, which should contain `pytorch_model.bin` and `config.json`. `peft_model` should be ignored.
```
python3 eval.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to_saving_checkpoints --data_path pg19/test.bin
```

- Note that `--seq_len` sets the sequence length for evaluation and `--context_size` sets the context length of the model during fine-tuning. `--seq_len` should not be larger than `--context_size`.
- We have already tokenized the validation and test splits of the PG19 and proof-pile datasets into `pg19/validation.bin`, `pg19/test.bin`, and `proof-pile/test_sampled_data.bin`, with the tokenizer of LLaMA. `proof-pile/test_sampled_data.bin` contains 128 documents that are randomly sampled from the total proof-pile test split. Each document has at least 32768 tokens. We also release the sampled ids in [proof-pile/test_sampled_ids.bin](https://drive.google.com/file/d/1cnzWODLRQYAd7HeugzLCIhaqzaLZv7J5/view?usp=share_link). You can download them from the links below.

| Dataset | Split | Link |
|:-----------|------------|------|
| PG19 | validation | [pg19/validation.bin](https://drive.google.com/file/d/1rbJvb0qRIf2mQoN2ON7S93TbTzMnlrN6/view?usp=share_link) |
| PG19 | test | [pg19/test.bin](https://drive.google.com/file/d/1QANDMdctpacPAYgS04adDXqByGEq-Ret/view?usp=share_link) |
| Proof-pile | test | [proof-pile/test_sampled_data.bin](https://drive.google.com/file/d/1bUI5lPDvrqzY_XXJJ2sSuvZx0Y9AZClE/view?usp=share_link) |

### Passkey Retrieval
We provide a way to test the passkey retrieval accuracy. For example,
```
python3 passkey_retrivial.py \
        --context_size 32768 \
        --base_model path_to/Llama-2-7b-longlora-32k \
        --max_tokens 32768 \
        --interval 1000
```
- Note that `context_size` is the context length during fine-tuning.
- `max_tokens` is the maximum document length in the passkey retrieval evaluation.
- `interval` is the step by which the document length increases. It is a rough number because the document grows by whole sentences.
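
To make the passkey task concrete, here is a minimal sketch of how such a test document can be assembled and scored. It is an illustration under stated assumptions, not the logic of `passkey_retrivial.py`: the filler text, the prompt wording, the use of word counts as a stand-in for tokens, and the function names are all hypothetical.

```python
import random
from typing import Sequence

# Hypothetical filler text; the real script may use different wording.
FILLER = "The grass is green. The sky is blue. The sun is yellow."


def build_passkey_prompt(target_words: int, passkey: int, rng: random.Random) -> str:
    """Hide `passkey` at a random position inside roughly `target_words` words of filler."""
    words_per_filler = len(FILLER.split())
    # The document grows by whole filler units, so its length only
    # approximates the requested size (hence a "rough" interval).
    n_fillers = max(1, target_words // words_per_filler)
    needle = f"The pass key is {passkey}. Remember it."
    parts = [FILLER] * n_fillers
    parts.insert(rng.randint(0, n_fillers), needle)
    return " ".join(parts + ["What is the pass key?"])


def passkey_accuracy(answers: Sequence[str], passkeys: Sequence[int]) -> float:
    """Fraction of model answers that contain the correct passkey."""
    hits = sum(str(key) in answer for answer, key in zip(answers, passkeys))
    return hits / len(passkeys)
```

Sweeping `target_words` in fixed steps up to a maximum mirrors the roles of `interval` and `max_tokens` above, with accuracy reported per document length.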
## Demo

### Local Inference
To chat with [Llama-2-13b-chat-longlora-32k-sft](https://huggingface.co/Yukang/Llama-2-13b-chat-longlora-32k-sft) or [Llama-2-70b-chat-longlora-32k-sft](https://huggingface.co/Yukang/Llama-2-70b-chat-longlora-32k-sft), you need to run `merge_lora_weights_and_save_hf_model.py` first, and then:
```
python3 inference.py \
        --base_model path_to_model \
        --question $question \
        --context_size $context_length \
        --max_gen_len $max_gen_len \
        --flash_attn True \
        --material $material_content \
        --material_type $material_type \
        --material_title $material_title
```
To ask a question related to a book:
```
python3 inference.py \
        --base_model /data/models/Llama-2-13b-chat-longlora-32k-sft \
        --question "Why doesn't Professor Snape seem to like Harry?" \
        --context_size 32768 \
        --max_gen_len 512 \
        --flash_attn True \
        --material "materials/Harry Potter and the Philosophers Stone_section2.txt" \
        --material_type "book" \
        --material_title "Harry Potter and the Philosophers Stone"
```
Note that you can omit `material_type` or `material_title`.

To ask a question related to a paper:
```
python3 inference.py \
        --base_model /data/models/Llama-2-13b-chat-longlora-32k-sft \
        --question "What are the main contributions and novelties of this work?" \
        --context_size 32768 \
        --max_gen_len 512 \
        --flash_attn True \
        --material "materials/paper1.txt" \
        --material_type "paper"
```

### Online Demo
To deploy your own demo, run
```
python3 demo.py \
        --base_model path_to_model \
        --context_size $context_size \
        --max_gen_len $max_gen_len \
        --flash_attn True
```
Example
```
python3 demo.py \
        --base_model /data/models/Llama-2-13b-chat-longlora-32k-sft \
        --context_size 32768 \
        --max_gen_len 512 \
        --flash_attn True
```
- Note that `flash_attn=True` makes generation slower but saves a lot of GPU memory.

## Data Generation via Pdf2text
During our dataset collection, we converted papers and books from PDF to text. The conversion quality has a large influence on the final model quality.
We think that this step is non-trivial. We release the tool for the pdf2txt conversion in the folder `pdf2txt`. It is built upon `pdf2image`, `easyocr`, `ditod` and `detectron2`. Please refer to the [README.md](pdf2txt/README.md) in `pdf2txt` for more details.

## Citation
If you find this project useful in your research, please consider citing:

```
@article{longlora,
  title={LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models},
  author={Yukang Chen and Shengju Qian and Haotian Tang and Xin Lai and Zhijian Liu and Song Han and Jiaya Jia},
  journal={arXiv:2309.12307},
  year={2023}
}
```

```
@misc{long-alpaca,
  author = {Yukang Chen and Shaozuo Yu and Shengju Qian and Haotian Tang and Xin Lai and Zhijian Liu and Song Han and Jiaya Jia},
  title = {Long Alpaca: Long-context Instruction-following models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/dvlab-research/LongLoRA}},
}
```

## Acknowledgement
- This work is built upon [LLaMA2](https://ai.meta.com/llama) as the pre-trained models.
- This work can also be built upon [GPTNeoX-HF](https://huggingface.co/docs/transformers/model_doc/gpt_neox), which is based upon [EleutherAI/GPTNeoX](https://github.com/EleutherAI/gpt-neox), as the pre-trained model architecture.
- This work is based on [DeepSpeed](https://github.com/microsoft/DeepSpeed), [peft](https://github.com/huggingface/peft), and [Flash-Attention2](https://github.com/Dao-AILab/flash-attention) for acceleration.
- Some evaluation code is modified from [Landmark Attention](https://github.com/epfml/landmark-attention).
- We use [LongChat](https://github.com/DachengLi1/LongChat) for the retrieval evaluation.

## License
- LongLoRA is licensed under the Apache License 2.0. This means that it requires the preservation of copyright and license notices.
- Data and weights are under the CC-BY-NC 4.0 License. They are licensed for research use only, and only non-commercial use is allowed.
Models trained using the dataset should not be used outside of research purposes.
Provider:
Yukang
Summary of original information

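LongLoRA and LongAlpaca Dataset Overview

Dataset information

  • Name: LongAlpaca-12k
  • Description: LongAlpaca-12k contains 9k long QA samples and 3k short QA samples; the short QA samples are drawn from the original Alpaca data.
  • Format: follows the Alpaca data format, with `instruction` and `output` fields.
  • Download link: [LongAlpaca-12k](https://huggingface.co/datasets/Yukang/LongAlpaca-12k)

Dataset structure

| Data | Short QA | Long QA | Total | Download |
|:---------------|----------|----------|----------|----------|
| LongAlpaca-12k | 3k | 9k | 12k | [Link](https://huggingface.co/datasets/Yukang/LongAlpaca-12k) |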

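Model information

Supervised fine-tuned models

| Model | Size | Context | Train | Link |
|:---------------|------|---------|---------|------|
| LongAlpaca-7B | 7B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/LongAlpaca-7B) |
| LongAlpaca-13B | 13B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/LongAlpaca-13B) |
| LongAlpaca-70B | 70B | 32768 | LoRA+ | [Model](https://huggingface.co/Yukang/LongAlpaca-70B) [(LoRA-weight)](https://huggingface.co/Yukang/LongAlpaca-70B-lora) |

Context-extension models (full fine-tuning)

| Model | Size | Context | Train | Link |
|:----------------------------|------|---------|---------|------|
| Llama-2-7b-longlora-8k-ft | 7B | 8192 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-8k-ft) |
| Llama-2-7b-longlora-16k-ft | 7B | 16384 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-16k-ft) |
| Llama-2-7b-longlora-32k-ft | 7B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-32k-ft) |
| Llama-2-7b-longlora-100k-ft | 7B | 100000 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft) |
| Llama-2-13b-longlora-8k-ft | 13B | 8192 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-13b-longlora-8k-ft) |
| Llama-2-13b-longlora-16k-ft | 13B | 16384 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-13b-longlora-16k-ft) |
| Llama-2-13b-longlora-32k-ft | 13B | 32768 | Full FT | [Model](https://huggingface.co/Yukang/Llama-2-13b-longlora-32k-ft) |

LoRA fine-tuned models

| Model | Size | Context | Train | Link |
|:------------------------------|------|---------|-------|------|
| Llama-2-7b-longlora-8k | 7B | 8192 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-7b-longlora-8k) |
| Llama-2-7b-longlora-16k | 7B | 16384 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-7b-longlora-16k) |
| Llama-2-7b-longlora-32k | 7B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-7b-longlora-32k) |
| Llama-2-13b-longlora-8k | 13B | 8192 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-8k) |
| Llama-2-13b-longlora-16k | 13B | 16384 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-16k) |
| Llama-2-13b-longlora-32k | 13B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-32k) |
| Llama-2-13b-longlora-64k | 13B | 65536 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-13b-longlora-64k) |
| Llama-2-70b-longlora-32k | 70B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-70b-longlora-32k) |
| Llama-2-70b-chat-longlora-32k | 70B | 32768 | LoRA+ | [LoRA-weight](https://huggingface.co/Yukang/Llama-2-70b-chat-longlora-32k) |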

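Training information

Pre-trained weights

  • Llama-2-7b-hf: [Link](https://huggingface.co/meta-llama/Llama-2-7b-hf)
  • Llama-2-13b-hf: [Link](https://huggingface.co/meta-llama/Llama-2-13b-hf)
  • Llama-2-70b-hf: [Link](https://huggingface.co/meta-llama/Llama-2-70b-hf)
  • Llama-2-7b-chat-hf: [Link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
  • Llama-2-13b-chat-hf: [Link](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
  • Llama-2-70b-chat-hf: [Link](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)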

Fine-tuning command

```bash
torchrun --nproc_per_node=8 fine-tune.py \
        --model_name_or_path path_to/Llama-2-7b-hf \
        --bf16 True \
        --output_dir path_to_saving_checkpoints \
        --cache_dir path_to_cache \
        --model_max_length 8192 \
        --use_flash_attn True \
        --low_rank_training False \
        --num_train_epochs 1 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 2 \
        --learning_rate 2e-5 \
        --weight_decay 0.0 \
        --warmup_steps 20 \
        --lr_scheduler_type "constant_with_warmup" \
        --logging_steps 1 \
        --deepspeed "ds_configs/stage2.json" \
        --tf32 True \
        --max_steps 1000
```
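
The batch-related flags in the fine-tuning command combine multiplicatively. The short sketch below works through that arithmetic; it assumes a single node and that every sequence is filled to `model_max_length`, and the function names are illustrative only.

```python
def effective_batch(nproc_per_node: int, per_device_batch: int,
                    grad_accum: int, nodes: int = 1) -> int:
    """Sequences consumed per optimizer step across all processes."""
    return nodes * nproc_per_node * per_device_batch * grad_accum


def max_tokens_per_step(sequences: int, model_max_length: int) -> int:
    """Upper bound on tokens per optimizer step (all sequences at full length)."""
    return sequences * model_max_length


# Values taken from the fine-tuning command above.
seqs = effective_batch(nproc_per_node=8, per_device_batch=1, grad_accum=8)
tokens = max_tokens_per_step(seqs, model_max_length=8192)
print(seqs, tokens)  # 64 524288
```

With `--max_steps 1000`, a run under these assumptions processes at most 64,000 sequences. Changing the GPU count without adjusting `gradient_accumulation_steps` changes the effective batch size accordingly.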

Supervised fine-tuning command

```bash
torchrun --nproc_per_node=8 supervised-fine-tune.py \
        --model_name_or_path path_to_Llama2_chat_models \
        --bf16 True \
        --output_dir path_to_saving_checkpoints \
        --model_max_length 32768 \
        --use_flash_attn True \
        --data_path LongAlpaca-12k.json \
        --low_rank_training True \
        --num_train_epochs 3 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 1 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 2 \
        --learning_rate 2e-5 \
        --weight_decay 0.0 \
        --warmup_steps 20 \
        --lr_scheduler_type "constant_with_warmup" \
        --logging_steps 1 \
        --deepspeed "ds_configs/stage2.json" \
        --tf32 True
```

Merge LoRA weights command

```bash
python3 merge_lora_weights_and_save_hf_model.py \
        --base_model path_to/Llama-2-7b-hf \
        --peft_model path_to_saving_checkpoints \
        --context_size 8192 \
        --save_path path_to_saving_merged_model
```
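
For readers unfamiliar with what merging a LoRA adapter generally involves, the sketch below shows the standard LoRA update on a single weight matrix: the low-rank product `B @ A`, scaled by `alpha / r`, is added to the frozen weight. It is a pure-Python illustration with toy matrices, not the logic of `merge_lora_weights_and_save_hf_model.py`, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of x (m x k) and y (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A for one layer.

    w: frozen weight, shape (d_out, d_in)
    a: LoRA down-projection, shape (r, d_in)
    b: LoRA up-projection, shape (d_out, r)
    """
    r = len(a)                 # LoRA rank
    scale = alpha / r
    delta = matmul(b, a)       # low-rank update, shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


merged = merge_lora(
    w=[[1.0, 0.0], [0.0, 1.0]],
    a=[[1.0, 2.0]],            # rank r = 1
    b=[[1.0], [3.0]],
    alpha=2.0,
)
print(merged)  # [[3.0, 4.0], [6.0, 13.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved as an ordinary Hugging Face checkpoint.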

Evaluation information

Perplexity validation

```bash
python3 eval.py --seq_len 8192 --context_size 8192 --batch_size 1 --base_model path_to/Llama-2-7b-hf --peft_model path_to_saving_checkpoints --data_path pg19/test.bin
```
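
Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below computes it from a list of token losses and adds a guard for the documented constraint that `--seq_len` should not exceed `--context_size`. It is a generic illustration, not the code in `eval.py`, and the function names are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), with losses in natural-log units."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def check_eval_lengths(seq_len: int, context_size: int) -> None:
    """Reject an evaluation length longer than the fine-tuned context."""
    if seq_len > context_size:
        raise ValueError("seq_len should not be larger than context_size")


check_eval_lengths(seq_len=8192, context_size=8192)
# A model that assigns probability 1/2 to every token has perplexity 2.
print(round(perplexity([math.log(2)] * 4), 6))  # 2.0
```

Lower perplexity means the model assigns higher probability to the held-out text, so values are only comparable when measured on the same tokenized data and sequence length.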

Passkey retrieval

```bash
python3 passkey_retrivial.py \
        --context_size 32768 \
        --base_model path_to/Llama-2-7b-longlora-32k \
        --max_tokens 32768 \
        --interval 1000
```

Demo information

Local inference

```bash
python3 inference.py \
        --base_model path_to_model \
        --question $question \
        --context_size $context_length \
        --max_gen_len $max_gen_len \
        --flash_attn True \
        --material $material_content \
        --material_type $material_type \
        --material_title $material_title
```

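Collected summary
Dataset introduction
Construction
In research on long-context large language models, LongAlpaca-12k was built with careful attention to high-quality instruction-following data. The dataset contains 12,000 samples. Of these, 9,000 are long QA samples collected by the authors to cover diverse long-text understanding tasks, such as answering questions after reading a book section or an academic paper. The remaining 3,000 are sampled from the original Alpaca data to preserve the model's performance on short instruction following. The data follows the Alpaca format and uses only the instruction and output fields, which simplifies the structure and keeps the data consistent and easy to use.
Usage
The dataset is mainly used in the supervised fine-tuning stage to strengthen a large language model's long-context instruction-following ability. Researchers can load it with the provided scripts and train efficiently with LongLoRA, using either full fine-tuning or low-rank adaptation. Before use, set up the environment, for example by installing dependencies such as Flash-Attention, and adjust the training parameters as instructed. The dataset can be downloaded directly from Hugging Face and combined with pre-trained models to cover the full pipeline from context extension to instruction fine-tuning, providing a reliable data foundation for long-text understanding and generation tasks.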
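Background and challenges
Background overview
As large language models (LLMs) develop rapidly, the ability to handle long context sequences has become key to their practical usefulness. In 2023, a team of researchers including Yukang Chen released the LongAlpaca-12k dataset as an important part of the LongLoRA project. The dataset targets long-context instruction following: it combines 9k long QA samples with 3k short QA samples drawn from the original Alpaca data into a 12k-sample instruction fine-tuning dataset. Its core research question is how to extend the context window of an LLM efficiently while enabling it to understand and respond accurately to complex instructions over long documents. The work advances efficient fine-tuning techniques for long-context models and provides data resources and model baselines for later research.
Current challenges
The core challenge for long-context language models is to maintain coherent semantic understanding and accurate instruction following over very long input sequences. The computational cost of conventional attention grows quadratically with sequence length, which makes training and inference expensive. When building the dataset, the researchers had to overcome the scarcity of high-quality long-text instruction data, ensure data diversity and task coverage, and balance the ratio of long to short QA samples so that the model does not degrade on short instruction tasks. In addition, data generation and annotation must preserve the semantic integrity of large volumes of text and use an effective prompt format suited to fine-tuning.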
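Common scenarios
Classic use cases
In research on long-context large language models, LongAlpaca-12k is described as the first open-source instruction-following dataset at the scale of tens of thousands of words per sample. Its classic use is training and evaluating a model's ability to process very long text sequences. By combining 9k long QA samples with 3k short QA samples, the dataset supports question answering after reading complex material such as book sections and academic papers. This design tests how well a model retains information and reasons under an extended context window, and it provides a data basis for standardized evaluation of long-text understanding.
Academic problems addressed
The dataset addresses a core academic challenge in long-context processing: conventional models are limited by a short context window and struggle to maintain long-range semantic coherence. By providing structured long instruction-following data, it lets researchers study questions such as position-encoding optimization and attention-mechanism extension in a systematic way. It eases the open-source community's shortage of high-quality supervised long-text data, offers a benchmark for validating and comparing efficient long-context fine-tuning methods, and supports reproducible research on long-text understanding.
Practical applications
Models trained on LongAlpaca-12k are suited to complex document analysis. They can be deployed in literature-review systems that parse academic papers tens of thousands of words long and extract their core contributions. In legal document analysis, they can locate related clauses that span multiple pages. They can also be integrated into enterprise knowledge-base question-answering platforms for in-depth semantic queries over long technical manuals and product documentation. This long-text understanding ability raises the level of automation in specialized information processing.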
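Recent research on the dataset
Latest research directions
Extending context-handling ability has become a central topic of frontier research on large language models. LongAlpaca-12k is an important resource of long-context instruction-following data, built to address performance bottlenecks in long-document understanding and generation. It combines 9k long QA samples with 3k short QA samples, balances model performance on long and short instructions, and supports supervised fine-tuning of long-context models. Current research focuses on efficient fine-tuning techniques such as LongLoRA, which extend the context window of pre-trained models to tens of thousands or even a hundred thousand tokens at relatively low computational cost, enabling applications in complex document analysis, multi-turn dialogue, and knowledge-intensive tasks. This progress improves the practicality and efficiency of long-text processing and gave the open-source community what the authors describe as the first open-sourced long-context 70B model.
The content above was collected and summarized by 遇见数据集.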