Siheng99/Llama-3.1-8B-Instruct-SEALONG-Dataset
SEALONG Dataset Overview
Dataset Release
- Release date: 2024.11.10
- Contents: training and evaluation code, models, and the dataset.
Dataset Usage
Model Usage
    import transformers
    import torch

    model_id = "Siheng99/Llama-3.1-8B-Instruct-SEALONG"

    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )

    messages = [
        {"role": "user", "content": "Who are you?"},
    ]

    outputs = pipeline(
        messages,
        max_new_tokens=256,
    )
    print(outputs[0]["generated_text"][-1])
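With chat-style input, the text-generation pipeline returns the whole conversation as a list of role/content dictionaries, so the final `print` above shows a dictionary rather than plain text. The following is a minimal sketch of pulling out just the reply text. The helper name `last_reply` and the mock output are illustrative assumptions, not part of the SEALONG code.

```python
def last_reply(outputs):
    """Return the text of the final message in the first generated sequence."""
    generated = outputs[0]["generated_text"]
    # A plain-string prompt yields a str; chat input yields a list of
    # {"role": ..., "content": ...} dictionaries ending with the model's turn.
    if isinstance(generated, str):
        return generated
    return generated[-1]["content"]


# Mock structure shaped like the pipeline's chat output (hypothetical content).
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "(model reply would appear here)"},
        ]
    }
]
print(last_reply(mock_outputs))
```

With the real pipeline, `last_reply(outputs)` would be called in place of indexing `outputs` by hand.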
Data Usage
    from datasets import load_dataset

    dataset = load_dataset("Siheng99/Llama-3.1-8B-Instruct-SEALONG-Dataset")
    print(dataset)
    print(dataset["train"][0])
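Records in a long-context dataset can be very long, so printing one in full may flood the terminal. The sketch below shortens each field before display. It works on any dictionary-like record, such as the one returned by `dataset["train"][0]`. The helper `summarize_record` and the field names in the example are hypothetical; the actual field names are whatever `print(dataset)` reports.

```python
def summarize_record(record, max_chars=80):
    """Map each field name to (type name, shortened repr of its value)."""
    summary = {}
    for key, value in record.items():
        text = repr(value)
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        summary[key] = (type(value).__name__, text)
    return summary


# Hypothetical record used only to show the output shape.
example = {"field_a": "x" * 200, "field_b": [1, 2, 3]}
for name, (kind, preview) in summarize_record(example).items():
    print(f"{name}: {kind} = {preview}")
```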
Data Preparation
Synthesizing Data
Download MuSiQue
    cd data
    gdown "https://drive.google.com/uc?export=download&id=1tGdADlNjWFaHLeZZGShh2IRcpO6Lv24h"
    unzip musique_data_v1.0.zip -d musique && mv musique/data/* musique/
    rm -r musique/data && rm musique_data_v1.0.zip
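On systems without `unzip`, the extract-and-flatten step can be approximated with the Python standard library. This is a sketch under the assumption that the archive unpacks to a single inner `data/` folder, as the shell commands above imply. The function name `extract_and_flatten` is illustrative and is not part of the repository scripts.

```python
import shutil
import zipfile
from pathlib import Path


def extract_and_flatten(zip_path, dest_dir, inner="data", remove_zip=False):
    """Unzip zip_path into dest_dir, then move dest_dir/inner/* up into dest_dir."""
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)

    inner_dir = dest_dir / inner
    if inner_dir.is_dir():
        # Snapshot the listing first so moving entries does not disturb iteration.
        for item in list(inner_dir.iterdir()):
            shutil.move(str(item), str(dest_dir / item.name))
        inner_dir.rmdir()  # the folder is empty once everything has moved

    if remove_zip:
        zip_path.unlink()
    return sorted(p.name for p in dest_dir.iterdir())
```

Run from the `data` directory after the download, the equivalent call would be `extract_and_flatten("musique_data_v1.0.zip", "musique", remove_zip=True)`.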
Process MuSiQue
    bash scripts/process_data.sh
Synthesize Training Data
    bash scripts/synthesize.sh
Using Pre-synthesized Data
    from datasets import load_dataset

    dataset = load_dataset("Siheng99/Llama-3.1-8B-Instruct-SEALONG-Dataset")
    dataset.save_to_disk("/path/to/your/save_dir")
Citation
    @article{li2024large,
      title={Large Language Models Can Self-Improve in Long-context Reasoning},
      author={Li, Siheng and Yang, Cheng and Cheng, Zesen and Liu, Lemao and Yu, Mo and Yang, Yujiu and Lam, Wai},
      journal={arXiv preprint arXiv:2411.08147},
      year={2024}
    }




