LLM360/AmberDatasets
Amber-Data
Data Description
Data Mix
The Amber dataset uses the following data mix:
| Subset | Tokens (Billion) |
|---|---|
| Arxiv | 30.00 |
| Book | 28.86 |
| C4 | 197.67 |
| Refined-Web | 665.01 |
| StarCoder | 291.92 |
| StackExchange | 21.75 |
| Wikipedia | 23.90 |
| Total | 1259.13 |
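To make the relative weight of each subset easier to see, the token counts in the table can be turned into shares of the whole mix. The following is a minimal sketch; the names `TOKENS_B`, `REPORTED_TOTAL_B`, and `mix_shares` are hypothetical and are not part of the dataset or any library.

```python
# Token counts in billions, copied from the data mix table.
TOKENS_B = {
    "Arxiv": 30.00,
    "Book": 28.86,
    "C4": 197.67,
    "Refined-Web": 665.01,
    "StarCoder": 291.92,
    "StackExchange": 21.75,
    "Wikipedia": 23.90,
}
REPORTED_TOTAL_B = 1259.13  # Total row as listed in the table


def mix_shares(tokens):
    """Return each subset's fraction of the summed token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


shares = mix_shares(TOKENS_B)

# Print subsets from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{share:6.1%}")

# Compare the sum of the rounded rows with the listed total.
print(f"sum of rows: {sum(TOKENS_B.values()):.2f}B, listed total: {REPORTED_TOTAL_B:.2f}B")
```

Refined-Web makes up a little over half of the mix. The rounded rows sum to 1259.11 billion, slightly below the listed total of 1259.13 billion, presumably because each row was rounded independently.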
Loading Amber's Pretraining Data
Below is example code showing how to download, sample, and decode any subset of the Amber dataset:
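```python
import random

from transformers import AutoTokenizer
from datasets import load_dataset

CHECKPOINT_NUM = 0  # Checkpoint number of the pretraining data to load
NUM_SAMPLES = 10  # Number of random samples to decode
CHECKPOINT_PATH = "/path/to/ckpt_000/"  # Local path to an Amber checkpoint

# Download the token file for the chosen checkpoint.
dataset = load_dataset(
    "LLM360/AmberDatasets",
    data_files=f"train/train_{CHECKPOINT_NUM:03}.jsonl",
    split=None,
)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_PATH)

# Pick distinct row indices; random.sample draws without replacement,
# so exactly NUM_SAMPLES rows are decoded.
samples = set(random.sample(range(len(dataset["train"])), k=NUM_SAMPLES))

# Decode and print only the selected rows.
for i, line in enumerate(dataset["train"]):
    if i in samples:
        tokens = line["token_ids"]
        print(f"{i}:{tokenizer.decode(tokens)}")
```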
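The loading example relies on two small pieces of logic that can be checked offline: the zero-padded file name for a checkpoint and the choice of random row indices. The sketch below isolates both. The helper names `data_file_for` and `pick_sample_indices` are hypothetical, and the fixed seed is an assumption added only to make runs reproducible.

```python
import random


def data_file_for(checkpoint_num):
    """Build the per-checkpoint file path in the form used by the loading example."""
    # The :03 format spec zero-pads the checkpoint number to three digits.
    return f"train/train_{checkpoint_num:03}.jsonl"


def pick_sample_indices(num_rows, num_samples, seed=0):
    """Choose distinct row indices to decode, never more than there are rows."""
    rng = random.Random(seed)  # Local generator, so global random state is untouched
    k = min(num_samples, num_rows)
    # rng.sample draws without replacement, so all k indices are distinct.
    return set(rng.sample(range(num_rows), k))


print(data_file_for(0))
print(sorted(pick_sample_indices(num_rows=1000, num_samples=10)))
```

Drawing without replacement keeps the number of decoded rows equal to the requested sample count, and capping `k` at the row count avoids an error on very small files.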
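License
This work is released under the ODC-BY license, which grants rights over the dataset as a whole but not over the individual contents of the dataset.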
Citation
To cite LLM360, please use the following citation:
```bibtex
@misc{liu2023llm360,
  title={LLM360: Towards Fully Transparent Open-Source LLMs},
  author={Zhengzhong Liu and Aurick Qiao and Willie Neiswanger and Hongyi Wang and Bowen Tan and Tianhua Tao and Junbo Li and Yuqi Wang and Suqi Sun and Omkar Pangarkar and Richard Fan and Yi Gu and Victor Miller and Yonghao Zhuang and Guowei He and Haonan Li and Fajri Koto and Liping Tang and Nikhil Ranjan and Zhiqiang Shen and Xuguang Ren and Roberto Iriondo and Cun Mu and Zhiting Hu and Mark Schulze and Preslav Nakov and Tim Baldwin and Eric P. Xing},
  year={2023},
  eprint={2312.06550},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
If you use only the original datasets, please cite the original datasets.




