AmberDatasets

Name: AmberDatasets
Creator: maas
Published: 2025-12-05 12:05:49
License: No description available

ModelScope Community: updated 2025-12-05, indexed 2025-04-12

Download link:

https://modelscope.cn/datasets/LLM360/AmberDatasets

Description:

# Amber-Data

<img src="amber_logo.png" alt="amber logo" width="300"/>

This dataset contains the fully prepared data sequence used to train Amber, an LLM360 model.

## About LLM360

LLM360 is an initiative for comprehensive and fully open-sourced LLMs, where all training details, model checkpoints, intermediate results, and additional analyses are made available to the community. Our goal is to advance the field by inviting the community to deepen the understanding of LLMs together. As the first step of the LLM360 project, we release all intermediate model checkpoints, our fully prepared pre-training dataset, all source code and configurations, and training details. We are committed to continually pushing the boundaries of LLMs through this open-source effort.

Get access now at the [LLM360 site](https://www.llm360.ai/).

## Data Description

- **Data Format:** 360 tokenized data chunks; each instance has 2049 token indexes.
- **License:** Apache 2.0
- **Resources for more information:**
  - [Code to produce data](https://github.com/LLM360/amber-data-prep)
  - [Amber Model](https://huggingface.co/LLM360/Amber)

## DataMix

The Amber dataset uses the following data mix.

| Subset | Tokens (Billion) |
| ----------- | ----------- |
| Arxiv | 30.00 |
| Book | 28.86 |
| C4 | 197.67 |
| Refined-Web | 665.01 |
| StarCoder | 291.92 |
| StackExchange | 21.75 |
| Wikipedia | 23.90 |
| Total | 1259.13 |

# Loading Amber's Pretraining Data

Below is an example of how to download, sample, and detokenize any subset of AmberDatasets corresponding to an Amber checkpoint. Set `CHECKPOINT_NUM` to the subset you are interested in (0-359) and point `CHECKPOINT_PATH` to the local checkpoint folder.

```python
import random

from datasets import load_dataset
from transformers import AutoTokenizer

CHECKPOINT_NUM = 0  # Pretraining data chunk for this checkpoint (0-359)
NUM_SAMPLES = 10  # Number of random samples to decode
CHECKPOINT_PATH = "/path/to/ckpt_000/"  # Local path to an Amber checkpoint

# Download only the chunk that corresponds to the chosen checkpoint.
dataset = load_dataset(
    "LLM360/AmberDatasets",
    data_files=f"train/train_{CHECKPOINT_NUM:03}.jsonl",
    split=None,
)

# The tokenizer is loaded from the local checkpoint folder.
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_PATH)

# random.sample draws distinct indices, so exactly NUM_SAMPLES rows are decoded.
samples = set(random.sample(range(len(dataset["train"])), k=NUM_SAMPLES))

for i, line in enumerate(dataset["train"]):
    if i in samples:
        tokens = line["token_ids"]
        print(f"{i}:{tokenizer.decode(tokens)}")
```

# License

We release our work under [ODC-BY](https://opendatacommons.org/licenses/by/1-0/), hence granting the rights over the dataset, but not the contents of the dataset individually.

# Citation

To cite LLM360, you can cite the following:

```
@misc{liu2023llm360,
      title={LLM360: Towards Fully Transparent Open-Source LLMs},
      author={Zhengzhong Liu and Aurick Qiao and Willie Neiswanger and Hongyi Wang and Bowen Tan and Tianhua Tao and Junbo Li and Yuqi Wang and Suqi Sun and Omkar Pangarkar and Richard Fan and Yi Gu and Victor Miller and Yonghao Zhuang and Guowei He and Haonan Li and Fajri Koto and Liping Tang and Nikhil Ranjan and Zhiqiang Shen and Xuguang Ren and Roberto Iriondo and Cun Mu and Zhiting Hu and Mark Schulze and Preslav Nakov and Tim Baldwin and Eric P. Xing},
      year={2023},
      eprint={2312.06550},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

If you only use the original datasets, please cite the original datasets.
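
# Worked Example: Data Mix Proportions

The DataMix table and the data format figures (360 chunks, 2049 token indexes per instance) can be turned into rough proportions and sizes. The following is a minimal sketch, not part of the official tooling: the function names `mix_shares` and `estimated_instances_per_chunk` are hypothetical, and the per-chunk estimate assumes that the stated total is spread evenly over equally sized chunks, which the dataset card does not confirm. Note that the rounded rows of the table sum to 1259.11, slightly below the stated total of 1259.13, presumably because of rounding.

```python
# Token counts in billions, copied from the DataMix table.
DATA_MIX_BILLIONS = {
    "Arxiv": 30.00,
    "Book": 28.86,
    "C4": 197.67,
    "Refined-Web": 665.01,
    "StarCoder": 291.92,
    "StackExchange": 21.75,
    "Wikipedia": 23.90,
}
STATED_TOTAL_BILLIONS = 1259.13  # "Total" row of the table
NUM_CHUNKS = 360                 # tokenized data chunks
TOKENS_PER_INSTANCE = 2049       # token indexes per instance


def mix_shares(mix):
    """Return each subset's fraction of the summed token count."""
    total = sum(mix.values())
    return {name: tokens / total for name, tokens in mix.items()}


def estimated_instances_per_chunk(
    total_billions,
    num_chunks=NUM_CHUNKS,
    tokens_per_instance=TOKENS_PER_INSTANCE,
):
    """Rough instance count per chunk, ASSUMING equally sized chunks."""
    total_instances = total_billions * 1e9 / tokens_per_instance
    return total_instances / num_chunks


if __name__ == "__main__":
    shares = mix_shares(DATA_MIX_BILLIONS)
    # Print subsets from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:<14}{share:6.1%}")
    row_sum = sum(DATA_MIX_BILLIONS.values())
    print(f"Sum of rows: {row_sum:.2f}B (stated total: {STATED_TOTAL_BILLIONS}B)")
    per_chunk = estimated_instances_per_chunk(STATED_TOTAL_BILLIONS)
    print(f"Estimated instances per chunk: {per_chunk:,.0f}")
```

Under these assumptions, web text (Refined-Web and C4) accounts for roughly two thirds of the tokens, and each chunk would hold on the order of 1.7 million instances. The actual row count of a downloaded chunk is the authoritative figure.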


Provider:

maas

Created:

2025-04-07
