dolma

Name: dolma
Creator: maas
Published: 2026-05-05 06:59:06
License: No description available

ModelScope community; updated 2026-05-05; indexed 2024-06-08

Download link:

https://modelscope.cn/datasets/swift/dolma

Resource description:

# Dolma

<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://raw.githubusercontent.com/allenai/dolma/main/docs/assets/AI2_Blog_1400x685_2x.webp" width="100%">

Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

More information:

- Read the Dolma **manuscript** and its **Data Sheet** [on ArXiv](https://arxiv.org/abs/2402.00159);
- Explore the [**open source tools**](https://github.com/allenai/dolma) we created to curate Dolma.
- Want to request removal of personal data? Use [this form](https://forms.gle/q4BNUUxUxKwKkfdT6) to notify us of documents containing PII about a specific user.

To learn more about the toolkit used to create Dolma, including how to replicate this dataset, head over to our [GitHub project page](https://github.com/allenai/dolma/tree/main/docs)!

**2024-04-17: Dolma v1.7 Release.** We have released an updated version of Dolma that we used to train our latest [OLMo 7B-v1.7](https://huggingface.co/allenai/OLMo-7b-v1.7) model.

**2024-04-15: License Change.** We have updated the license of Dolma to [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). Please see this [blog post](https://blog.allenai.org/making-a-switch-dolma-moves-to-odc-by-8f0e73852f44) for more information.

## Versions

At the moment, there are six versions of Dolma available:

| **Version** | **Default?** | **Release Date** | **Size** (gzip) | **Description** |
|--|:--:|--|--|--|
| `v1_7` | ✅ | 2024-04-15 | 4.5 TB | Used to train [OLMo-7B-v1.7](https://huggingface.co/allenai/OLMo-7b-v1.7). New sources, more quality filtering, fuzzy deduplication. |
| `v1_6` | | 2024-01-31 | 5.4 TB | An update to v1.5 with some deduplication of documents with too few tokens or too many repeated n-grams. |
| `v1_6-sample` | | 2024-01-31 | 16.4 GB | A smaller sample of Dolma, with roughly 10 billion tokens. Useful for data exploration. |
| `v1_5` | | 2023-10-31 | 6.4 TB | Used to train [OLMo-1B](https://huggingface.co/allenai/OLMo-1B). Roughly 3 trillion tokens. |
| `v1_5-sample` | | 2023-10-31 | 2.9 TB | A sample of roughly 1.9 trillion tokens used to train [OLMo-7B](https://huggingface.co/allenai/OLMo-7B). |
| `v1` | | 2023-08-18 | 6.0 TB | The first version of Dolma. |

## Summary Statistics (v1.7)

| **Source** | **Provenance** | **New?** | **Documents** (millions) | **OLMo tokens** (billions) | **Sample Proportion** | **Cutoff Date** | **Processing** |
|--|--|--|--|--|--|--|--|
| Dolma's CC | [Common Crawl](https://commoncrawl.org/) via Dolma v1.6 | Updated | 875.2 | 1,195.5 | 50% | Mar 2023 | Extracted using the Dolma pipeline; new quality filtering and deduplication steps. |
| Refined Web | [Refined Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | Yes | 664.0 | 456.4 | 100% | Feb 2023 | Filtered using the Dolma pipeline; new quality filtering and deduplication steps. |
| StarCoder | [StarCoder](https://huggingface.co/blog/starcoder) | Yes | 206.6 | 263.8 | 100% | May 2023 | No further processing. |
| C4 | [C4](https://huggingface.co/datasets/c4) via Dolma v1.6 | Updated | 249.9 | 138.4 | 50% | Apr 2019 | Filtered using the Dolma pipeline; new quality filtering and deduplication steps. |
| Reddit | [PushShift API](https://github.com/pushshift/api) | Updated | 377.4 | 79.9 | 100% | Mar 2023 | Extracted using the Dolma pipeline; new quality filtering and deduplication steps. |
| Semantic Scholar ([S2ORC](https://aclanthology.org/2020.acl-main.447/) & [S2AG](https://www.semanticscholar.org/product/api)) | [peS2o](https://huggingface.co/datasets/allenai/peS2o) via Dolma v1.6 | No | 38.8 | 57.2 | 100% | Mar 2023 | Same as Dolma v1.6 |
| arXiv | [RedPajama v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) | Yes | 1.5 | 28.0 | 100% | Mar 2023 | No further processing. |
| StackExchange | [RedPajama v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) | Yes | 29.3 | 19.6 | 100% | Mar 2023 | No further processing. |
| Flan | [Flan Collection](https://arxiv.org/abs/2301.13688), reproduced following the [original code](https://github.com/google-research/FLAN/tree/main/flan/v2), as performed by [Dettmers et al., (2023)](https://openreview.net/forum?id=OUIFPHEgJU) | Yes | 52.1 | 16.5 | 100% | Feb 2023 | After reproducing Flan, sampled to balance different Flan subsets. Reformatted for pretraining with newlines separating instruction and demonstration. |
| CC News | [Common Crawl](https://commoncrawl.org/blog/news-dataset-available) | Yes | 22.0 | 14.3 | 100% | Mar 2023 | Extracted using the Dolma pipeline; new quality filtering and deduplication steps. |
| OpenWebMath | [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) via [Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | Yes | 2.9 | 12.6 | 100% | May 2023 | Training subset; no further processing. |
| Algebraic Stack | [Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | Yes | 2.8 | 12.6 | 100% | Oct 2023 | Training subset; no further processing. |
| Project Gutenberg | [Project Gutenberg](https://www.gutenberg.org) via Dolma v1.6 | No | 0.0556 | 5.3 | 100% | Mar 2023 | Same as Dolma v1.6 |
| MegaWika | [MegaWika](https://huggingface.co/datasets/hltcoe/megawika) | Yes | 3.2 | 4.6 | 100% | Jul 2023 | English web pages cited from Wikipedia; curated using the full Dolma pipeline. |
| Wikipedia & Wikibooks | [Wikimedia](https://dumps.wikimedia.org) via Dolma v1.6 | No | 6.2 | 3.7 | 200% | Mar 2023 | Same as Dolma v1.6 |
| **Total** | | | **2,532.0** | **2,308.5** | **1,715.1** | **Oct 2023** | |

(Only a subset of the total data was used to train OLMo 7B-v1.7. The token counts above are based on the full dataset; taking the sample proportions into account gives the token count actually used for training: 1.715 trillion tokens.)

## Summary Statistics (v1.6)

| **Source** | **Doc Type** | **UTF-8 bytes** (GB) | **Documents** (millions) | **Unicode words** (billions) | **Llama tokens** (billions) |
|--|--|--|--|--|--|
| Common Crawl | web pages | 9,022 | 3,370 | 1,775 | 2,281 |
| The Stack | code | 1,043 | 210 | 260 | 411 |
| C4 | web pages | 790 | 364 | 153 | 198 |
| Reddit | social media | 339 | 377 | 72 | 89 |
| PeS2o | STEM papers | 268 | 38.8 | 50 | 70 |
| Project Gutenberg | books | 20.4 | 0.056 | 4.0 | 6.0 |
| Wikipedia, Wikibooks | encyclopedic | 16.2 | 6.2 | 3.7 | 4.3 |
| **Total** | | **11,519** | **4,367** | **2,318** | **3,059** |

## Download

The fastest way to download Dolma is to clone this repository and use the files in the `urls` directory. We recommend using wget in parallel mode to download the files. For example:

```bash
DATA_DIR="<path_to_your_data_directory>"
PARALLEL_DOWNLOADS="<number_of_parallel_downloads>"
DOLMA_VERSION="<version_of_dolma_to_download>"

git clone https://huggingface.co/datasets/allenai/dolma
mkdir -p "${DATA_DIR}"

cat "dolma/urls/${DOLMA_VERSION}.txt" | xargs -n 1 -P "${PARALLEL_DOWNLOADS}" wget -q -P "$DATA_DIR"
```

Then, to load this data using HuggingFace's `datasets` library, you can use the following code:

```python
import os
from datasets import load_dataset

os.environ["DATA_DIR"] = "<path_to_your_data_directory>"
dataset = load_dataset("allenai/dolma", split="train")
```

### Licensing Information

We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this dataset, you are also bound by any license agreements and terms of use of the original data sources.
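To spot-check the files fetched by the download steps above without going through the `datasets` library, a small reader can be enough. The sketch below is illustrative only and rests on assumptions that this page does not confirm: that each downloaded shard is gzip-compressed text with one JSON object per line, and that the files end in `.json.gz`. The helper name `iter_documents` and the `pattern` parameter are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_documents(data_dir: str, pattern: str = "*.json.gz") -> Iterator[dict]:
    """Yield one parsed record per non-empty line from every matching shard."""
    # Sort the shard paths so the iteration order is reproducible.
    for shard in sorted(Path(data_dir).glob(pattern)):
        # Assumption: each shard is gzip-compressed, UTF-8 JSON Lines.
        with gzip.open(shard, mode="rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For instance, `next(iter_documents("<path_to_your_data_directory>"))` would return the first record of the first shard, which is a quick way to see which fields the records actually carry before committing to a full load.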
## Bibtex

If you use our dataset or tooling, please cite us at:

```bibtex
@article{dolma,
  title = {{Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and
    Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and
    Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and
    Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and
    Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
  },
  year = {2024},
  journal = {arXiv preprint},
}
```


Provider:

maas

Created:

2024-11-21
