Download link:

https://modelscope.cn/datasets/mlfoundations/dclm-baseline-1.0



Resource description:

## DCLM-baseline

DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.

Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime.

| Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---------------|--------|--------|---------------|----------|----------|----------|
| **Open weights, closed datasets** | | | | | | |
| Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ✗ | 57.5 | **71.9** | 50.5 |
| Llama3 | 8B | 15T | ✗ | 57.6 | 66.2 | 46.3 |
| Gemma | 8B | 6T | ✗ | 57.8 | 64.3 | 44.6 |
| Phi-3 | 7B | ? | ✗ | **61.0** | 69.9 | **57.9** |
| **Open weights, open datasets** | | | | | | |
| Falcon | 7B | 1T | ✓ | 44.1 | 27.4 | 25.1 |
| Amber | 7B | 1.2T | ✓ | 39.8 | 27.9 | 22.3 |
| Crystal | 7B | 1.2T | ✓ | 48.0 | 48.2 | 33.2 |
| OLMo-1.7 | 7B | 2.1T | ✓ | 47.0 | 54.0 | 34.2 |
| MAP-Neo | 7B | 4.5T | ✓ | **50.2** | **57.1** | **40.4** |
| **Models we trained** | | | | | | |
| FineWeb edu | 7B | 0.14T | ✓ | 38.7 | 26.3 | 22.1 |
| FineWeb edu | 7B | 0.28T | ✓ | 41.9 | 37.3 | 24.5 |
| **DCLM-BASELINE** | 7B | 0.14T | ✓ | 44.1 | 38.3 | 25.0 |
| **DCLM-BASELINE** | 7B | 0.28T | ✓ | 48.9 | 50.8 | 31.8 |
| **DCLM-BASELINE** | 7B | 2.6T | ✓ | **57.1** | **63.7** | **45.4** |

## Dataset Details

### Dataset Description

- **Curated by:** The DCLM Team
- **Language(s) (NLP):** English
- **License:** CC-by-4.0

### Dataset Sources

- **Repository:** https://datacomp.ai/dclm
- **Paper:** https://arxiv.org/abs/2406.11794
- **Construction Code:** https://github.com/mlfoundations/dclm

## Uses

### Direct Use

DCLM-Baseline is intended to be used as a research baseline for the DCLM benchmark. It demonstrates the importance of data curation in training performant language models.

### Out-of-Scope Use

DCLM-Baseline is not intended for training production-ready models or for specific domains such as code and math. It may not perform as well as domain-specific datasets for these tasks. Due to these limitations, the dataset is intended for research use only.

DCLM-Baseline is a subset of the DCLM-Pool, which is a corpus of 240 trillion tokens derived from Common Crawl. The dataset is in plain text format.

## Dataset Creation

### Curation Rationale

DCLM-Baseline was created to demonstrate the effectiveness of the DCLM testbed in developing high-quality training sets for language models. It serves as a proof of concept for the data curation strategies enabled by DCLM and is designed to be a research baseline for the benchmark.

### Source Data

#### Data Collection and Processing

DCLM-Baseline was created by applying a series of cleaning, filtering, and deduplication steps to the raw Common Crawl data (DCLM-Pool). The key steps include:

1. Heuristic cleaning and filtering (reproduction of RefinedWeb)
2. Deduplication using a Bloom filter
3. Model-based filtering using a fastText classifier trained on instruction-formatted data (OpenHermes 2.5 and r/ExplainLikeImFive)

#### Who are the source data producers?

The source data is from Common Crawl, which is a repository of web crawl data.

### Personal and Sensitive Information

[More Information Needed]

## Bias, Risks, and Limitations

The dataset may contain biases present in the Common Crawl data. The dataset's performance on code and math tasks is limited compared to its performance on language understanding tasks. DCLM-Baseline is designed for research purposes only.

### Recommendations

Users should be aware of the potential biases and limitations of the dataset, especially when using it for specific domains like code and math. The dataset should only be used for research purposes in the context of the DCLM benchmark.

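To make the Bloom filter deduplication step listed under Data Collection and Processing more concrete, here is a minimal Python sketch of the general technique. It is not the DCLM implementation: it drops whole documents whose whitespace-normalized, lowercased text has probably been seen before. The names `BloomFilter` and `dedup_documents`, the filter size, the hash count, and the normalization rule are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit slices
        # of a single SHA-256 digest (h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedup_documents(docs, num_bits=1 << 20, num_hashes=7):
    """Keep the first occurrence of each document (assumed normalization:
    collapse whitespace and lowercase before the membership check)."""
    seen = BloomFilter(num_bits, num_hashes)
    kept = []
    for doc in docs:
        key = " ".join(doc.split()).lower()
        if key in seen:
            continue  # probably a duplicate; a rare false positive drops a unique doc
        seen.add(key)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    pages = ["Hello world", "hello   world", "Another page", "Another page"]
    print(dedup_documents(pages))
```

The appeal of a Bloom filter at web scale is that memory stays fixed regardless of how many documents pass through, at the cost of occasionally discarding a unique document as a false positive.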
## Citation

```bibtex
@misc{li2024datacomplm,
  title={DataComp-LM: In search of the next generation of training sets for language models},
  author={Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton and Marianna Nezhurina and Amro Abbas and Cheng-Yu Hsieh and Dhruba Ghosh and Josh Gardner and Maciej Kilian and Hanlin Zhang and Rulin Shao and Sarah Pratt and Sunny Sanyal and Gabriel Ilharco and Giannis Daras and Kalyani Marathe and Aaron Gokaslan and Jieyu Zhang and Khyathi Chandu and Thao Nguyen and Igor Vasiljevic and Sham Kakade and Shuran Song and Sujay Sanghavi and Fartash Faghri and Sewoong Oh and Luke Zettlemoyer and Kyle Lo and Alaaeldin El-Nouby and Hadi Pouransari and Alexander Toshev and Stephanie Wang and Dirk Groeneveld and Luca Soldaini and Pang Wei Koh and Jenia Jitsev and Thomas Kollar and Alexandros G. Dimakis and Yair Carmon and Achal Dave and Ludwig Schmidt and Vaishaal Shankar},
  year={2024},
  eprint={2406.11794},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```


Application scenarios: