Hugodonotexit/dolma-en

Name: Hugodonotexit/dolma-en
Creator: Hugodonotexit
Published: 2026-02-15 19:25:24
License: No description available

Hugging Face | Updated 2026-02-15 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Hugodonotexit/dolma-en

Resource overview:

---
pretty_name: Dolma English
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- dolma
- english
---

# Dolma-English

## Overview

This dataset is a filtered subset of the Dolma corpus, restricted to English-only documents and further constrained by a minimum document length threshold. It is intended for training and evaluating large language models and other NLP systems that benefit from higher-quality, sufficiently long English text.

The primary goals of this dataset are:

* To reduce the multilingual and very short or noisy content present in the raw Dolma corpus.
* To provide a cleaner, more model-ready English text collection.

## Source Dataset

The original data is derived from **Dolma**, a large-scale, open text corpus constructed from diverse web and document sources for language model pretraining.

Upstream project:

* Dolma: [https://github.com/allenai/dolma](https://github.com/allenai/dolma)

Please refer to the Dolma documentation for details about the original data collection methodology, source composition, and preprocessing pipeline.

## Filtering and Processing

The following filters were applied to the original Dolma corpus:

1. **Language Filter (English Only)**
   * Documents were retained only if they were classified as English by a language identification model.
2. **Minimum Length Filter**
   * Documents shorter than a specified minimum length were removed.
   * Length is measured in characters or tokens (depending on the preprocessing configuration used during dataset construction).
3. **Basic Cleaning (if applicable)**
   * Removal of empty or malformed records.
   * Normalization of whitespace.

## Dataset Structure

Each record in the dataset contains the following fields:

* `text` (string): The full English document text.

## Splits

This dataset contains exactly two splits:

* `train`
  * Derived from **Dolma v1.7**.
  * Filtered to English-only documents and a minimum length threshold.
* `validation`
  * Derived from the **Dolma v1.6 sample**.
  * Filtered using the same English-only and minimum length criteria as the training data.

## License

This dataset inherits the licensing terms of the original Dolma corpus and its upstream sources. Please consult the Dolma repository and associated documentation for full licensing details:

* [https://github.com/allenai/dolma](https://github.com/allenai/dolma)

## Citation

If you use this dataset in your work, please cite the Dolma corpus:

```bibtex
@article{dolma2023,
  title   = {Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research},
  author  = {Soldaini, Luca and others},
  journal = {arXiv preprint arXiv:2402.00159},
  year    = {2024}
}
```

You may also reference this filtered dataset as:

```bibtex
@dataset{dolma_english_minlength,
  title  = {Dolma English (Minimum Length Filtered)},
  author = {Hugodonotexit},
  year   = {2026},
  url    = {https://huggingface.co/datasets/Hugodonotexit/dolma-en}
}
```

## Acknowledgements

This dataset is based on the Dolma corpus created and released by the Allen Institute for AI (AI2). We thank the original authors and contributors for making the data publicly available.
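
## Example: Filtering Sketch

The steps listed under "Filtering and Processing" can be sketched roughly as below. This is an illustration only, not the pipeline actually used to build the dataset: the card does not name the language identification model or the length threshold, so `ascii_ratio_is_english` (a crude stand-in for a real classifier), `min_chars=200`, and the function names are all hypothetical. Length is measured in characters here; the original configuration may have used tokens.

```python
import re
from typing import Callable, Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends.

    Simplification: this also flattens newlines, which a real pipeline
    might prefer to keep.
    """
    return _WHITESPACE.sub(" ", text).strip()


def ascii_ratio_is_english(text: str, threshold: float = 0.9) -> bool:
    """Hypothetical stand-in for a language-ID model.

    Treats text as English when at least `threshold` of its characters
    are ASCII. A real pipeline would call an actual classifier instead.
    """
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= threshold


def filter_records(
    records: Iterable[dict],
    min_chars: int = 200,  # assumed value; the card gives no threshold
    is_english: Callable[[str], bool] = ascii_ratio_is_english,
) -> Iterator[dict]:
    """Yield cleaned records that pass the length and language filters."""
    for record in records:
        text = record.get("text")
        # Basic cleaning: skip empty or malformed records.
        if not isinstance(text, str):
            continue
        text = normalize_whitespace(text)
        # Minimum length filter, measured in characters.
        if len(text) < min_chars:
            continue
        # Language filter (English only).
        if not is_english(text):
            continue
        yield {"text": text}
```

Passing a different callable as `is_english` swaps in a real language identifier without changing the rest of the loop, and each yielded record keeps the single `text` field described under "Dataset Structure".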

Provider:

Hugodonotexit
