smollm-corpus
ModelScope community · updated 2026-01-10 · indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/HuggingFaceTB/smollm-corpus
# SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models.
You can find more details about the models trained on this dataset in our [SmolLM blog post](https://huggingface.co/blog/smollm).
# Dataset subsets
## Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
Most samples were generated by prompting the model to write about a specific topic, using a web page (referred to as a "seed sample") as the starting point, as shown in Figure 1. We use web samples to increase diversity and expand the range of prompts.
You can find more details in this [blog post](https://huggingface.co/blog/smollm).
### Dataset Features
* `prompt (string)`: The input prompt used to generate the text.
* `text (string)`: The generated text content.
* `token_length (int64)`: The length of the text in tokens (Mistral-7B tokenizer).
* `audience (string)`: The intended audience for the content.
* `format (string)`: The format of the content (e.g., textbook, story).
* `seed_data (string)`: The seed sample used to generate the text.
### Loading the dataset
```python
from datasets import load_dataset
ds = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", num_proc=16)
print(ds[0])
```
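The full `cosmopedia-v2` split is large, so it can help to inspect a handful of records before committing to a full download. The following is a minimal sketch rather than part of the original card: `token_totals_by_format` and `stream_cosmopedia` are hypothetical helper names, and the sample rows are invented to mimic the features listed above. The helper sums `token_length` per `format` over any iterable of record dictionaries.

```python
from collections import Counter
from itertools import islice


def token_totals_by_format(rows, limit=None):
    """Sum `token_length` per `format` over an iterable of record dicts."""
    totals = Counter()
    # islice stops after `limit` records; limit=None consumes the whole iterable.
    for row in islice(rows, limit):
        totals[row["format"]] += row["token_length"]
    return dict(totals)


def stream_cosmopedia():
    """Open the subset lazily (needs network access and the `datasets` package)."""
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", streaming=True
    )


# Invented rows that mimic the documented schema; real values will differ.
sample_rows = [
    {"format": "textbook", "token_length": 700, "audience": "college_students"},
    {"format": "story", "token_length": 300, "audience": "children"},
    {"format": "textbook", "token_length": 500, "audience": "general"},
]
totals = token_totals_by_format(sample_rows)
print(totals)
```

Under these assumptions, `token_totals_by_format(stream_cosmopedia(), limit=1000)` should read only the first 1,000 streamed records instead of materializing the whole split.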
## Python-Edu
The `python-edu` subset consists of Python files that were scored 4 or more by the [educational code model](https://huggingface.co/HuggingFaceTB/python-edu-scorer).
The files were extracted from the [`stack-v2-train`](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids) dataset.
### Dataset Features
* `blob_id (string)`: Software Heritage (SWH) ID of the file on AWS S3.
* `repo_name (string)`: Repository name on GitHub.
* `path (string)`: The file path within the repository.
* `length_bytes (int64)`: Length of the file content in UTF-8 bytes.
* `score (float32)`: The output of the educational scoring model.
* `int_score (uint8)`: The rounded educational score.
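Because the metadata rows carry `int_score` and `length_bytes`, one option is to narrow the selection before fetching any file contents. The sketch below is illustrative only: `keep_row` is a hypothetical helper, the thresholds are arbitrary, and the rows are invented to follow the schema above.

```python
def keep_row(row, min_int_score=4, max_bytes=100_000):
    """Return True for rows that pass the chosen score and size thresholds."""
    return row["int_score"] >= min_int_score and row["length_bytes"] <= max_bytes


# Invented rows following the documented schema; only the fields used are filled in.
rows = [
    {"blob_id": "aaa", "int_score": 4, "length_bytes": 2_048},
    {"blob_id": "bbb", "int_score": 5, "length_bytes": 500_000},
    {"blob_id": "ccc", "int_score": 4, "length_bytes": 99_999},
]
selected = [r["blob_id"] for r in rows if keep_row(r)]
print(selected)
```

The same predicate could be passed to `ds.filter(keep_row)` on the loaded subset. Since every file in `python-edu` already scored 4 or more, raising `min_int_score` to 5 is the setting that would actually tighten the selection.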
### Downloading the data
The file contents are downloaded from Software Heritage's S3 bucket to ensure data compliance.
Please refer to [the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids) for the data license.
When running on a 16-core AWS `us-east-1` instance, this script takes ~6 hours to download the files:
```python
import boto3
import gzip
from datasets import load_dataset
from botocore.exceptions import ClientError

num_proc = 16
s3 = boto3.client('s3')
bucket_name = "softwareheritage"


def download_contents(blob_id):
    key = f"content/{blob_id}"
    try:
        obj = s3.get_object(Bucket=bucket_name, Key=key)
        with gzip.GzipFile(fileobj=obj['Body']) as fin:
            content = fin.read().decode("utf-8", errors="ignore")
        return {"text": content, "download_success": True}
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchKey':
            print(f"File not found: {key}")
            return {"text": "", "download_success": False}
        else:
            raise


ds = load_dataset("HuggingFaceTB/smollm-corpus", "python-edu", split="train", num_proc=num_proc)
ds = ds.map(download_contents, input_columns="blob_id", num_proc=num_proc)

# Filter out failed downloads
ds = ds.filter(lambda x: x['download_success'])

# Optionally, print the first example to verify the data
print(ds[0])
```
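The decoding step in the script above can be exercised without AWS access. The following sketch isolates it in a hypothetical helper, `decode_blob`, and feeds it an in-memory gzip payload built for illustration instead of a real S3 object body.

```python
import gzip
import io


def decode_blob(fileobj):
    """Decompress a gzip stream and decode it as UTF-8, dropping invalid bytes."""
    with gzip.GzipFile(fileobj=fileobj) as fin:
        return fin.read().decode("utf-8", errors="ignore")


# Build an in-memory gzip payload; the trailing 0xFF byte is not valid UTF-8.
payload = gzip.compress(b"print('hello')\n\xff")
text = decode_blob(io.BytesIO(payload))
print(repr(text))
```

Note that `errors="ignore"` silently discards bytes that are not valid UTF-8, so the decoded text can be shorter than the stored file. If that matters for a given use case, `errors="replace"` or a strict decode with explicit error handling may be a better fit.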
## FineWeb-Edu (deduplicated)
FineWeb-Edu-Dedup is a deduplicated subset of the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset, containing 220 billion tokens of educational web pages.
The source dataset was filtered using an educational quality classifier to retain only the highest quality educational content.
For more information, refer to the [FineWeb-v1 blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
### Dataset Features
* `text (string)`: The web page's text content.
* `id (string)`: Unique ID of the web page.
* `metadata (struct)`: Metadata about the web page, including:
  * `dump (string)`: The source CommonCrawl dump.
  * `url (string)`: The URL of the web page.
  * `date (timestamp[s])`: The date the web page was captured.
  * `file_path (string)`: The file path of the CommonCrawl snapshot.
  * `language (string)`: The language of the web page.
  * `language_score (float64)`: The language probability.
  * `token_count (int64)`: The token count of the web page (gpt2 tokenizer).
  * `score (float64)`: The educational quality score.
  * `int_score (int64)`: The rounded educational quality score.
### Loading the dataset
```python
from datasets import load_dataset
ds = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup", split="train", num_proc=16)
print(ds[0])
```
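In this subset most per-page fields sit inside the `metadata` struct, so they are read with a second lookup rather than as top-level columns. The sketch below assumes that nested layout. `total_tokens` is a hypothetical helper and the rows are invented, with only the fields used here filled in.

```python
def total_tokens(rows, min_int_score=3):
    """Sum metadata.token_count over rows whose metadata.int_score meets the threshold."""
    return sum(
        row["metadata"]["token_count"]
        for row in rows
        if row["metadata"]["int_score"] >= min_int_score
    )


# Invented rows; real records carry the full set of metadata fields.
rows = [
    {"id": "a", "text": "sample page a", "metadata": {"token_count": 120, "int_score": 3}},
    {"id": "b", "text": "sample page b", "metadata": {"token_count": 80, "int_score": 4}},
    {"id": "c", "text": "sample page c", "metadata": {"token_count": 50, "int_score": 2}},
]
print(total_tokens(rows))
print(total_tokens(rows, min_int_score=4))
```

The same access pattern, `row["metadata"]["token_count"]`, should apply to records returned by `load_dataset`, whether loaded fully or streamed.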
## Citation
```
@software{benallal2024smollmcorpus,
author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
title = {SmolLM-Corpus},
month = July,
year = 2024,
url = {https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus}
}
```
Provider:
maas
Created:
2025-09-08



