the-stack-smol
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/bigcode/the-stack-smol
Resource description:
## Dataset Description

A small subset (~0.1%) of the [the-stack](https://huggingface.co/datasets/bigcode/the-stack) dataset, with 10,000 samples per programming language drawn at random from the original dataset. The dataset contains 2.6 GB of text (code).
## Languages
The dataset contains 30 programming languages:
```
"assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java",
"javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust",
"scala", "shell", "sql", "tex", "typescript", "visual-basic"
```
## Dataset Structure
```python
from datasets import load_dataset

# Load the full dataset (all 30 languages) and print its structure.
ds = load_dataset("bigcode/the-stack-smol")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
#         num_rows: 300000
#     })
# })
```
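The feature names suggest that several columns are per-file statistics derived from `content`. As a rough sketch, the function below recomputes three of them for a single string. The exact definitions used to build the dataset are not stated in this card, so the formulas here (mean and maximum of line lengths, share of alphanumeric characters) are assumptions, and `file_stats` is a hypothetical helper rather than part of any library.

```python
def file_stats(content: str) -> dict:
    """Recompute simple per-file statistics (assumed definitions)."""
    # Treat an empty file as a single empty line to avoid dividing by zero.
    lines = content.splitlines() or [""]
    line_lengths = [len(line) for line in lines]
    # Count characters that are letters or digits.
    alnum_count = sum(ch.isalnum() for ch in content)
    return {
        "avg_line_length": sum(line_lengths) / len(line_lengths),
        "max_line_length": max(line_lengths),
        "alphanum_fraction": alnum_count / len(content) if content else 0.0,
    }


sample = "x = 1\nprint(x)\n"
stats = file_stats(sample)
print(stats)
```

Values computed this way may differ from the stored columns if the dataset was built with other conventions, for example a different treatment of trailing newlines.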
### How to use it
You can either load the whole dataset as shown above, or load a single language, such as Python, by passing its folder as `data_dir`:
```python
from datasets import load_dataset

# Load only the Python subset by pointing data_dir at its folder.
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
#         num_rows: 10000
#     })
# })
```
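When working with several languages, the `data_dir` value can be built from the language name. The sketch below assumes that every language in the list above is stored under `data/<language>`, as `data/python` is; that layout is an assumption for the other 29 languages. `data_dir_for` and `load_language` are hypothetical helpers introduced for illustration.

```python
# The 30 language names listed in the Languages section.
LANGUAGES = [
    "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css",
    "dockerfile", "fortran", "go", "haskell", "html", "java",
    "javascript", "julia", "lua", "makefile", "markdown", "perl", "php",
    "powershell", "python", "ruby", "rust", "scala", "shell", "sql",
    "tex", "typescript", "visual-basic",
]


def data_dir_for(lang: str) -> str:
    """Return the assumed folder for a language, e.g. 'data/python'."""
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language: {lang!r}")
    return f"data/{lang}"


def load_language(lang: str):
    """Load one language subset; this downloads data when called."""
    # Imported lazily so data_dir_for works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("bigcode/the-stack-smol", data_dir=data_dir_for(lang))


print(data_dir_for("python"))
```

Validating the name before calling `load_dataset` turns a typo into an immediate `ValueError` instead of a failed download.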
Provider:
maas
Created:
2025-10-11



