the-stack-smol

Name: the-stack-smol
Creator: maas
Published: 2025-12-05 11:37:37
License: not specified

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/bigcode/the-stack-smol


Resource description:

## Dataset Description

![Smol](https://huggingface.co/datasets/bigcode/admin/resolve/main/smol.png)

A small subset (~0.1%) of the [the-stack](https://huggingface.co/datasets/bigcode/the-stack) dataset, in which each programming language has 10,000 random samples drawn from the original dataset. The dataset contains 2.6GB of text (code).

## Languages

The dataset contains 30 programming languages:

```
"assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust", "scala", "shell", "sql", "tex", "typescript", "visual-basic"
```

## Dataset Structure

```python
from datasets import load_dataset

# Load the whole dataset (all 30 languages).
ds = load_dataset("bigcode/the-stack-smol")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
#         num_rows: 300000
#     })
# })
```

### How to use it

You can either load the whole dataset as shown above, or load a specific language, such as Python, by specifying its folder with `data_dir`:

```python
from datasets import load_dataset

# Load only the Python subset.
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
#         num_rows: 10000
#     })
# })
```
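The per-language loading described above can be wrapped in a small helper. The following is a minimal sketch, not part of the dataset card: `data_dir_for` and `load_language` are hypothetical names, and the code assumes that every language uses the same `data/<name>` folder layout that the card shows only for Python.

```python
# The 30 language names listed in the dataset card.
LANGUAGES = (
    "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css",
    "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript",
    "julia", "lua", "makefile", "markdown", "perl", "php", "powershell",
    "python", "ruby", "rust", "scala", "shell", "sql", "tex", "typescript",
    "visual-basic",
)
SAMPLES_PER_LANGUAGE = 10_000


def data_dir_for(language: str) -> str:
    """Return the data_dir value for one language (hypothetical helper).

    Assumes every language follows the "data/<name>" layout that the
    card demonstrates for python.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"data/{language}"


def load_language(language: str):
    """Load the subset for one language (requires network access)."""
    # Imported lazily so the helpers above work without the datasets package.
    from datasets import load_dataset

    return load_dataset("bigcode/the-stack-smol", data_dir=data_dir_for(language))


# Sanity check: 30 languages x 10,000 samples matches the full dataset's num_rows.
assert len(LANGUAGES) * SAMPLES_PER_LANGUAGE == 300_000
```

Calling `load_language("rust")` would then request `data_dir="data/rust"`; if a folder name differs from the listed language name, adjust the mapping in `data_dir_for` accordingly.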


Provider:

maas

Created:

2025-10-11

Collection summary

Dataset introduction