
the-stack-smol

ModelScope Community. Updated 2025-12-05. Indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/bigcode/the-stack-smol
Resource description:
## Dataset Description

![Smol](https://huggingface.co/datasets/bigcode/admin/resolve/main/smol.png)

A small subset (~0.1%) of the [the-stack](https://huggingface.co/datasets/bigcode/the-stack) dataset. Each programming language has 10,000 random samples drawn from the original dataset. The dataset has 2.6GB of text (code).

## Languages

The dataset contains 30 programming languages:

```
"assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust", "scala", "shell", "sql", "tex", "typescript", "visual-basic"
```

## Dataset Structure

```python
from datasets import load_dataset

# Load the whole dataset (all 30 languages).
ds = load_dataset("bigcode/the-stack-smol")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
#         num_rows: 300000
#     })
# })
```

### How to use it

You can either load the whole dataset as above, or load a single language such as Python by specifying its data directory:

```python
from datasets import load_dataset

# Load only the Python samples.
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang'],
#         num_rows: 10000
#     })
# })
```
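The per-language loading pattern can be wrapped in a small helper. The following is a minimal sketch, not part of the dataset card: the names `LANGUAGES`, `EXPECTED_TOTAL_ROWS`, `data_dir_for`, and `load_language` are hypothetical, and it assumes every language folder follows the `data/<language>` layout that the card demonstrates only for Python.

```python
# Language identifiers exactly as listed in the dataset card.
LANGUAGES = (
    "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css",
    "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript",
    "julia", "lua", "makefile", "markdown", "perl", "php", "powershell",
    "python", "ruby", "rust", "scala", "shell", "sql", "tex", "typescript",
    "visual-basic",
)

SAMPLES_PER_LANGUAGE = 10_000  # stated in the dataset card
# 30 languages x 10,000 samples matches the num_rows reported for the full load.
EXPECTED_TOTAL_ROWS = len(LANGUAGES) * SAMPLES_PER_LANGUAGE


def data_dir_for(lang: str) -> str:
    """Return the data_dir for a language (assumed layout: data/<language>)."""
    key = lang.strip().lower()
    if key not in LANGUAGES:
        raise ValueError(f"Unknown language {lang!r}; expected one of {LANGUAGES}")
    return f"data/{key}"


def load_language(lang: str):
    """Load the train split for one language (requires `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset(
        "bigcode/the-stack-smol",
        data_dir=data_dir_for(lang),
        split="train",
    )
```

Calling `load_language("python")` would then return the `train` split directly rather than a `DatasetDict`, because `split="train"` is passed; drop that argument to get the same structure as the examples in the card.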

Provider:
maas
Created:
2025-10-11