swallow-code-v0.1
Source: ModelScope community (updated 2025-11-27, indexed 2025-11-03)
Download link:
https://modelscope.cn/datasets/tokyotech-llm/swallow-code-v0.1
## What is it?
Swallow-code-v0.1 consists of four staged dataset subsets, each filtered from [bigcode/the-stack-v2-train-smol-ids](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).
## What is being released?
The dataset is released in four versions:
- **Swallow Code v0.1 stage 1**: 36B tokens, 41M documents containing Python scripts.
- **Swallow Code v0.1 stage 2**: 31B tokens, 37M documents containing Python scripts that are syntax error-free.
- **Swallow Code v0.1 stage 3**: 20B tokens, 24M documents containing Python scripts that are filtered by pylint score.
- **Swallow Code v0.1 stage 4**: 16B tokens, 21M documents containing Python scripts that are filtered by language detection on code comments and literals (only English and Japanese are kept).
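
The stage 2 criterion (scripts that are free of syntax errors) can be illustrated with a minimal sketch. It assumes the check is done by parsing each script with Python's built-in `ast` module. The card does not say which tool or Python version the authors used, so `is_syntax_valid` and `filter_syntax_valid` are hypothetical helper names, not part of the released pipeline.

```python
import ast


def is_syntax_valid(source: str) -> bool:
    """Return True if `source` parses as Python under the running interpreter.

    Assumption: a plain `ast.parse` call stands in for the syntax check;
    the dataset card does not describe the actual implementation.
    """
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs such as
        # embedded null bytes on some Python versions.
        return False
    return True


def filter_syntax_valid(documents):
    """Keep only documents whose `text` field parses successfully."""
    return [doc for doc in documents if is_syntax_valid(doc["text"])]


# Two toy documents: the first parses, the second has a malformed signature.
samples = [
    {"text": "def add(a, b):\n    return a + b\n"},
    {"text": "def broken(:\n    pass\n"},
]
kept = filter_syntax_valid(samples)
```

Note that the result depends on the interpreter version: a script written for Python 2 may fail to parse under Python 3, so any reproduction would need to fix the version used.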
## Results and Performance
Performance of Llama-3.1-8B after continual pretraining on 50B tokens of Japanese, English, and code (swallow-code-v0.1) data.

## Dataset Schema
```python
{
"blob_id": string,
"path": string,
"content_id": string,
"language": string,
"length_bytes": int64,
"detected_licenses": list,
"license_type": string,
"src_encoding": string,
"is_vendor": bool,
"is_generated": bool,
"alphanum_fraction": float64,
"alpha_fraction": float64,
"num_lines": int64,
"avg_line_length": float64,
"max_line_length": int64,
"text": string,
"analysis_results": list,
"has_issues": bool,
"language_type_issue": list,
"language_type": string,
"pylint_score": int64,
"pylint_output": string
}
```
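
As a hedged illustration of how the schema fields might be used downstream, the sketch below filters an iterable of records on a few metadata columns. The thresholds (`min_pylint_score`, `max_line_length`) and the function name `select_records` are assumptions chosen for the example. They are not the criteria the authors applied when building the stages.

```python
from typing import Any, Dict, Iterable, Iterator


def select_records(
    records: Iterable[Dict[str, Any]],
    min_pylint_score: int = 7,    # hypothetical threshold, not from the card
    max_line_length: int = 1000,  # hypothetical threshold, not from the card
) -> Iterator[Dict[str, Any]]:
    """Yield records that pass simple checks on the schema's metadata fields."""
    for record in records:
        # Skip vendored or auto-generated files.
        if record["is_vendor"] or record["is_generated"]:
            continue
        # Skip low-scoring scripts.
        if record["pylint_score"] < min_pylint_score:
            continue
        # Skip files with extremely long lines (often minified or data blobs).
        if record["max_line_length"] > max_line_length:
            continue
        yield record


# Toy records containing only the fields the filter reads.
example_records = [
    {"blob_id": "a", "is_vendor": False, "is_generated": False,
     "pylint_score": 9, "max_line_length": 88},
    {"blob_id": "b", "is_vendor": True, "is_generated": False,
     "pylint_score": 10, "max_line_length": 80},
    {"blob_id": "c", "is_vendor": False, "is_generated": False,
     "pylint_score": 3, "max_line_length": 79},
]
selected_ids = [r["blob_id"] for r in select_records(example_records)]
```

Because `select_records` is a generator, it can be applied to a streamed dataset without loading every record into memory.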
## Licensing information
Swallow-code-v0.1 follows the license of The Stack v2, which is reproduced below.
The Stack v2 is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack v2 must abide by the terms of the original licenses, including attribution clauses when relevant. We facilitate this by providing provenance information for each data point.
## Citation information
```
@misc{fujii2024swallowcode,
author = { Kazuki Fujii and Rio Yokota },
title = { Swallow-Code-v0.1 },
year = 2024,
url = { https://huggingface.co/datasets/tokyotech-llm/swallow-code-v0.1 },
publisher = { Swallow Project }
}
```
Provider: maas
Created: 2025-10-12



