KStack-clean
Source: ModelScope community. Updated 2025-12-05; indexed 2025-05-03.
Download link:
https://modelscope.cn/datasets/JetBrains/KStack-clean
# Dataset Summary
The dataset contains 25,000 Kotlin code samples selected from the [KStack](https://huggingface.co/datasets/JetBrains/KStack) dataset. The selection is performed based on the value of the code for learning algorithmic concepts in Kotlin. In total, the dataset contains about 23M [CodeLlama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) tokens (vocab size 32,016).
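As a rough sanity check, the two figures above imply an average sample length. The snippet below is only back-of-the-envelope arithmetic on the quoted numbers; since the token total is stated as "about 23M", the result is approximate.

```python
# Figures quoted in the summary; the token total is approximate ("about 23M").
TOTAL_TOKENS = 23_000_000
NUM_SAMPLES = 25_000

# Mean number of CodeLlama-7b tokens per Kotlin file.
avg_tokens_per_sample = TOTAL_TOKENS / NUM_SAMPLES
print(f"~{avg_tokens_per_sample:.0f} tokens per sample on average")
```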
## Column description
The dataset contains the following columns:
- `size` — size of the file in bytes
- `content` — text (content) of the file after removing personally identifiable information
- `repo_id` — GitHub ID of the repository
- `path` — path to a file
- `owner` — repo owner on GitHub
- `name` — repo name on GitHub
- `commit_sha` — hash of the commit, from which the revision of the file is taken
- `stars` — number of stars in the repo at the moment of collection
- `forks` — number of forks in the repo at the moment of collection
- `issues` — number of issues in the repo at the moment of collection
- `is_fork` — `true` if the repo is a fork, as defined by GitHub
- `main_language` — main language of the repo as defined by GitHub
- `languages_distribution` — JSON with the distribution of languages by size in bytes in the repo
- `license` — permissive license of the repository
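The columns above can be modelled as a typed record. The following is a minimal sketch, not an official schema: the field names come from the list, but the Python types, the class name `KStackCleanRow`, the helper `language_shares`, and the assumption that `languages_distribution` maps language names to byte counts are all illustrative.

```python
import json
from dataclasses import dataclass


@dataclass
class KStackCleanRow:
    """One dataset row; field names follow the column list, types are assumed."""
    size: int
    content: str
    repo_id: int
    path: str
    owner: str
    name: str
    commit_sha: str
    stars: int
    forks: int
    issues: int
    is_fork: bool
    main_language: str
    languages_distribution: str  # JSON text, assumed to map language -> bytes
    license: str

    def language_shares(self) -> dict[str, float]:
        """Convert per-language byte counts into fractions of the repo total."""
        byte_counts = json.loads(self.languages_distribution)
        total = sum(byte_counts.values())
        if total == 0:
            return {}
        return {lang: count / total for lang, count in byte_counts.items()}


# Made-up example row, not taken from the dataset.
content_text = 'fun main() = println("hi")'
example = KStackCleanRow(
    size=len(content_text.encode("utf-8")),
    content=content_text,
    repo_id=1,
    path="src/Main.kt",
    owner="someone",
    name="demo",
    commit_sha="0" * 40,
    stars=3,
    forks=0,
    issues=1,
    is_fork=False,
    main_language="Kotlin",
    languages_distribution='{"Kotlin": 750, "Java": 250}',
    license="MIT",
)
shares = example.language_shares()
```

If the real column types differ (for instance, if `repo_id` is stored as a string), the annotations would need adjusting accordingly.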
# Dataset Collection
The samples are filtered from [KStack](https://huggingface.co/datasets/JetBrains/KStack) using zero-shot quality estimation based on [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The model is prompted to determine which of two files has higher "educational value for learning algorithms in Kotlin". The results of the comparisons are averaged and used to train a binary classifier based on [CodeT5p-220m](https://huggingface.co/Salesforce/codet5p-220m). The binary classifier is then applied to the entire KStack to obtain a score for each sample in the dataset. The log-probability of the classifier prediction is used as the selection criterion.
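The two bookkeeping steps of this pipeline, averaging pairwise judgments and selecting by log-probability, can be illustrated with toy data. This is a sketch under stated assumptions rather than the authors' code: the function names, the win-fraction form of averaging, and the top-k selection rule are hypothetical, and hard-coded values stand in for the outputs of the LLM judge and the trained classifier.

```python
from collections import defaultdict


def average_pairwise_scores(comparisons):
    """Average comparison outcomes per file.

    `comparisons` holds (file_a, file_b, winner) tuples, where `winner` is
    the file judged to have the higher educational value. Each file's score
    is the fraction of its comparisons that it won (an assumed averaging rule).
    """
    wins = defaultdict(int)
    seen = defaultdict(int)
    for file_a, file_b, winner in comparisons:
        for f in (file_a, file_b):
            seen[f] += 1
        wins[winner] += 1
    return {f: wins[f] / seen[f] for f in seen}


def select_top_k(log_probs, k):
    """Return the k sample ids with the highest classifier log-probability.

    Top-k is an assumption; the source only says log-probability is the criterion.
    """
    ranked = sorted(log_probs, key=log_probs.get, reverse=True)
    return ranked[:k]


# Toy judgments standing in for the LLM's pairwise answers.
toy_comparisons = [
    ("BinarySearch.kt", "Config.kt", "BinarySearch.kt"),
    ("BinarySearch.kt", "Dijkstra.kt", "Dijkstra.kt"),
    ("Dijkstra.kt", "Config.kt", "Dijkstra.kt"),
]
pairwise_scores = average_pairwise_scores(toy_comparisons)

# Toy log-probabilities standing in for the trained classifier's output.
toy_log_probs = {"BinarySearch.kt": -0.4, "Config.kt": -3.2, "Dijkstra.kt": -0.1}
selected = select_top_k(toy_log_probs, k=2)
```

In the described pipeline, the averaged scores would serve as training targets for the classifier, and the selection step would run over classifier scores for the whole of KStack rather than three files.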
# Opt-out
If you want your data to be removed from the dataset, or have any other questions, please reach out to Sergey Titov: <sergey.titov@jetbrains.com>
Provider: maas
Created: 2025-04-30



