
KStack-clean

Source: ModelScope community. Updated 2025-12-05; indexed 2025-05-03.
Download link:
https://modelscope.cn/datasets/JetBrains/KStack-clean
Resource description:
# Dataset Summary

The dataset contains 25,000 Kotlin code samples selected from the [KStack](https://huggingface.co/datasets/JetBrains/KStack) dataset. Samples are selected for their value in learning algorithmic concepts in Kotlin. In total, the dataset contains about 23M [CodeLlama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) tokens (vocab size 32,016).

## Column description

The dataset contains the following columns:

- `size`: size of the file in bytes
- `content`: text (content) of the file after removing personally identifiable information
- `repo_id`: GitHub ID of the repository
- `path`: path to the file
- `owner`: repo owner on GitHub
- `name`: repo name on GitHub
- `commit_sha`: hash of the commit from which the revision of the file is taken
- `stars`: number of stars in the repo at the moment of collection
- `forks`: number of forks in the repo at the moment of collection
- `issues`: number of issues in the repo at the moment of collection
- `is_fork`: `true` if the repo is a fork, as defined by GitHub
- `main_language`: main language of the repo, as defined by GitHub
- `languages_distribution`: JSON with the distribution of languages by size in bytes in the repo
- `license`: permissive license of the repository

# Dataset Collection

The filtering from [KStack](https://huggingface.co/datasets/JetBrains/KStack) is performed using zero-shot quality estimation based on [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The model is prompted to determine which of two files has higher "educational value for learning algorithms in Kotlin". The results of the comparisons are averaged and used to train a binary classifier based on [CodeT5p-220m](https://huggingface.co/Salesforce/codet5p-220m). The binary classifier is then applied to the entire KStack to obtain a score for each sample in the dataset. The log-probability of the classifier prediction is used as the selection criterion.

# Opt-out

If you want your data to be removed from the dataset, or have any other questions, please reach out to Sergey Titov: <sergey.titov@jetbrains.com>
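The selection procedure described under Dataset Collection can be illustrated with a small sketch. It is not the authors' code: the function names, the 0.5 labeling threshold, and the sample probabilities are hypothetical, and the card does not say how the averaged comparisons were turned into training labels. The classifier itself is left out; the sketch only assumes it yields a positive-class probability greater than zero for each file.

```python
import math
from collections import defaultdict


def average_pairwise_scores(comparisons):
    """Average pairwise outcomes into a per-file win rate.

    `comparisons` holds (file_a, file_b, winner) tuples, where `winner`
    is whichever of the two files the prompted model preferred.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for file_a, file_b, winner in comparisons:
        total[file_a] += 1
        total[file_b] += 1
        wins[winner] += 1
    return {name: wins[name] / total[name] for name in total}


def to_binary_labels(scores, threshold=0.5):
    """Turn averaged scores into 0/1 labels (the threshold is an assumption)."""
    return {name: int(score >= threshold) for name, score in scores.items()}


def select_top_k(positive_probs, k):
    """Rank files by the log-probability of the positive class and keep k."""
    ranked = sorted(
        positive_probs,
        key=lambda name: math.log(positive_probs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical comparison results and classifier outputs.
comparisons = [
    ("a.kt", "b.kt", "a.kt"),
    ("a.kt", "c.kt", "a.kt"),
    ("b.kt", "c.kt", "c.kt"),
]
labels = to_binary_labels(average_pairwise_scores(comparisons))
selected = select_top_k({"a.kt": 0.9, "b.kt": 0.2, "c.kt": 0.6}, k=2)
print(labels, selected)
```

Because the logarithm is monotonic, ranking by log-probability gives the same order as ranking by probability; the log form is kept here only to mirror the wording of the card.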

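As a rough illustration of how the columns described above might be used, the sketch below filters records by repository metadata. The record values are made up, the thresholds are arbitrary, and it assumes `languages_distribution` is a JSON object mapping language names to byte counts, which the card implies but does not spell out.

```python
import json


def kotlin_share(record):
    """Fraction of repository bytes written in Kotlin (0.0 if unknown)."""
    distribution = json.loads(record["languages_distribution"])
    total = sum(distribution.values())
    return distribution.get("Kotlin", 0) / total if total else 0.0


def keep(record, min_stars=10, min_share=0.5):
    """Keep non-fork, reasonably starred, mostly-Kotlin repositories."""
    return (
        not record["is_fork"]
        and record["stars"] >= min_stars
        and kotlin_share(record) >= min_share
    )


# Hypothetical record using the column names from the dataset card.
record = {
    "size": 1024,
    "content": "fun main() = println(\"hello\")",
    "path": "src/Main.kt",
    "stars": 42,
    "is_fork": False,
    "main_language": "Kotlin",
    "languages_distribution": '{"Kotlin": 9000, "Java": 1000}',
    "license": "Apache-2.0",
}
print(kotlin_share(record), keep(record))
```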
Provider:
maas
Created:
2025-04-30