KStack-clean

Name: KStack-clean
Creator: maas
Published: 2025-12-05 16:32:59
License: No description available

ModelScope community. Updated 2025-12-05. Indexed 2025-05-03.

Download link:

https://modelscope.cn/datasets/JetBrains/KStack-clean

Resource description:

# Dataset Summary

The dataset contains 25,000 Kotlin code samples selected from the [KStack](https://huggingface.co/datasets/JetBrains/KStack) dataset. Samples are selected for their value in learning algorithmic concepts in Kotlin. In total, the dataset contains about 23M [CodeLlama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) tokens (vocab size 32,016).

## Column description

The dataset contains the following columns:

- `size`: size of the file in bytes
- `content`: text (content) of the file after removing personally identifiable information
- `repo_id`: GitHub ID of the repository
- `path`: path to the file
- `owner`: repo owner on GitHub
- `name`: repo name on GitHub
- `commit_sha`: hash of the commit from which the file revision is taken
- `stars`: number of stars in the repo at the moment of collection
- `forks`: number of forks in the repo at the moment of collection
- `issues`: number of issues in the repo at the moment of collection
- `is_fork`: `true` if the repo is a fork, as defined by GitHub
- `main_language`: main language of the repo, as defined by GitHub
- `languages_distribution`: JSON with the distribution of languages by size in bytes in the repo
- `license`: permissive license of the repository

# Dataset Collection

The samples are filtered from [KStack](https://huggingface.co/datasets/JetBrains/KStack) using zero-shot quality estimation based on [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The model is prompted to determine which of two files has higher "educational value for learning algorithms in Kotlin". The results of the comparisons are averaged and used to train a binary classifier based on [CodeT5p-220m](https://huggingface.co/Salesforce/codet5p-220m). The binary classifier is then applied to the entire KStack to obtain a score for each sample in the dataset. The log-probability of the classifier prediction is used as the selection criterion.

# Opt-out

If you want your data to be removed from the dataset, or have any other questions, please reach out to Sergey Titov: <sergey.titov@jetbrains.com>
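As a rough illustration of the selection procedure described under Dataset Collection, the sketch below averages pairwise comparison outcomes into per-file scores, turns them into binary training labels, and keeps the samples with the highest classifier log-probability. It is not the authors' code. The function names, the 0.5 label threshold, the top-k selection rule, and the toy file names are all assumptions made for illustration; the dataset card does not specify how the averages become labels or how the log-probability cutoff is chosen.

```python
import math
from collections import defaultdict


def average_preferences(comparisons):
    """Average pairwise outcomes into one score per file.

    `comparisons` is an iterable of (file_a, file_b, winner) tuples, where
    `winner` equals file_a or file_b. The score is the fraction of the
    comparisons a file took part in that it won.
    """
    wins = defaultdict(int)
    seen = defaultdict(int)
    for file_a, file_b, winner in comparisons:
        seen[file_a] += 1
        seen[file_b] += 1
        wins[winner] += 1
    return {name: wins[name] / seen[name] for name in seen}


def to_binary_labels(scores, threshold=0.5):
    """Turn averaged scores into 0/1 labels (the threshold is an assumption)."""
    return {name: int(score > threshold) for name, score in scores.items()}


def select_top_k(log_probs, k):
    """Keep the k sample ids with the highest log-probability (assumed rule)."""
    ranked = sorted(log_probs, key=log_probs.get, reverse=True)
    return ranked[:k]


# Toy data standing in for LLM comparison results and classifier outputs.
comparisons = [
    ("a.kt", "b.kt", "a.kt"),
    ("a.kt", "c.kt", "a.kt"),
    ("b.kt", "c.kt", "c.kt"),
]
scores = average_preferences(comparisons)
labels = to_binary_labels(scores)
log_probs = {"a.kt": math.log(0.9), "b.kt": math.log(0.2), "c.kt": math.log(0.6)}
selected = select_top_k(log_probs, k=2)
```

In a real pipeline the labels would be used to train the classifier, and `log_probs` would come from that classifier's predictions over the whole of KStack rather than from hand-written values.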

Provider:

maas

Created:

2025-04-30
