
KStack

Source: ModelScope. Updated 2025-12-05. Indexed 2025-05-03.
Download link: https://modelscope.cn/datasets/JetBrains/KStack
# Dataset Summary

KStack is the largest collection of permissively licensed Kotlin code.

![banner](https://huggingface.co/datasets/JetBrains/KStack/resolve/main/banner.png)

## Comparison with The Stack v2

The table below compares the Kotlin part of The Stack v2 with KStack:

|                        | Files | Repositories | Lines | Tokens |
|------------------------|:-----:|:------------:|:-----:|:------:|
| Kotlin in The Stack v2 | 2M | 109,457 | 162M | 1.7B |
| KStack | 4M | 168,902 | 292M | 3.1B |

# Dataset Creation

## Collection procedure

We collected repositories from GitHub whose main language is Kotlin, as well as any repositories containing Kotlin files that had received 10 or more stars (as of February 2024). Additionally, we gathered repositories with Kotlin files from The Stack v1.2. Kotlin files were identified using [go-enry](https://github.com/go-enry/go-enry) and include files with extensions such as `.kt`, `.kts`, and `.gradle.kts`. We estimate that we collected 97% of the available Kotlin repositories as of February 2024.

## Initial filtering

We conducted full deduplication, using the hash of the file content, as well as near deduplication using the same method as in [The Stack v1.2](https://arxiv.org/pdf/2211.15533). We collapsed each near-duplicate cluster into a single file, taken from the repository with the most stars.

## Detecting permissive licenses

We kept only permissively licensed repositories, based on the license detected by GitHub, falling back to [go-license-detector](https://github.com/src-d/go-license-detector) when GitHub had no licensing information available. The list of permissive licenses used in the dataset can be found [here](https://huggingface.co/datasets/JetBrains/KStack/blob/main/licenses.json).

## Personal and Sensitive Information

To filter out personal information, we applied the same model that was used for The Stack v2: [star-pii](https://arxiv.org/abs/2402.19173).

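The two deduplication steps under "Initial filtering" can be illustrated with a minimal Python sketch. It is not the authors' pipeline: SHA-256 as the content hash, the first-occurrence rule in `exact_dedup`, and the sample records are all assumptions made for illustration, and the near-duplicate clustering itself (done with the method of The Stack v1.2) is not implemented here. `pick_representative` only shows the stated rule of taking the file from the repository with the most stars once a cluster is known.

```python
import hashlib


def content_hash(content: str) -> str:
    # SHA-256 is an assumption; the source only says "hash of file content".
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def exact_dedup(files):
    """Keep one file per distinct content hash (first occurrence wins, by assumption)."""
    seen = set()
    unique = []
    for f in files:
        h = content_hash(f["content"])
        if h not in seen:
            seen.add(h)
            unique.append(f)
    return unique


def pick_representative(cluster):
    """From a cluster of near-duplicates, take the file from the most-starred repository."""
    return max(cluster, key=lambda f: f["stars"])


# Hypothetical records using the `path`, `content`, and `stars` columns.
files = [
    {"path": "a/Main.kt", "content": "fun main() {}", "stars": 3},
    {"path": "b/Main.kt", "content": "fun main() {}", "stars": 50},
    {"path": "c/Util.kt", "content": "fun id(x: Int) = x", "stars": 12},
]

unique = exact_dedup(files)
representative = pick_representative(files[:2])
```

Here `unique` holds one file per distinct content, and `representative` is the copy from the repository with 50 stars.
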
# Column description

The dataset contains the following columns:

- `size`: size of the file in bytes
- `content`: text (content) of the file after removing personally identifiable information
- `repo_id`: GitHub ID of the repository
- `path`: path to the file
- `owner`: repo owner on GitHub
- `name`: repo name on GitHub
- `commit_sha`: hash of the commit from which this revision of the file was taken
- `stars`: number of stars in the repo at the time of collection
- `forks`: number of forks in the repo at the time of collection
- `issues`: number of issues in the repo at the time of collection
- `is_fork`: `true` if the repo is a fork, as defined by GitHub
- `main_language`: main language of the repo, as defined by GitHub
- `languages_distribution`: JSON with the distribution of languages by size in bytes in the repo
- `license`: permissive license of the repository

# Opt-out

If you want your data to be removed from the dataset, or have any other questions, please reach out to Sergey Titov: <sergey.titov@jetbrains.com>
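
As an illustration of how the columns listed under "Column description" might be used together, the following Python sketch selects rows by `stars`, `is_fork`, and the Kotlin share derived from `languages_distribution`. The exact JSON layout of `languages_distribution` is not specified in this card; the sketch assumes a flat mapping from language name to bytes. The helper names, thresholds, and sample rows are hypothetical.

```python
import json


def kotlin_share(row):
    """Fraction of the repository's bytes that are Kotlin, or 0.0 if unknown."""
    # Assumed shape: {"Kotlin": 1200, "Java": 300}; the real layout may differ.
    dist = json.loads(row["languages_distribution"])
    total = sum(dist.values())
    return dist.get("Kotlin", 0) / total if total else 0.0


def select_rows(rows, min_stars=10, min_kotlin_share=0.5):
    """Keep non-fork rows with enough stars and a large enough Kotlin share."""
    return [
        r for r in rows
        if not r["is_fork"]
        and r["stars"] >= min_stars
        and kotlin_share(r) >= min_kotlin_share
    ]


# Hypothetical rows containing only the columns this sketch reads.
rows = [
    {"path": "app/Main.kt", "stars": 25, "is_fork": False,
     "languages_distribution": json.dumps({"Kotlin": 1200, "Java": 300})},
    {"path": "fork/Main.kt", "stars": 40, "is_fork": True,
     "languages_distribution": json.dumps({"Kotlin": 900})},
    {"path": "tiny/Util.kt", "stars": 5, "is_fork": False,
     "languages_distribution": json.dumps({"Kotlin": 100})},
]

selected = select_rows(rows)
```

With these sample rows, only `app/Main.kt` passes all three conditions: the second row is a fork and the third has fewer than 10 stars.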

Provider: maas
Created: 2025-04-30