mvasiliniuc/iva-kotlin-codeint
Source: Hugging Face · Updated: 2023-06-16 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/mvasiliniuc/iva-kotlin-codeint
Resource description:
---
annotations_creators:
- crowdsourced
license: other
language_creators:
- crowdsourced
language:
- code
task_categories:
- text-generation
tags:
- code, kotlin, native Android development
size_categories:
- 100K<n<1M
source_datasets: []
pretty_name: iva-kotlin-codeint-raw
task_ids:
- language-modeling
---
# IVA Kotlin GitHub Code Dataset
## Dataset Description
This is the raw IVA Kotlin dataset extracted from GitHub.
It contains uncurated Kotlin files gathered to train a code generation model.
The dataset consists of 464,215 Kotlin code files from GitHub, totaling ~361 MB of gzip-compressed data.
The dataset was created from the public GitHub dataset on Google BigQuery.
### How to use it
To download the full dataset:
```python
from datasets import load_dataset
dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint', split='train')
```
To load the dataset and inspect a single record:
```python
from datasets import load_dataset

# Load the train split and print one record.
dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint', split='train')
print(dataset[723])

# OUTPUT:
# {
#   "repo_name":"nemerosa/ontrack",
#   "path":"ontrack-extension-notifications/src/main/java/net/nemerosa/ontrack/extension/notifications/webhooks/WebhookController.kt",
#   "copies":"1",
#   "size":"3248",
#   "content":"...@RestController\n@RequestMapping(\"/extension/notifications/webhook\")\nclass WebhookController(\n private val webhookAdminService: WebhookAdminService,\n private val webhookExecutionService: ",
#   "license":"mit"
# }
```
## Data Structure
### Data Fields
|Field|Type|Description|
|---|---|---|
|repo_name|string|name of the GitHub repository|
|path|string|path of the file in GitHub repository|
|copies|string|number of occurrences in dataset|
|content|string|content of source file|
|size|string|size of the source file in bytes|
|license|string|license of GitHub repository|
### Instance
```json
{
"repo_name":"nemerosa/ontrack",
"path":"ontrack-extension-notifications/src/main/java/net/nemerosa/ontrack/extension/notifications/webhooks/WebhookController.kt",
"copies":"1",
"size":"3248",
"content":"...@RestController\n@RequestMapping(\"/extension/notifications/webhook\")\nclass WebhookController(\n private val webhookAdminService: WebhookAdminService,\n private val webhookExecutionService: ",
"license":"mit"
}
```
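The instance above stores `copies` and `size` as strings, so numeric comparisons need a conversion first. The following is a minimal sketch of one way to do that. `KotlinFile` and `parse_record` are hypothetical names introduced here for illustration and are not part of the dataset or of the `datasets` library, and the sample `content` is shortened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KotlinFile:
    """Typed view of one dataset record (hypothetical helper, not part of the dataset)."""
    repo_name: str
    path: str
    copies: int
    size: int
    content: str
    license: str


def parse_record(record: dict) -> KotlinFile:
    """Convert the string-typed `copies` and `size` fields to integers."""
    return KotlinFile(
        repo_name=record["repo_name"],
        path=record["path"],
        copies=int(record["copies"]),
        size=int(record["size"]),
        content=record["content"],
        license=record["license"],
    )


# Record shaped like the instance above; the content is shortened for illustration.
sample = {
    "repo_name": "nemerosa/ontrack",
    "path": "ontrack-extension-notifications/src/main/java/net/nemerosa/ontrack/extension/notifications/webhooks/WebhookController.kt",
    "copies": "1",
    "size": "3248",
    "content": "...@RestController\n...",
    "license": "mit",
}

parsed = parse_record(sample)
print(parsed.repo_name, parsed.size, parsed.copies)
```

Keeping the conversion in one place means later filters (for example on file size) can compare integers rather than strings.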
## Languages
The dataset contains only Kotlin files.
```json
{
"Kotlin": [".kt"]
}
```
## Licenses
Each entry in the dataset contains the associated license. The following is a list of licenses involved and their occurrences.
```json
{
"agpl-3.0": 9146,
"apache-2.0": 272388,
"artistic-2.0": 219,
"bsd-2-clause": 896,
"bsd-3-clause": 12328,
"cc0-1.0": 411,
"epl-1.0": 2111,
"gpl-2.0": 11080,
"gpl-3.0": 48911,
"isc": 997,
"lgpl-2.1": 297,
"lgpl-3.0": 7749,
"mit": 92540,
"mpl-2.0": 3386,
"unlicense": 1756
}
```
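As a quick sanity check, the counts above can be summed and turned into shares. This is a small sketch using only the numbers listed in this section; the `share` helper is a hypothetical name added for illustration.

```python
# License counts copied from the list above.
license_counts = {
    "agpl-3.0": 9146,
    "apache-2.0": 272388,
    "artistic-2.0": 219,
    "bsd-2-clause": 896,
    "bsd-3-clause": 12328,
    "cc0-1.0": 411,
    "epl-1.0": 2111,
    "gpl-2.0": 11080,
    "gpl-3.0": 48911,
    "isc": 997,
    "lgpl-2.1": 297,
    "lgpl-3.0": 7749,
    "mit": 92540,
    "mpl-2.0": 3386,
    "unlicense": 1756,
}

total = sum(license_counts.values())


def share(license_id: str) -> float:
    """Fraction of files that carry the given license."""
    return license_counts[license_id] / total


# Print licenses from most to least frequent with their share of all files.
for name, count in sorted(license_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{count:>8}  {count / total:6.1%}")
```

The counts sum to 464,215, which matches the number of files reported for the dataset, so each file appears under exactly one license in this list.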
## Dataset Statistics
```json
{
"Total size": "~361 MB",
"Number of files": 464215,
"Number of files under 500 bytes": 99845,
"Average file size in bytes": 3252
}
```
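Two derived figures help put these statistics in context: the share of very small files, and the approximate uncompressed volume implied by the average file size. The sketch below uses only the numbers above. The size estimate is rough because the reported average is rounded, and the ~361 MB total refers to the gzip-compressed export.

```python
# Figures copied from the statistics above.
number_of_files = 464215
files_under_500_bytes = 99845
average_file_size_bytes = 3252

# Share of files smaller than 500 bytes.
small_file_share = files_under_500_bytes / number_of_files

# Rough uncompressed volume implied by the (rounded) average file size.
estimated_total_bytes = number_of_files * average_file_size_bytes

print(f"Files under 500 bytes: {small_file_share:.1%}")
print(f"Estimated uncompressed size: {estimated_total_bytes / 1e9:.2f} GB")
```

Roughly one file in five is under 500 bytes, which may matter when deciding on a minimum-size filter for training data.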
## Dataset Creation
The dataset was created using the public GitHub dataset on Google BigQuery:
https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code
The following steps were used to gather the data:
1. Creation of a dataset and a table in a Google BigQuery project.
2. Creation of a bucket in Google Cloud Storage.
3. Creation of a query in the Google BigQuery project.
4. Running the query with the setting to write the results to the dataset and table created in step 1.
5. Exporting the resulting table into the bucket created in step 2, in JSON format with gzip compression.
These steps produced the following results:
* 2.7 TB processed
* 464,215 rows/files extracted
* 1.46 GB total logical bytes
* 7 json.gz files totaling 361 MB
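The exported `json.gz` files can be read without any special tooling. The sketch below assumes the export is newline-delimited JSON (one record per line), which is the JSON format BigQuery uses for exports. The file name, the `read_export` helper, and the two sample rows are hypothetical and exist only so the example runs without access to the real files.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def read_export(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed, newline-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Build a tiny stand-in export (made-up rows) shaped like the dataset records.
rows = [
    {"repo_name": "example/repo", "path": "src/Main.kt", "copies": "1",
     "size": "26", "content": "fun main() = println(\"hi\")", "license": "mit"},
    {"repo_name": "example/other", "path": "app/App.kt", "copies": "2",
     "size": "12", "content": "class App {}", "license": "apache-2.0"},
]
tmp_dir = tempfile.mkdtemp()
export_path = os.path.join(tmp_dir, "export-000000000000.json.gz")  # hypothetical file name
with gzip.open(export_path, "wt", encoding="utf-8") as handle:
    for row in rows:
        handle.write(json.dumps(row) + "\n")

records = list(read_export(export_path))
print(len(records), records[0]["path"])
```

Reading line by line keeps memory use flat, which is convenient when iterating over all seven export files in turn.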
The SQL Query used is:
```sql
SELECT
f.repo_name, f.path, c.copies, c.size, c.content, l.license
FROM
  (
    SELECT f.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY path DESC) AS seqnum
    FROM `bigquery-public-data.github_repos.files` AS f
  ) f
JOIN
`bigquery-public-data.github_repos.contents` AS c
ON
f.id = c.id AND seqnum=1
JOIN
`bigquery-public-data.github_repos.licenses` AS l
ON
f.repo_name = l.repo_name
WHERE
NOT c.binary AND ((f.path LIKE '%.kt') AND (c.size BETWEEN 0 AND 1048575))
```
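The query's logic can be easier to follow in procedural form. The sketch below mimics it in plain Python on made-up rows: for each blob `id` it keeps the row whose path sorts last (the `ROW_NUMBER() ... seqnum=1` step), then applies the joins and the `WHERE` filter. The function name and sample data are hypothetical, the sketch assumes one `contents` row per `id`, and Python string ordering may differ from BigQuery's in edge cases.

```python
MAX_SIZE = 1048575  # upper bound used in the query (1 MiB minus 1 byte)


def select_kotlin_files(files, contents, licenses):
    """Mimic the query: one path per blob id, joined with contents and licenses."""
    # ROW_NUMBER() OVER (PARTITION BY id ORDER BY path DESC) with seqnum = 1:
    # keep, for each blob id, the row whose path sorts last (first seen wins ties).
    chosen = {}
    for row in files:
        best = chosen.get(row["id"])
        if best is None or row["path"] > best["path"]:
            chosen[row["id"]] = row

    results = []
    for file_row in chosen.values():
        content = contents.get(file_row["id"])
        if content is None:  # inner join: skip ids without a contents row
            continue
        # WHERE NOT c.binary AND f.path LIKE '%.kt' AND c.size BETWEEN 0 AND 1048575
        if content["binary"] or not file_row["path"].endswith(".kt"):
            continue
        if not 0 <= content["size"] <= MAX_SIZE:
            continue
        # Inner join on repo_name: one output row per matching license row.
        for license_row in licenses:
            if license_row["repo_name"] == file_row["repo_name"]:
                results.append({
                    "repo_name": file_row["repo_name"],
                    "path": file_row["path"],
                    "copies": content["copies"],
                    "size": content["size"],
                    "content": content["content"],
                    "license": license_row["license"],
                })
    return results


# Made-up rows: blob "a1" appears in two repositories, "b2" is not Kotlin, "c3" is too large.
files = [
    {"id": "a1", "repo_name": "org/app", "path": "src/Main.kt"},
    {"id": "a1", "repo_name": "fork/app", "path": "app/Main.kt"},
    {"id": "b2", "repo_name": "org/app", "path": "README.md"},
    {"id": "c3", "repo_name": "org/app", "path": "src/Big.kt"},
]
contents = {
    "a1": {"size": 13, "binary": False, "copies": 2, "content": "fun main() {}"},
    "b2": {"size": 8, "binary": False, "copies": 1, "content": "# readme"},
    "c3": {"size": 2_000_000, "binary": False, "copies": 1, "content": "..."},
}
licenses = [
    {"repo_name": "org/app", "license": "mit"},
    {"repo_name": "fork/app", "license": "mit"},
]

selected = select_kotlin_files(files, contents, licenses)
print(selected)
```

Because the row number is computed before the `.kt` filter, a blob is dropped if its last-sorting path is not a Kotlin file, even when another path for the same blob is. The sketch reproduces that behavior.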
## Data Splits
The dataset only contains a train split.
A curated version of this dataset has been split across multiple repositories:
* Clean Version: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean
* Clean Version Train: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-train
* Clean Version Valid: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-valid
# Considerations for Using the Data
The dataset comprises source code from various repositories and may contain harmful or biased code, as well as sensitive information such as passwords or usernames.
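Before training on the raw files, it can be worth flagging records that look like they contain hard-coded credentials. The following is a deliberately simple sketch, not a complete secret scanner: the pattern and the `find_possible_secrets` helper are hypothetical, only match credential-like names assigned a string literal on the same line, and will miss many cases (for example declarations with an explicit type).

```python
import re
from typing import List, Tuple

# Hypothetical, simplistic pattern: a name containing a credential-like word,
# followed by "=" and a non-empty double-quoted string literal.
SECRET_PATTERN = re.compile(
    r'\b(\w*(?:password|passwd|secret|api_?key|token)\w*)\s*=\s*"([^"\n]+)"',
    re.IGNORECASE,
)


def find_possible_secrets(content: str) -> List[Tuple[int, str]]:
    """Return (line number, variable name) pairs for suspicious assignments."""
    hits = []
    for line_number, line in enumerate(content.splitlines(), start=1):
        for match in SECRET_PATTERN.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits


# Made-up Kotlin snippet used only to exercise the pattern.
sample_content = (
    "class Config {\n"
    '    val dbPassword = "hunter2"\n'
    "    val timeout = 30\n"
    '    val apiKey = "abc123"\n'
    "}\n"
)

print(find_possible_secrets(sample_content))
```

Flagged files could be reviewed or excluded. A dedicated secret-scanning tool would be a more thorough choice for production use.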
# Additional Information
## Dataset Curators
[mircea.dev@icloud.com](mailto:mircea.dev@icloud.com)
## Licensing Information
* The license of this open-source dataset is: other.
* The dataset is gathered from open-source repositories on [GitHub using BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code).
* The license of each entry is given in its `license` field.
## Citation Information
```bibtex
@misc {mircea_vasiliniuc_2023,
author = { {Mircea Vasiliniuc} },
title = { iva-kotlin-codeint (Revision 1af5124) },
year = 2023,
url = { https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint },
doi = { 10.57967/hf/0779 },
publisher = { Hugging Face }
}
```
Provider:
mvasiliniuc
Summary of original information
Dataset overview
Dataset name
- Name: IVA Kotlin GitHub Code Dataset
- Alias: iva-kotlin-codeint-raw
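Dataset description
- Source: unfiltered Kotlin code files extracted from GitHub.
- Purpose: training a code generation model.
- Contents: 464,215 Kotlin code files, about 361 MB of data in total.
Dataset characteristics
- Language: Kotlin files only.
- Task category: text generation.
- Tags: code, kotlin, native Android development.
- Size category: 100K<n<1M.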
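Data structure
- Data fields:
  - repo_name: name of the GitHub repository
  - path: path of the file in the GitHub repository
  - copies: number of occurrences in the dataset
  - size: size of the source file in bytes
  - content: content of the source file
  - license: license of the GitHub repository
Dataset statistics
- Total size: about 361 MB
- Number of files: 464,215
- Average file size: 3252 bytes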
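Licenses
- Dataset license: other
- License information for each file is included in the data.
Usage
- Python example for downloading the dataset: call `load_dataset('mvasiliniuc/iva-kotlin-codeint', split='train')` after `from datasets import load_dataset`.
Dataset creation
- Extracted from the GitHub dataset using Google BigQuery.
- The processing steps include creating a dataset and a table in Google BigQuery, extracting the data with a specific SQL query, and exporting the results to Google Cloud Storage.
Data splits
- The dataset contains only a train split.
Considerations
- The dataset may contain harmful or biased code, as well as sensitive information such as passwords or usernames.
Licensing information
- The license of the dataset is "other".
- Each file in the dataset carries its corresponding license information.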



