mvasiliniuc/iva-kotlin-codeint

Name: mvasiliniuc/iva-kotlin-codeint
Creator: mvasiliniuc
Published: 2023-06-16 06:56:58
License: no description available

Source: Hugging Face. Updated 2023-06-16. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/mvasiliniuc/iva-kotlin-codeint

Resource overview:

---
annotations_creators:
- crowdsourced
license: other
language_creators:
- crowdsourced
language:
- code
task_categories:
- text-generation
tags:
- code, kotlin, native Android development
size_categories:
- 100K<n<1M
source_datasets: []
pretty_name: iva-kotlin-codeint-raw
task_ids:
- language-modeling
---

# IVA Kotlin GitHub Code Dataset

## Dataset Description

This is the raw IVA Kotlin dataset extracted from GitHub. It contains uncurated Kotlin files gathered to train a code generation model.

The dataset consists of 464,215 Kotlin code files from GitHub, totaling ~361 MB of data. It was created from the public GitHub dataset on Google BigQuery.

### How to use it

To download the full dataset:

```python
from datasets import load_dataset

dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint', split='train')
```

To inspect a single record:

```python
from datasets import load_dataset

dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint', split='train')
print(dataset[723])

#OUTPUT:
{
   "repo_name":"nemerosa/ontrack",
   "path":"ontrack-extension-notifications/src/main/java/net/nemerosa/ontrack/extension/notifications/webhooks/WebhookController.kt",
   "copies":"1",
   "size":"3248",
   "content":"...@RestController\n@RequestMapping(\"/extension/notifications/webhook\")\nclass WebhookController(\n    private val webhookAdminService: WebhookAdminService,\n    private val webhookExecutionService: ",
   "license":"mit"
}
```

## Data Structure

### Data Fields

|Field|Type|Description|
|---|---|---|
|repo_name|string|name of the GitHub repository|
|path|string|path of the file in the GitHub repository|
|copies|string|number of occurrences in the dataset|
|content|string|content of the source file|
|size|string|size of the source file in bytes|
|license|string|license of the GitHub repository|

### Instance

```json
{
   "repo_name":"nemerosa/ontrack",
   "path":"ontrack-extension-notifications/src/main/java/net/nemerosa/ontrack/extension/notifications/webhooks/WebhookController.kt",
   "copies":"1",
   "size":"3248",
   "content":"...@RestController\n@RequestMapping(\"/extension/notifications/webhook\")\nclass WebhookController(\n    private val webhookAdminService: WebhookAdminService,\n    private val webhookExecutionService: ",
   "license":"mit"
}
```

## Languages

The dataset contains only Kotlin files.

```json
{
    "Kotlin": [".kt"]
}
```

## Licenses

Each entry in the dataset records the license of its source repository. The licenses involved and their occurrence counts are:

```json
{
   "agpl-3.0": 9146,
   "apache-2.0": 272388,
   "artistic-2.0": 219,
   "bsd-2-clause": 896,
   "bsd-3-clause": 12328,
   "cc0-1.0": 411,
   "epl-1.0": 2111,
   "gpl-2.0": 11080,
   "gpl-3.0": 48911,
   "isc": 997,
   "lgpl-2.1": 297,
   "lgpl-3.0": 7749,
   "mit": 92540,
   "mpl-2.0": 3386,
   "unlicense": 1756
}
```

## Dataset Statistics

```json
{
    "Total size": "~361 MB",
    "Number of files": 464215,
    "Number of files under 500 bytes": 99845,
    "Average file size in bytes": 3252
}
```

## Dataset Creation

The dataset was created using Google BigQuery for GitHub: https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code

The data was gathered in the following steps:

1. Create a dataset and a table in a Google BigQuery project.
2. Create a bucket in Google Cloud Storage.
3. Create a query in the Google BigQuery project.
4. Run the query, with the results written to the dataset and table created in step 1.
5. Export the resulting dataset to the bucket created in step 2, in JSON format with gzip compression.

These steps produced the following results:

* 2.7 TB processed,
* 464,215 rows/files extracted,
* 1.46 GB total logical bytes,
* 7 json.gz files totaling 361 MB.

The SQL query used is:

```sql
SELECT
  f.repo_name, f.path, c.copies, c.size, c.content, l.license
FROM
  (select f.*, row_number() over (partition by id order by path desc) as seqnum
   from `bigquery-public-data.github_repos.files` AS f) f
JOIN
  `bigquery-public-data.github_repos.contents` AS c
ON
  f.id = c.id AND seqnum=1
JOIN
  `bigquery-public-data.github_repos.licenses` AS l
ON
  f.repo_name = l.repo_name
WHERE
  NOT c.binary AND ((f.path LIKE '%.kt') AND (c.size BETWEEN 0 AND 1048575))
```

## Data Splits

The dataset contains only a train split.

The curated version of this dataset was split across multiple repositories:

* Clean Version: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean
* Clean Version Train: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-train
* Clean Version Valid: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-valid

# Considerations for Using the Data

The dataset comprises source code from various repositories. It may contain harmful or biased code, as well as sensitive information such as passwords or usernames.

# Additional Information

## Dataset Curators

[mircea.dev@icloud.com](mailto:mircea.dev@icloud.com)

## Licensing Information

* The license of this open-source dataset is: other.
* The dataset is gathered from open-source repositories on [GitHub using BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code).
* The license of each entry is given in the corresponding license column.

## Citation Information

```bibtex
@misc {mircea_vasiliniuc_2023,
	author       = { {Mircea Vasiliniuc} },
	title        = { iva-kotlin-codeint (Revision 1af5124) },
	year         = 2023,
	url          = { https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint },
	doi          = { 10.57967/hf/0779 },
	publisher    = { Hugging Face }
}
```

Provider:

mvasiliniuc

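Summary of original information

Dataset overview

Dataset name

Name: IVA Kotlin GitHub Code Dataset
Alias: iva-kotlin-codeint-raw

Dataset description

Source: uncurated Kotlin code files extracted from GitHub.
Purpose: training a code generation model.
Contents: 464,215 Kotlin code files, about 361 MB in total.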

Dataset characteristics

Language: Kotlin files only.
Task category: text generation.
Tags: code, kotlin, native Android development.
Size category: 100K<n<1M.

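Data structure

Data fields:
- repo_name: name of the GitHub repository
- path: path of the file in the GitHub repository
- copies: number of occurrences in the dataset
- size: size of the source file in bytes
- content: content of the source file
- license: license of the GitHub repository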
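
The sample record in the dataset card shows `copies` and `size` as strings (`"1"`, `"3248"`), so numeric work needs a conversion first. A minimal sketch, assuming every record carries both fields as decimal strings; `normalize_record` is a hypothetical helper name, not part of the dataset or of any library:

```python
from typing import Any, Dict


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a record with `copies` and `size` converted to int."""
    normalized = dict(record)  # shallow copy, the input record is left untouched
    for field in ("copies", "size"):
        normalized[field] = int(record[field])
    return normalized


# Record modeled on the instance shown in the dataset card (content elided).
sample = {
    "repo_name": "nemerosa/ontrack",
    "path": "ontrack-extension-notifications/src/main/java/net/nemerosa/ontrack/extension/notifications/webhooks/WebhookController.kt",
    "copies": "1",
    "size": "3248",
    "content": "...",
    "license": "mit",
}

normalized = normalize_record(sample)
print(normalized["size"] + 1)  # integer arithmetic now works: 3249
```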

Dataset statistics

Total size: about 361 MB
Number of files: 464,215
Average file size: 3,252 bytes
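
Figures like these could be recomputed from the `size` field. The sketch below is one way to do it, not the procedure the author used (which the card does not describe); `summarize_sizes` is a hypothetical helper, and the 500-byte threshold mirrors the "files under 500 bytes" entry in the card's statistics. Note that the card reports 361 MB for the gzip-compressed export, so a sum of `size` values would not be expected to match that number.

```python
from typing import Any, Dict, Iterable


def summarize_sizes(records: Iterable[Dict[str, Any]], small_threshold: int = 500) -> Dict[str, int]:
    """Count files, total bytes, small files, and the (floored) average file size."""
    count = total = small = 0
    for record in records:
        size = int(record["size"])  # `size` is stored as a string
        count += 1
        total += size
        if size < small_threshold:
            small += 1
    return {
        "files": count,
        "total_bytes": total,
        "files_under_threshold": small,
        "average_bytes": total // count if count else 0,
    }


# Three made-up records, for illustration only.
demo = [{"size": "3248"}, {"size": "120"}, {"size": "700"}]
stats = summarize_sizes(demo)
print(stats)
```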

Licenses

Dataset license: other
Each file's license is recorded in the data.
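
Because every record carries a `license` value, records can be tallied or filtered by license before use. A minimal sketch: the `ALLOWED` set is an illustrative assumption, not a recommendation from the dataset author, and both helper names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, FrozenSet, Iterable, List

# Assumption for illustration: a caller-chosen set of license identifiers to keep.
ALLOWED: FrozenSet[str] = frozenset({"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"})


def count_licenses(records: Iterable[Dict[str, Any]]) -> Counter:
    """Tally how many records fall under each license identifier."""
    return Counter(record["license"] for record in records)


def keep_allowed(records: Iterable[Dict[str, Any]], allowed: FrozenSet[str] = ALLOWED) -> List[Dict[str, Any]]:
    """Keep only the records whose license is in the allowed set."""
    return [record for record in records if record["license"] in allowed]


# Made-up records, for illustration only.
demo = [
    {"path": "A.kt", "license": "mit"},
    {"path": "B.kt", "license": "gpl-3.0"},
    {"path": "C.kt", "license": "mit"},
]
license_counts = count_licenses(demo)
kept_paths = [record["path"] for record in keep_allowed(demo)]
print(license_counts, kept_paths)
```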

Usage

Python example for downloading the dataset: `from datasets import load_dataset`, then `dataset = load_dataset("mvasiliniuc/iva-kotlin-codeint", split="train")`.
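
For a quick look without downloading every file first, the `datasets` library also supports `streaming=True`. A hedged sketch: `open_stream` and `preview` are hypothetical helper names, and `open_stream` needs the `datasets` package and network access, so only `preview` is exercised below, on made-up records.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Tuple


def open_stream(name: str = "mvasiliniuc/iva-kotlin-codeint"):
    """Open the train split lazily; requires `datasets` and network access."""
    from datasets import load_dataset  # imported here so `preview` works without it

    return load_dataset(name, split="train", streaming=True)


def preview(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Tuple[str, str]]:
    """Return (repo_name, path) for the first n records of any iterable."""
    return [(record["repo_name"], record["path"]) for record in islice(records, n)]


# Made-up records standing in for the stream.
demo = [
    {"repo_name": "example/one", "path": "src/A.kt"},
    {"repo_name": "example/two", "path": "src/B.kt"},
]
first = preview(demo, n=1)
print(first)
```

With network access, `preview(open_stream())` would read only the first few records rather than the whole split.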

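Dataset creation

Extracted from the public GitHub dataset using Google BigQuery.
The process consisted of creating a dataset and a table in Google BigQuery, running a specific SQL query to extract the data, and exporting the results to Google Cloud Storage.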
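
The card states that the export was written as JSON with gzip compression; BigQuery JSON exports are newline-delimited, one object per line. Under that assumption, a shard could be read as sketched below. `read_export` and the shard file name are hypothetical, and the demo writes its own small file rather than touching the real export.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def read_export(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Round-trip demo with made-up records and a hypothetical shard name.
records = [
    {"repo_name": "example/repo", "path": "Main.kt", "size": "12"},
    {"repo_name": "example/repo", "path": "Util.kt", "size": "34"},
]
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-000.json.gz")
    with gzip.open(shard, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    loaded = list(read_export(shard))  # consume the generator before the file is removed

print(loaded == records)
```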

Data splits

The dataset contains only a train split.
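
Since only a train split ships, anyone who needs a held-out set has to carve one out. One common approach is a deterministic split keyed on `repo_name`, so that all files from a repository land on the same side. This is a sketch of that general technique, not the method used to build the author's clean train/valid repositories; `assign_split` and the 5% default are assumptions.

```python
import hashlib


def assign_split(repo_name: str, valid_percent: int = 5) -> str:
    """Map a repository name to 'train' or 'valid' using a stable hash bucket."""
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return "valid" if bucket < valid_percent else "train"


# The same repository always maps to the same split, across runs and machines.
first_call = assign_split("nemerosa/ontrack")
second_call = assign_split("nemerosa/ontrack")
print(first_call == second_call)
```

Hashing the repository name rather than the file path keeps near-duplicate files from one project from leaking across the split boundary.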

Considerations

The dataset may contain harmful or biased code, as well as sensitive information such as passwords or usernames.
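
A coarse first pass for hard-coded credentials can be done with a regular expression over each file's `content`. The pattern below is an illustrative assumption, will miss many cases and flag some harmless ones, and is no substitute for a dedicated secret scanner; `flag_suspicious_lines` is a hypothetical helper.

```python
import re
from typing import List

# Assumption: a credential-like identifier assigned a quoted literal of 4+ characters.
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|passwd|secret|api[_-]?key|token)\b\s*[:=]\s*[\"'][^\"'\s]{4,}[\"']"
)


def flag_suspicious_lines(content: str) -> List[int]:
    """Return 1-based line numbers whose text matches the credential pattern."""
    return [
        number
        for number, line in enumerate(content.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


# Made-up Kotlin snippet, for illustration only.
kotlin_source = 'val name = "Alice"\nval password = "hunter2!"\nprivate val passwordField: String\n'
flagged = flag_suspicious_lines(kotlin_source)
print(flagged)
```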

Licensing information

The dataset's license is "other".
Each file in the dataset carries its corresponding license information.

Citation information

@misc {mircea_vasiliniuc_2023,
    author    = { {Mircea Vasiliniuc} },
    title     = { iva-kotlin-codeint (Revision 1af5124) },
    year      = 2023,
    url       = { https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint },
    doi       = { 10.57967/hf/0779 },
    publisher = { Hugging Face }
}
