mvasiliniuc/iva-kotlin-codeint-clean-train

Name: mvasiliniuc/iva-kotlin-codeint-clean-train
Creator: mvasiliniuc
Published: 2023-06-15 14:49:17
License: no description provided

Source: Hugging Face. Updated 2023-06-15. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/mvasiliniuc/iva-kotlin-codeint-clean-train


Resource description:

---
annotations_creators:
- crowdsourced
license: other
language_creators:
- crowdsourced
language:
- code
task_categories:
- text-generation
tags:
- code, kotlin, native Android development, curated, training
size_categories:
- 100K<n<1M
source_datasets: []
pretty_name: iva-kotlin-codeint-clean
task_ids:
- language-modeling
---

# IVA Kotlin GitHub Code Dataset

## Dataset Description

This is the curated train split of the IVA Kotlin dataset extracted from GitHub. It contains curated Kotlin files gathered for training a code generation model.

The dataset consists of 383380 Kotlin code files from GitHub. [Here is the unsliced curated dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean) and [here is the raw dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint).

### How to use it

To download the full dataset:

```python
from datasets import load_dataset

# Downloads the train split from the Hugging Face Hub.
dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint-clean-train', split='train')
```

## Data Structure

### Data Fields

|Field|Type|Description|
|---|---|---|
|repo_name|string|Name of the GitHub repository.|
|path|string|Path of the file in the GitHub repository.|
|copies|string|Number of occurrences in the dataset.|
|content|string|Content of the source file.|
|size|string|Size of the source file in bytes.|
|license|string|License of the GitHub repository.|
|hash|string|Hash of the content field.|
|line_mean|number|Mean line length of the content.|
|line_max|number|Max line length of the content.|
|alpha_frac|number|Fraction between mean and max line length of content.|
|ratio|number|Character/token ratio of the file with tokenizer.|
|autogenerated|boolean|True if the content is autogenerated, as detected by looking for keywords in the first few lines of the file.|
|config_or_test|boolean|True if the content is a configuration file or a unit test.|
|has_no_keywords|boolean|True if the file has none of the keywords of the Kotlin programming language.|
|has_few_assignments|boolean|True if the file uses the symbol '=' fewer than `minimum` times.|

### Instance

```json
{
  "repo_name": "oboenikui/UnivCoopFeliCaReader",
  "path": "app/src/main/java/com/oboenikui/campusfelica/ScannerActivity.kt",
  "copies": "1",
  "size": "5635",
  "content": "....",
  "license": "apache-2.0",
  "hash": "e88cfd99346cbef640fc540aac3bf20b",
  "line_mean": 37.8620689655,
  "line_max": 199,
  "alpha_frac": 0.5724933452,
  "ratio": 5.0222816399,
  "autogenerated": false,
  "config_or_test": false,
  "has_no_keywords": false,
  "has_few_assignments": false
}
```

## Languages

The dataset contains only Kotlin files.

```json
{
  "Kotlin": [".kt"]
}
```

## Licenses

Each entry in the dataset contains the associated license. The following is a list of the licenses involved and their occurrences.

```json
{
  "agpl-3.0": 3209,
  "apache-2.0": 90782,
  "artistic-2.0": 130,
  "bsd-2-clause": 380,
  "bsd-3-clause": 3584,
  "cc0-1.0": 155,
  "epl-1.0": 792,
  "gpl-2.0": 4432,
  "gpl-3.0": 19816,
  "isc": 345,
  "lgpl-2.1": 118,
  "lgpl-3.0": 2689,
  "mit": 31470,
  "mpl-2.0": 1444,
  "unlicense": 654
}
```

## Dataset Statistics

```json
{
  "Total size": "~207 MB",
  "Number of files": 160000,
  "Number of files under 500 bytes": 2957,
  "Average file size in bytes": 5199
}
```

## Curation Process

See [the unsliced curated dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean) for more details.

## Data Splits

The dataset contains only a train split. For the validation and unsliced versions, see the following links:

* Clean Version Unsliced: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean
* Clean Version Valid: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-valid

# Considerations for Using the Data

The dataset comprises source code from various repositories. It may contain harmful or biased code, as well as sensitive information such as passwords or usernames.

Provider:

mvasiliniuc

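Summary of original information

Dataset overview

Dataset name

iva-kotlin-codeint-clean

Dataset description

This dataset is the curated train split of the IVA Kotlin dataset extracted from GitHub. It contains 383,380 curated Kotlin code files intended for training a code generation model.

How to use the dataset

To download the full dataset, run `from datasets import load_dataset` followed by `dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint-clean-train', split='train')`.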

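Data structure

Data fields:
- repo_name: string, name of the GitHub repository
- path: string, path of the file in the GitHub repository
- copies: string, number of occurrences in the dataset
- content: string, content of the source file
- size: string, size of the source file in bytes
- license: string, license of the GitHub repository
- hash: string, hash of the content field
- line_mean: number, mean line length of the content
- line_max: number, max line length of the content
- alpha_frac: number, fraction between mean and max line length of the content
- ratio: number, character/token ratio of the file
- autogenerated: boolean, True if the content is autogenerated, as detected by looking for keywords in the first few lines of the file
- config_or_test: boolean, True if the content is a configuration file or a unit test
- has_no_keywords: boolean, True if the file has none of the keywords of the Kotlin programming language
- has_few_assignments: boolean, True if the file uses the symbol = fewer than minimum times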
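The field list above can be exercised on a single record without downloading anything. The following is a minimal sketch, assuming the example instance published in the dataset card; the names `QUALITY_FLAGS`, `keep_record`, `size_in_bytes`, and the `max_line_length` threshold are hypothetical and not part of the dataset or of any library. It shows that `size` and `copies` are stored as strings and need conversion, and how the four boolean flags could drive a filter.

```python
# Example record copied from the dataset card's "Instance" section.
record = {
    "repo_name": "oboenikui/UnivCoopFeliCaReader",
    "path": "app/src/main/java/com/oboenikui/campusfelica/ScannerActivity.kt",
    "copies": "1",
    "size": "5635",
    "content": "....",
    "license": "apache-2.0",
    "hash": "e88cfd99346cbef640fc540aac3bf20b",
    "line_mean": 37.8620689655,
    "line_max": 199,
    "alpha_frac": 0.5724933452,
    "ratio": 5.0222816399,
    "autogenerated": False,
    "config_or_test": False,
    "has_no_keywords": False,
    "has_few_assignments": False,
}

# The four boolean quality flags described in the field list.
QUALITY_FLAGS = (
    "autogenerated",
    "config_or_test",
    "has_no_keywords",
    "has_few_assignments",
)


def size_in_bytes(rec):
    """Convert the string-typed `size` field to an integer."""
    return int(rec["size"])


def keep_record(rec, max_line_length=1000):
    """Hypothetical filter: reject flagged files and files with very long lines.

    The threshold of 1000 characters is an illustrative assumption.
    """
    if any(rec[flag] for flag in QUALITY_FLAGS):
        return False
    return rec["line_max"] <= max_line_length


print(keep_record(record), size_in_bytes(record))
```

With the `datasets` library, a function of this shape can be passed to `dataset.filter(keep_record)`. Whether any record in this curated split still has a flag set is not stated in the card, so the filter may turn out to be a no-op here.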

Languages

The dataset contains only Kotlin files.

Licenses

Each entry in the dataset contains the associated license. The licenses involved and their occurrence counts are:
- agpl-3.0: 3209
- apache-2.0: 90782
- artistic-2.0: 130
- bsd-2-clause: 380
- bsd-3-clause: 3584
- cc0-1.0: 155
- epl-1.0: 792
- gpl-2.0: 4432
- gpl-3.0: 19816
- isc: 345
- lgpl-2.1: 118
- lgpl-3.0: 2689
- mit: 31470
- mpl-2.0: 1444
- unlicense: 654
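The license counts above can be summed to see how many files a license-based selection would retain. This is a small sketch under stated assumptions: the counts are copied from the list above, while the `ALLOWED` set is a hypothetical allowlist chosen only for illustration and is not a recommendation from the dataset author.

```python
# License counts as published in the dataset card.
license_counts = {
    "agpl-3.0": 3209,
    "apache-2.0": 90782,
    "artistic-2.0": 130,
    "bsd-2-clause": 380,
    "bsd-3-clause": 3584,
    "cc0-1.0": 155,
    "epl-1.0": 792,
    "gpl-2.0": 4432,
    "gpl-3.0": 19816,
    "isc": 345,
    "lgpl-2.1": 118,
    "lgpl-3.0": 2689,
    "mit": 31470,
    "mpl-2.0": 1444,
    "unlicense": 654,
}

# Hypothetical allowlist; adjust to the licenses your project accepts.
ALLOWED = {"apache-2.0", "mit", "bsd-2-clause", "bsd-3-clause", "isc"}

total_files = sum(license_counts.values())
allowed_files = sum(n for name, n in license_counts.items() if name in ALLOWED)

print(total_files)    # 160000
print(allowed_files)  # 126561
```

The total of 160000 matches the file count reported in the dataset statistics. On loaded data, the same allowlist could be applied per record by checking the `license` field.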

Dataset statistics

Total size: about 207 MB
Number of files: 160,000
Number of files under 500 bytes: 2957
Average file size: 5199 bytes
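Two derived figures follow from the statistics above by simple arithmetic. The sketch below uses only the published numbers; the dictionary keys are hypothetical names. The product of file count and average size is an estimate of the text volume only. The card does not say how the listed total size of about 207 MB was measured, so the two figures are not expected to match.

```python
# Figures copied from the dataset statistics.
stats = {
    "number_of_files": 160000,
    "files_under_500_bytes": 2957,
    "average_file_size_bytes": 5199,
}

# Share of very small files in the split.
small_file_share = stats["files_under_500_bytes"] / stats["number_of_files"]

# Estimated volume of source text implied by count x average size.
estimated_total_bytes = (
    stats["number_of_files"] * stats["average_file_size_bytes"]
)

print(f"{small_file_share:.2%}")  # 1.85%
print(estimated_total_bytes)      # 831840000
```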

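Data splits

The dataset contains only a train split. For the validation and unsliced versions, see the links provided in the dataset card.

Considerations for using the data

The dataset comprises source code from various repositories. It may contain harmful or biased code, as well as sensitive information such as passwords or usernames.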
