
mvasiliniuc/iva-kotlin-codeint-clean-train

Source: Hugging Face · Updated: 2023-06-15 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/mvasiliniuc/iva-kotlin-codeint-clean-train
Resource description:
---
annotations_creators:
- crowdsourced
license: other
language_creators:
- crowdsourced
language:
- code
task_categories:
- text-generation
tags:
- code, kotlin, native Android development, curated, training
size_categories:
- 100K<n<1M
source_datasets: []
pretty_name: iva-kotlin-codeint-clean
task_ids:
- language-modeling
---

# IVA Kotlin GitHub Code Dataset

## Dataset Description

This is the curated train split of the IVA Kotlin dataset extracted from GitHub. It contains curated Kotlin files gathered to train a code generation model.

The dataset consists of 383380 Kotlin code files from GitHub. [Here is the unsliced curated dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean) and [here is the raw dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint).

### How to use it

To download the full dataset:

```python
from datasets import load_dataset

# Download the train split of the curated dataset from the Hugging Face Hub.
dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint-clean-train', split='train')
```

## Data Structure

### Data Fields

|Field|Type|Description|
|---|---|---|
|repo_name|string|Name of the GitHub repository.|
|path|string|Path of the file in the GitHub repository.|
|copies|string|Number of occurrences in the dataset.|
|content|string|Content of the source file.|
|size|string|Size of the source file in bytes.|
|license|string|License of the GitHub repository.|
|hash|string|Hash of the content field.|
|line_mean|number|Mean line length of the content.|
|line_max|number|Max line length of the content.|
|alpha_frac|number|Fraction between mean and max line length of content.|
|ratio|number|Character/token ratio of the file with tokenizer.|
|autogenerated|boolean|True if the content is autogenerated, detected by looking for keywords in the first few lines of the file.|
|config_or_test|boolean|True if the content is a configuration file or a unit test.|
|has_no_keywords|boolean|True if the file has none of the keywords of the Kotlin programming language.|
|has_few_assignments|boolean|True if the file uses the symbol '=' fewer than `minimum` times.|

### Instance

```json
{
  "repo_name": "oboenikui/UnivCoopFeliCaReader",
  "path": "app/src/main/java/com/oboenikui/campusfelica/ScannerActivity.kt",
  "copies": "1",
  "size": "5635",
  "content": "....",
  "license": "apache-2.0",
  "hash": "e88cfd99346cbef640fc540aac3bf20b",
  "line_mean": 37.8620689655,
  "line_max": 199,
  "alpha_frac": 0.5724933452,
  "ratio": 5.0222816399,
  "autogenerated": false,
  "config_or_test": false,
  "has_no_keywords": false,
  "has_few_assignments": false
}
```

## Languages

The dataset contains only Kotlin files.

```json
{
  "Kotlin": [".kt"]
}
```

## Licenses

Each entry in the dataset contains the associated license. The following is a list of the licenses involved and their occurrences.

```json
{
  "agpl-3.0": 3209,
  "apache-2.0": 90782,
  "artistic-2.0": 130,
  "bsd-2-clause": 380,
  "bsd-3-clause": 3584,
  "cc0-1.0": 155,
  "epl-1.0": 792,
  "gpl-2.0": 4432,
  "gpl-3.0": 19816,
  "isc": 345,
  "lgpl-2.1": 118,
  "lgpl-3.0": 2689,
  "mit": 31470,
  "mpl-2.0": 1444,
  "unlicense": 654
}
```

## Dataset Statistics

```json
{
  "Total size": "~207 MB",
  "Number of files": 160000,
  "Number of files under 500 bytes": 2957,
  "Average file size in bytes": 5199
}
```

## Curation Process

See [the unsliced curated dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean) for more details.

## Data Splits

This dataset contains only the train split. For the validation split and the unsliced version, see the following links:

* Clean Version Unsliced: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean
* Clean Version Valid: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-valid

# Considerations for Using the Data

The dataset comprises source code from various repositories and may contain harmful or biased code, along with sensitive information such as passwords or usernames.
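Provider:
mvasiliniuc
Summary of Original Information

Dataset Overview

Dataset Name

  • iva-kotlin-codeint-clean

Dataset Description

  • This dataset is the curated train split of the IVA Kotlin dataset extracted from GitHub. It contains 383,380 curated Kotlin code files intended for training a code generation model.

How to Use the Dataset

  • Download the full dataset with: `from datasets import load_dataset` followed by `dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint-clean-train', split='train')`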

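Data Structure

  • Data fields
    • repo_name: string, name of the GitHub repository
    • path: string, path of the file in the GitHub repository
    • copies: string, number of occurrences in the dataset
    • content: string, content of the source file
    • size: string, size of the source file in bytes
    • license: string, license of the GitHub repository
    • hash: string, hash of the content field
    • line_mean: number, mean line length of the content
    • line_max: number, max line length of the content
    • alpha_frac: number, fraction between mean and max line length of the content
    • ratio: number, character/token ratio of the file
    • autogenerated: boolean, True if the content is autogenerated, detected by looking for keywords in the first few lines of the file
    • config_or_test: boolean, True if the content is a configuration file or a unit test
    • has_no_keywords: boolean, True if the file has none of the keywords of the Kotlin programming language
    • has_few_assignments: boolean, True if the file uses the symbol = fewer than minimum times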
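The boolean flags in the field list lend themselves to a simple row predicate. The following is a minimal sketch, not the author's curation code: the helper names `passes_quality_flags` and `keep_clean`, the `max_line_length` threshold, and the second sample row are assumptions made for illustration. It works on plain dictionaries shaped like the dataset rows.

```python
from typing import Any, Dict, Iterable, List

# Boolean quality flags documented in the field list.
QUALITY_FLAGS = (
    "autogenerated",
    "config_or_test",
    "has_no_keywords",
    "has_few_assignments",
)


def passes_quality_flags(row: Dict[str, Any], max_line_length: int = 1000) -> bool:
    """Return True when no quality flag is set and no line is excessively long.

    max_line_length is a hypothetical threshold chosen for this example.
    """
    if any(row.get(flag, False) for flag in QUALITY_FLAGS):
        return False
    return row["line_max"] <= max_line_length


def keep_clean(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the rows accepted by passes_quality_flags."""
    return [row for row in rows if passes_quality_flags(row)]


sample_rows = [
    {   # Values taken from the instance shown in the dataset card.
        "path": "app/src/main/java/com/oboenikui/campusfelica/ScannerActivity.kt",
        "line_max": 199,
        "autogenerated": False,
        "config_or_test": False,
        "has_no_keywords": False,
        "has_few_assignments": False,
    },
    {   # Made-up row that is flagged as autogenerated.
        "path": "example/GeneratedBindings.kt",
        "line_max": 120,
        "autogenerated": True,
        "config_or_test": False,
        "has_no_keywords": False,
        "has_few_assignments": False,
    },
]

kept = keep_clean(sample_rows)
print([row["path"] for row in kept])
```

With the `datasets` library, the same predicate can likely be passed to `dataset.filter(passes_quality_flags)`. Because this is the curated split, many rows may already pass such a check.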

Languages

  • The dataset contains only Kotlin files.

Licenses

  • Each entry in the dataset contains the associated license. The licenses involved and their occurrence counts are:
    • agpl-3.0: 3209
    • apache-2.0: 90782
    • artistic-2.0: 130
    • bsd-2-clause: 380
    • bsd-3-clause: 3584
    • cc0-1.0: 155
    • epl-1.0: 792
    • gpl-2.0: 4432
    • gpl-3.0: 19816
    • isc: 345
    • lgpl-2.1: 118
    • lgpl-3.0: 2689
    • mit: 31470
    • mpl-2.0: 1444
    • unlicense: 654
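The license counts above can be tallied to check them against the reported file count and to estimate how much of the data falls under permissive terms. This is a sketch under one loud assumption: the `PERMISSIVE` set below is an illustrative grouping chosen for this example, not a classification given by the dataset, and it is not legal advice. The helper `is_permissive` is likewise hypothetical.

```python
# License occurrence counts as listed for this dataset.
LICENSE_COUNTS = {
    "agpl-3.0": 3209,
    "apache-2.0": 90782,
    "artistic-2.0": 130,
    "bsd-2-clause": 380,
    "bsd-3-clause": 3584,
    "cc0-1.0": 155,
    "epl-1.0": 792,
    "gpl-2.0": 4432,
    "gpl-3.0": 19816,
    "isc": 345,
    "lgpl-2.1": 118,
    "lgpl-3.0": 2689,
    "mit": 31470,
    "mpl-2.0": 1444,
    "unlicense": 654,
}

# Assumption: licenses treated as permissive for the purpose of this example.
PERMISSIVE = {
    "apache-2.0", "mit", "bsd-2-clause", "bsd-3-clause",
    "isc", "unlicense", "cc0-1.0",
}

total = sum(LICENSE_COUNTS.values())
permissive_total = sum(
    count for name, count in LICENSE_COUNTS.items() if name in PERMISSIVE
)
permissive_share = permissive_total / total


def is_permissive(row: dict) -> bool:
    """Row predicate based on the `license` field."""
    return row["license"] in PERMISSIVE


print(total, permissive_total, round(permissive_share, 3))
```

The counts sum to 160,000, which matches the number of files reported in the dataset statistics.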

Dataset Statistics

  • Total size: ~207 MB
  • Number of files: 160,000
  • Number of files under 500 bytes: 2957
  • Average file size: 5199 bytes
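Statistics of this kind can be recomputed from the `size` field, which is stored as a string and has to be converted to an integer first. The sketch below is illustrative only: `summarize_sizes` is a hypothetical helper, and the second and third sample rows are made-up values, not data from the dataset.

```python
from typing import Any, Dict, Iterable


def summarize_sizes(rows: Iterable[Dict[str, Any]], small_threshold: int = 500) -> Dict[str, int]:
    """Count files, count files below a byte threshold, and average the sizes."""
    # `size` is a string field, so convert it before doing arithmetic.
    sizes = [int(row["size"]) for row in rows]
    if not sizes:
        return {"number_of_files": 0, "files_under_threshold": 0, "average_size_bytes": 0}
    return {
        "number_of_files": len(sizes),
        "files_under_threshold": sum(1 for size in sizes if size < small_threshold),
        "average_size_bytes": sum(sizes) // len(sizes),  # integer average in bytes
    }


sample_rows = [
    {"size": "5635"},  # size of the instance shown in the dataset card
    {"size": "320"},   # made-up small file
    {"size": "1045"},  # made-up file
]

summary = summarize_sizes(sample_rows)
print(summary)
```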

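Data Splits

  • The dataset contains only the train split. For the validation split and the unsliced version, see the links provided.

Considerations for Using the Data

  • The dataset comprises source code from various repositories and may contain harmful or biased code, as well as sensitive information such as passwords or usernames.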
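Because the content may include passwords or similar values, a coarse scan of the `content` field before training can flag files for review. The following is a rough heuristic sketch, not part of the dataset's curation process: the regular expression, the keyword list, and the helper `find_suspect_lines` are assumptions for illustration, and a pattern like this will miss many secrets and raise false positives.

```python
import re
from typing import List

# Heuristic: an identifier containing a sensitive word, assigned a string
# literal of at least four non-space characters.
SECRET_ASSIGNMENT = re.compile(
    r"""
    \b\w*(password|passwd|secret|api_?key|token)\w*   # suspicious identifier
    \s*=\s*                                           # assignment
    "([^"\s]{4,})"                                    # non-trivial string literal
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_suspect_lines(content: str) -> List[int]:
    """Return 1-based line numbers that look like hard-coded credentials."""
    return [
        number
        for number, line in enumerate(content.splitlines(), start=1)
        if SECRET_ASSIGNMENT.search(line)
    ]


# Made-up Kotlin snippet used only to exercise the heuristic.
sample = "\n".join([
    'val userName = "demo"',
    'val apiKey = "abcd1234efgh"',
    'val password = ""',
])

print(find_suspect_lines(sample))
```

Rows with a non-empty result could be dropped or routed to manual review, depending on how strict the training pipeline needs to be.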