mvasiliniuc/iva-kotlin-codeint-clean-train
Source: Hugging Face. Updated 2023-06-15. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mvasiliniuc/iva-kotlin-codeint-clean-train
Resource description:
---
annotations_creators:
- crowdsourced
license: other
language_creators:
- crowdsourced
language:
- code
task_categories:
- text-generation
tags:
- code, kotlin, native Android development, curated, training
size_categories:
- 100K<n<1M
source_datasets: []
pretty_name: iva-kotlin-codeint-clean
task_ids:
- language-modeling
---
# IVA Kotlin GitHub Code Dataset
## Dataset Description
This is the curated train split of the IVA Kotlin dataset extracted from GitHub.
It contains curated Kotlin files gathered to train a code generation model.
The dataset consists of 383380 Kotlin code files from GitHub.
[Here is the unsliced curated dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean) and [here is the raw dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint).
### How to use it
To download the full dataset:
```python
from datasets import load_dataset
dataset = load_dataset('mvasiliniuc/iva-kotlin-codeint-clean-train', split='train')
```
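When only a few records are needed, it can help to inspect a handful before committing to the full download. The helper below is a minimal sketch, not part of the dataset tooling: `first_n` is a hypothetical name, and it works on any iterable of record dictionaries. The stand-in records are made up for illustration and only mimic the `repo_name` and `license` fields.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_n(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Collect at most n records from any iterable, leaving the rest unread."""
    return list(islice(records, n))


# Stand-in records (hypothetical values) so the sketch runs without network access.
fake_records = ({"repo_name": f"user/repo{i}", "license": "mit"} for i in range(1000))

preview = first_n(fake_records, 3)
print([record["repo_name"] for record in preview])
```

The same helper should accept the object returned by `load_dataset(..., split='train', streaming=True)`, since the `datasets` library yields records lazily in streaming mode.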
## Data Structure
### Data Fields
|Field|Type|Description|
|---|---|---|
|repo_name|string|name of the GitHub repository|
|path|string|path of the file in GitHub repository|
|copies|string|number of occurrences in dataset|
|content|string|content of source file|
|size|string|size of the source file in bytes|
|license|string|license of GitHub repository|
|hash|string|Hash of content field.|
|line_mean|number|Mean line length of the content.|
|line_max|number|Max line length of the content.|
|alpha_frac|number|Fraction between mean and max line length of content.|
|ratio|number|Character/token ratio of the file with tokenizer.|
|autogenerated|boolean|True if the content is flagged as autogenerated, based on keywords found in the first few lines of the file.|
|config_or_test|boolean|True if the content is a configuration file or a unit test.|
|has_no_keywords|boolean|True if the file has none of the keywords of the Kotlin programming language.|
|has_few_assignments|boolean|True if the file uses the symbol '=' fewer than `minimum` times.|
### Instance
```json
{
"repo_name":"oboenikui/UnivCoopFeliCaReader",
"path":"app/src/main/java/com/oboenikui/campusfelica/ScannerActivity.kt",
"copies":"1",
"size":"5635",
"content":"....",
"license":"apache-2.0",
"hash":"e88cfd99346cbef640fc540aac3bf20b",
"line_mean":37.8620689655,
"line_max":199,
"alpha_frac":0.5724933452,
"ratio":5.0222816399,
"autogenerated":false,
"config_or_test":false,
"has_no_keywords":false,
"has_few_assignments":false
}
```
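The field table types `copies` and `size` as strings, and the instance above shows them quoted, so numeric comparisons need a conversion first. The following is a minimal sketch under that reading. `normalize_record` and `passes_quality_flags` are hypothetical helper names, and treating a `True` flag as a reason to skip a file is an assumption of this example, not a rule stated by the dataset.

```python
from typing import Any, Dict

# Boolean flag fields listed in the Data Fields table.
QUALITY_FLAGS = ("autogenerated", "config_or_test", "has_no_keywords", "has_few_assignments")


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with `copies` and `size` converted from str to int."""
    normalized = dict(record)
    normalized["copies"] = int(record["copies"])
    normalized["size"] = int(record["size"])
    return normalized


def passes_quality_flags(record: Dict[str, Any]) -> bool:
    """True when none of the four boolean flags is set (assumed filter policy)."""
    return not any(record[flag] for flag in QUALITY_FLAGS)


# The sample instance from the dataset card, with `content` elided as in the card.
sample = {
    "repo_name": "oboenikui/UnivCoopFeliCaReader",
    "path": "app/src/main/java/com/oboenikui/campusfelica/ScannerActivity.kt",
    "copies": "1",
    "size": "5635",
    "content": "....",
    "license": "apache-2.0",
    "hash": "e88cfd99346cbef640fc540aac3bf20b",
    "line_mean": 37.8620689655,
    "line_max": 199,
    "alpha_frac": 0.5724933452,
    "ratio": 5.0222816399,
    "autogenerated": False,
    "config_or_test": False,
    "has_no_keywords": False,
    "has_few_assignments": False,
}

record = normalize_record(sample)
print(record["size"] > 500, passes_quality_flags(record))
```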
## Languages
The dataset contains only Kotlin files.
```json
{
"Kotlin": [".kt"]
}
```
## Licenses
Each entry in the dataset contains the associated license. The following is a list of licenses involved and their occurrences.
```json
{
"agpl-3.0":3209,
"apache-2.0":90782,
"artistic-2.0":130,
"bsd-2-clause":380,
"bsd-3-clause":3584,
"cc0-1.0":155,
"epl-1.0":792,
"gpl-2.0":4432,
"gpl-3.0":19816,
"isc":345,
"lgpl-2.1":118,
"lgpl-3.0":2689,
"mit":31470,
"mpl-2.0":1444,
"unlicense":654
}
```
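As a quick consistency check, the per-license counts above can be summed and compared with the "Number of files" value reported under Dataset Statistics. This is a small sketch using only the numbers listed above; the variable names are illustrative.

```python
license_counts = {
    "agpl-3.0": 3209,
    "apache-2.0": 90782,
    "artistic-2.0": 130,
    "bsd-2-clause": 380,
    "bsd-3-clause": 3584,
    "cc0-1.0": 155,
    "epl-1.0": 792,
    "gpl-2.0": 4432,
    "gpl-3.0": 19816,
    "isc": 345,
    "lgpl-2.1": 118,
    "lgpl-3.0": 2689,
    "mit": 31470,
    "mpl-2.0": 1444,
    "unlicense": 654,
}

# Total number of entries across all licenses.
total_files = sum(license_counts.values())

# The three most frequent licenses, as (name, count) pairs in descending order.
top_three = sorted(license_counts.items(), key=lambda item: item[1], reverse=True)[:3]

print(total_files)  # 160000
for name, count in top_three:
    print(f"{name}: {count} ({count / total_files:.1%})")
```

The total of 160000 matches the file count given in the statistics block.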
## Dataset Statistics
```json
{
"Total size": "~207 MB",
"Number of files": 160000,
"Number of files under 500 bytes": 2957,
"Average file size in bytes": 5199
}
```
## Curation Process
See [the unsliced curated dataset](https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean) for more details.
## Data Splits
The dataset contains only a train split. For the validation and unsliced versions, see the following links:
* Clean Version Unsliced: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean
* Clean Version Valid: https://huggingface.co/datasets/mvasiliniuc/iva-kotlin-codeint-clean-valid
# Considerations for Using the Data
The dataset comprises source code from various repositories, potentially containing harmful or biased code,
along with sensitive information such as passwords or usernames.
Provider:
mvasiliniuc



