CodedotAI/code_clippy

Name: CodedotAI/code_clippy
Creator: CodedotAI
Published: 2022-11-17 19:54:28
License: No description available

Source: Hugging Face. Updated: 2022-11-17. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/CodedotAI/code_clippy

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
language:
- code
license:
- gpl-3.0
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
pretty_name: Code Clippy
---

# Dataset Card for Code Clippy Data

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://the-eye.eu/public/AI/training_data/code_clippy_data/
- **Repository:** https://github.com/ncoop57/gpt-code-clippy
- **Paper:** [Not yet :)]
- **Leaderboard:** [Not yet :)]
- **Point of Contact:** [Nathan Cooper](mailto:nacooper01@email.wm.edu)

### Dataset Summary

This dataset was generated by selecting GitHub repositories from a large collection of repositories. These repositories were collected from https://seart-ghs.si.usi.ch/ and the GitHub portion of [The Pile](https://github.com/EleutherAI/github-downloader) (performed on July 7th, 2021).
The goal of this dataset is to provide a training set for pretraining large language models on code, helping software engineering researchers better understand the impact of such models on software-related tasks such as code autocompletion. The dataset is split into train, validation, and test splits. There is a version containing duplicates (209 GB compressed) and one where exact duplicates are removed (132 GB compressed). It contains mostly JavaScript and Python code, but other programming languages are included to varying degrees.

### Supported Tasks and Leaderboards

- `language-modeling`: The dataset can be used to train a language model for programming languages, which consists of pretraining or finetuning a model to predict missing tokens, either causally or masked, given some context. Success on this task is typically measured by achieving a *low* perplexity score.

### Languages

Multiple programming languages are included in the dataset.

## Dataset Structure

### Data Instances

```
{
  "id": datasets.Value("int64"),
  "text": datasets.Value("string"),
  "repo_name": datasets.Value("string"),
  "stars": datasets.Value("string"),
  "repo_language": datasets.Value("string"),
  "file_name": datasets.Value("string"),
  "mime_type": datasets.Value("string")
}
```

### Data Fields

- `id`: A unique identifier for the data instance.
- `text`: The text of the code.
- `repo_name`: The name of the repository.
- `stars`: The number of stars the repository has.
- `repo_language`: The programming language of the repository.
- `file_name`: The name of the file.
- `mime_type`: The MIME type of the file.

### Data Splits

| Size in GB  | Train | Valid | Test |
| ----------- | ----- | ----- | ---- |
| Duplicate   | 194   | 9     | 6.3  |
| Deduplicate | 126   | 3.3   | 3.1  |

## Dataset Creation

### Curation Rationale

To have a code dataset that is large enough to properly train a large language model on.
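
The card distinguishes a version with duplicates from one where exact duplicates are removed, but it does not say how the removal was done. The following is only a minimal sketch of one common approach, assuming that "exact duplicate" means two records whose `text` field is byte-for-byte identical. The `CodeRecord` type mirrors the fields listed under Data Fields; `drop_exact_duplicates` and `make_record` are hypothetical helper names introduced here for illustration and are not part of the dataset's tooling.

```python
import hashlib
from typing import Iterable, Iterator, TypedDict


class CodeRecord(TypedDict):
    """Record shape mirroring the fields listed under Data Fields."""
    id: int
    text: str
    repo_name: str
    stars: str  # declared as a string in the card's schema
    repo_language: str
    file_name: str
    mime_type: str


def drop_exact_duplicates(records: Iterable[CodeRecord]) -> Iterator[CodeRecord]:
    """Yield the first record seen for each distinct `text` value.

    Assumption: an exact duplicate is a record whose `text` is identical
    to that of an earlier record. Hashing keeps memory use bounded by the
    number of distinct files rather than by their total size.
    """
    seen = set()  # SHA-256 digests of texts already yielded
    for record in records:
        digest = hashlib.sha256(record["text"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


def make_record(idx: int, text: str, file_name: str) -> CodeRecord:
    """Build a toy record for demonstration (all values are made up)."""
    return CodeRecord(
        id=idx,
        text=text,
        repo_name="example/repo",
        stars="0",
        repo_language="Python",
        file_name=file_name,
        mime_type="text/x-python",
    )


if __name__ == "__main__":
    toy = [
        make_record(0, "print('a')", "a.py"),
        make_record(1, "print('a')", "copy_of_a.py"),
        make_record(2, "print('b')", "b.py"),
    ]
    for kept in drop_exact_duplicates(toy):
        print(kept["id"], kept["file_name"])
```

Because the function is a generator, it can be applied to a stream of records without loading the whole corpus into memory; near-duplicates (for example, files differing only in whitespace) are deliberately not handled.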

### Source Data

#### Initial Data Collection and Normalization

- [The Pile](https://github.com/EleutherAI/github-downloader)
- [Seart-GHS](https://seart-ghs.si.usi.ch/)

Repositories were collected from both sources, and the helper script from https://github.com/EleutherAI/github-downloader was used to download them. Files were scraped from the downloaded repositories, skipping files with extensions associated with binary or other non-textual or autogenerated content, and the output was converted into the [LM_Dataformat](https://pypi.org/project/lm-dataformat/) format.

#### Who are the source language producers?

Software developers.

### Annotations

#### Annotation process

No annotation was performed.

#### Who are the annotators?

N/A

### Personal and Sensitive Information

Since this data was collected from public repositories, personal and sensitive information may be included in the data where developers accidentally or deliberately uploaded their secret keys, passwords, API keys, emails, etc.

## Considerations for Using the Data

### Social Impact of Dataset

The paper ["Evaluating Large Language Models Trained on Code"](https://arxiv.org/abs/2107.03374) from OpenAI has a good discussion of what the impact of a large language model trained on code could be. Therefore, some parts of their discussion are highlighted here as they pertain to this dataset and to models that may be trained from it, **along with some differences in views from the paper, particularly around legal implications**.

1. **Over-reliance:** A language model trained on large datasets such as this one for the task of autogenerating code may generate plausible solutions that appear correct but are not necessarily correct. Not properly evaluating the generated code may have negative consequences such as the introduction of bugs or security vulnerabilities.
   Therefore, it is important that users are aware of the limitations and potential negative consequences of using a language model trained on this dataset.
2. **Economic and labor market impacts:** Large language models trained on large code datasets such as this one, and capable of generating high-quality code, have the potential to automate part of the software development process. This may negatively impact software developers. However, as discussed in the paper and as shown in the Summary Report for software developers from [O*NET OnLine](https://www.onetonline.org/link/summary/15-1252.00), developers don't just write software.
3. **Security implications:** No filtering or checking for vulnerabilities or buggy code was performed. This means that the dataset may contain code that is malicious or contains vulnerabilities. Therefore, any model trained on this dataset may generate vulnerable, buggy, or malicious code. In safety-critical software, this could lead to software that works improperly and could result in serious consequences depending on the software. Additionally, a model trained on this dataset may be used to generate malicious code on purpose in order to perform ransomware or other such attacks.
4. **Legal implications:** No filtering was performed on licensed code. This means that the dataset may contain restrictively licensed code. As discussed in the paper, public GitHub repositories may fall under "fair use." However, there have been few to no previous cases of such usage of licensed, publicly available code. Therefore, any model trained on this dataset may be required to obey license terms that align with the software it was trained on, such as GPL-3.0, which is why we purposefully put this dataset under the GPL-3.0 license. The legal ramifications of using a language model trained on this dataset are unclear.

### Discussion of Biases

The programming languages most represented in this dataset are JavaScript and Python.
Therefore, other still-popular languages such as C and C++ are less represented, and model performance for these languages will be comparatively lower. Additionally, this dataset only contains public repositories and so may not be representative of code written by private developers. No filtering was performed for potentially racist, offensive, or otherwise inappropriate content. Therefore, there may be such content in the dataset, and it will be reflected in models trained on it.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Nathan Cooper, Artashes Arutiunian, Santiago Hincapié-Potes, Ben Trevett, Arun Raja, Erfan Hossami, Mrinal Mathur, and contributors!

### Licensing Information

This repository is under the GPL-3.0 license.

### Citation Information

```
@misc{cooper-2021-code-clippy-data,
  author = {Nathan Cooper, Artashes Arutiunian, Santiago Hincapié-Potes, Ben Trevett, Arun Raja, Erfan Hossami, Mrinal Mathur, and contributors},
  title = {{Code Clippy Data: A large dataset of code data from Github for research into code language models}},
  month = jul,
  year = 2021,
  version = {1.0},
  publisher = {GitHub},
  url = {https://github.com/ncoop57/gpt-code-clippy}
}
```

### Contributions

Thanks to [@ncoop57](https://github.com/ncoop57), [@arampacha](https://github.com/arampacha), [@shpotes](https://github.com/shpotes), [@bentrevett](https://github.com/bentrevett), [@arunraja-hub](https://github.com/arunraja-hub), [@taisazero](https://github.com/taisazero), [@Mrinal18](https://github.com/Mrinal18), and contributors for adding this dataset.

Provider:

CodedotAI

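Summary of original information

Dataset overview

Dataset name

Name: Code Clippy

Dataset summary

Summary: The dataset consists of code collected from GitHub repositories and is mainly intended for pretraining large language models in support of software engineering research, such as code autocompletion. It is divided into train, validation, and test splits and contains mostly JavaScript and Python code, along with other programming languages.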

Supported tasks and evaluation metrics

Task: language modeling
Evaluation metric: low perplexity
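
Perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the model assigns higher probability to the held-out code. The following is a minimal sketch of that computation, assuming the per-token natural-log probabilities have already been obtained from some model; the function name `perplexity` and the toy inputs are illustrative and do not come from the dataset card.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Compute perplexity from per-token natural-log probabilities.

    perplexity = exp( -(1/N) * sum_i log p(token_i | context_i) )
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_negative_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_likelihood)


if __name__ == "__main__":
    # A model that spreads probability uniformly over 4 candidate tokens
    # at every step has a perplexity of about 4.
    uniform_over_four = [math.log(0.25)] * 3
    print(perplexity(uniform_over_four))

    # A more confident model (probability 0.9 for each correct token)
    # scores lower, which is the desired direction.
    confident = [math.log(0.9)] * 3
    print(perplexity(confident))
```

Perplexities are only comparable between models that share a tokenizer, since the average is taken per token.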

Languages

Languages: multiple programming languages

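Dataset structure

Data instances: each instance contains an id, text, repository name, star count, repository language, file name, and MIME type.
Data splits: the dataset is available in a version with duplicates and a deduplicated version, at 209 GB and 132 GB (compressed) respectively.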

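Dataset creation

Sources: GitHub repositories from The Pile and Seart-GHS.
Annotations: none.
Personal and sensitive information: the data may contain sensitive information that developers uploaded inadvertently.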
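
Since no filtering for secrets is described, users may want to screen the `text` field before training. The following is only a rough heuristic sketch: the pattern names, the regular expressions, and the function `find_sensitive` are illustrative assumptions chosen here, not part of the dataset's pipeline, and they will miss many real secrets while also flagging some harmless strings.

```python
import re
from typing import Dict, List

# Illustrative patterns only (assumptions, not an exhaustive or official list).
SENSITIVE_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM headers that introduce a private key block.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
    ),
    # Quoted literals assigned to credential-like names.
    "credential_assignment": re.compile(
        r"\b(?:password|passwd|secret|api_key|token)\b\s*[:=]\s*['\"][^'\"]{4,}['\"]",
        re.IGNORECASE,
    ),
    # Simple email address shape.
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def find_sensitive(text: str) -> Dict[str, List[str]]:
    """Return the matches for each pattern that fires at least once."""
    hits: Dict[str, List[str]] = {}
    for name, pattern in SENSITIVE_PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            hits[name] = matches
    return hits


if __name__ == "__main__":
    sample = (
        'AWS_KEY = "AKIAABCDEFGHIJKLMNOP"\n'
        'password = "hunter22"\n'
        "# contact: dev@example.com\n"
    )
    for name, matches in find_sensitive(sample).items():
        print(name, matches)
```

A record with any hit could be dropped or redacted before training; which action is appropriate depends on the use case.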

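Considerations for using the data

Social impact: may affect the software development process and developer employment.
Biases: the dataset contains mostly JavaScript and Python code and may not fully represent all programming languages.
Legal implications: the dataset may contain code under restrictive licenses; use is subject to the GPL-3.0 license.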

Dataset curation

Curators: Nathan Cooper et al.
License: GPL-3.0

Citation information

@misc{cooper-2021-code-clippy-data, author = {Nathan Cooper, Artashes Arutiunian, Santiago Hincapié-Potes, Ben Trevett, Arun Raja, Erfan Hossami, Mrinal Mathur, and contributors}, title = {{Code Clippy Data: A large dataset of code data from Github for research into code language models}}, month = jul, year = 2021, version = {1.0}, publisher = {GitHub}, url = {https://github.com/ncoop57/gpt-code-clippy} }
