ArtifactAI/arxiv_cplusplus_research_code
Source: Hugging Face. Updated 2023-07-27; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ArtifactAI/arxiv_cplusplus_research_code
Resource overview:
---
license: openrail
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: file
    dtype: string
  - name: code
    dtype: string
  - name: file_length
    dtype: int64
  - name: avg_line_length
    dtype: float64
  - name: max_line_length
    dtype: int64
  - name: extension_type
    dtype: string
  splits:
  - name: train
    num_bytes: 21983781651.45426
    num_examples: 1634156
  download_size: 10635788503
  dataset_size: 21983781651.45426
task_categories:
- text-generation
language:
- en
pretty_name: arxiv_cplusplus_research_code
size_categories:
- 10B<n<100B
---
# Dataset card for ArtifactAI/arxiv_cplusplus_research_code
## Dataset Description
https://huggingface.co/datasets/ArtifactAI/arxiv_cplusplus_research_code
### Dataset Summary
ArtifactAI/arxiv_cplusplus_research_code contains over 10.6GB of source code files referenced strictly in ArXiv papers. It is intended as a curated dataset for Code LLMs.
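The size figures in the card metadata can be related to the 10.6GB quoted above with a little arithmetic. The sketch below uses only the `download_size`, `dataset_size`, and `num_examples` values from the metadata; treating 1GB as 10^9 bytes is an assumption.

```python
# Values copied from the dataset card metadata.
download_size = 10_635_788_503        # bytes to download
dataset_size = 21_983_781_651.45426   # bytes once loaded
num_examples = 1_634_156              # rows in the train split

GB = 10**9  # assumption: decimal gigabytes

download_gb = download_size / GB
dataset_gb = dataset_size / GB
avg_bytes_per_example = dataset_size / num_examples

print(f"download: {download_gb:.1f} GB")
print(f"loaded:   {dataset_gb:.1f} GB")
print(f"average bytes per example: {avg_bytes_per_example:.0f}")
```

Under that assumption, the 10.6GB figure matches the download size, while the loaded dataset is roughly twice as large.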
### How to use it
```python
from datasets import load_dataset

# Option 1: download the full dataset (10.6GB of data)
ds = load_dataset("ArtifactAI/arxiv_cplusplus_research_code", split="train")

# Option 2: stream the dataset (downloads data only as it is needed)
ds = load_dataset("ArtifactAI/arxiv_cplusplus_research_code", streaming=True, split="train")
for sample in ds:
    print(sample["code"])
```
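When streaming, it is often useful to look at only a few matching records rather than iterate over everything. The following is a minimal sketch that works on any iterable of records, so it can be tried without downloading anything. The helper name `first_samples_with_extension` and the toy records are hypothetical, and it assumes `extension_type` is stored without a leading dot (for example `"cpp"`), which the card does not confirm.

```python
from itertools import islice

def first_samples_with_extension(samples, extension, limit=3):
    """Return up to `limit` records whose `extension_type` equals `extension`."""
    matching = (s for s in samples if s["extension_type"] == extension)
    return list(islice(matching, limit))

# Made-up records that mimic the dataset schema (not real data).
toy_samples = [
    {"repo": "user/a", "file": "src/main.cpp", "code": "int main() {}", "extension_type": "cpp"},
    {"repo": "user/a", "file": "include/a.h", "code": "#pragma once", "extension_type": "h"},
    {"repo": "user/b", "file": "b.cpp", "code": "void f();", "extension_type": "cpp"},
]

cpp_only = first_samples_with_extension(toy_samples, "cpp")
print([s["file"] for s in cpp_only])
```

Because the helper stops after `limit` matches, passing it a streamed dataset in place of `toy_samples` should fetch only as much data as it needs to find them.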
## Dataset Structure
### Data Instances
Each data instance corresponds to one file. The content of the file is in the `code` feature, and other features (`repo`, `file`, etc.) provide some metadata.
### Data Fields
- `repo` (string): code repository name.
- `file` (string): file path in the repository.
- `code` (string): code within the file.
- `file_length` (integer): number of characters in the file.
- `avg_line_length` (float): the average line length of the file.
- `max_line_length` (integer): the maximum line length of the file.
- `extension_type` (string): file extension.
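The three length fields can be recomputed from `code`, which helps when checking or re-filtering records. This is a sketch under stated assumptions: the card does not say how lines were split or whether newline characters count toward line length, so the function `line_stats` (a hypothetical name) splits on line boundaries and excludes the newline from each line's length.

```python
def line_stats(code):
    """Compute length statistics for one file's contents.

    Assumptions (not confirmed by the dataset card): lengths are counted
    in characters, and line lengths exclude the trailing newline.
    """
    lines = code.splitlines()
    lengths = [len(line) for line in lines] or [0]  # guard against empty files
    return {
        "file_length": len(code),
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
    }

stats = line_stats("int main() {\n  return 0;\n}\n")
print(stats)
```

Values computed this way may differ slightly from the stored fields if the dataset used a different convention.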
### Data Splits
The dataset has no predefined splits; all data is loaded as the `train` split by default.
## Dataset Creation
### Source Data
#### Initial Data Collection and Normalization
The names of 34,099 active GitHub repositories were extracted from [ArXiv](https://arxiv.org/) papers published from ArXiv's inception through July 21st, 2023, totaling 773G of compressed GitHub repositories.
These repositories were then filtered, and the code from files with the "cpp", "cxx", "cc", "h", "hpp", and "hxx" extensions was extracted into 1.4 million files.
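The extension filter described above can be sketched as a simple path check. This is not the curators' actual pipeline, which the card does not show; the function name `is_cpp_source` is hypothetical, and matching extensions case-sensitively is an assumption.

```python
from pathlib import PurePosixPath

# Extensions listed in the dataset card.
CPP_EXTENSIONS = {"cpp", "cxx", "cc", "h", "hpp", "hxx"}

def is_cpp_source(path):
    """Return True if the file path ends in one of the listed extensions."""
    suffix = PurePosixPath(path).suffix  # e.g. ".hpp", or "" when there is none
    return suffix[1:] in CPP_EXTENSIONS

paths = ["src/solver.cpp", "include/solver.hpp", "README.md", "scripts/run.py", "Makefile"]
kept = [p for p in paths if is_cpp_source(p)]
print(kept)
```

Note that "h" also matches plain C headers, so a filter of this kind keeps some C code alongside C++.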
#### Who are the source language producers?
The source (code) language producers are the GitHub users who created each unique repository.
### Personal and Sensitive Information
The released dataset may contain sensitive information such as emails, IP addresses, and API/SSH keys that have previously been published to public repositories on GitHub.
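Users who want to screen records before training could start with a rough pattern scan like the one below. It is a heuristic sketch, not a complete detector: the regular expressions and the function name `find_sensitive` are illustrative assumptions, they do not cover API or SSH keys, and the IPv4 pattern also matches things like four-part version numbers.

```python
import re

# Deliberately simple patterns; expect both false positives and misses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_sensitive(code):
    """Return candidate email addresses and IPv4 addresses found in `code`."""
    return {
        "emails": EMAIL_RE.findall(code),
        "ipv4": IPV4_RE.findall(code),
    }

# Made-up snippet using reserved example values.
snippet = '// contact: dev@example.com\nconst char* host = "192.0.2.10";\n'
hits = find_sensitive(snippet)
print(hits)
```

A scan like this is best treated as a first pass for flagging files to review, not as proof that a record is clean.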
## Additional Information
### Dataset Curators
Matthew Kenney, Artifact AI, matt@artifactai.com
### Citation Information
```
@misc{arxiv_cplusplus_research_code,
title={arxiv_cplusplus_research_code},
author={Matthew Kenney},
year={2023}
}
```
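Provider:
ArtifactAI
## Summary of Original Information
### Dataset Overview
- Dataset name: arxiv_cplusplus_research_code
- Dataset size: 10.6GB
- Dataset purpose: a dataset of source code files for Code LLMs; the files are referenced strictly in ArXiv papers.
### Dataset Features
- `repo` (string): code repository name.
- `file` (string): file path in the repository.
- `code` (string): code within the file.
- `file_length` (integer): number of characters in the file.
- `avg_line_length` (float): the average line length of the file.
- `max_line_length` (integer): the maximum line length of the file.
- `extension_type` (string): file extension.
### Dataset Structure
- Data instances: each data instance corresponds to one file; the file content is in the `code` feature, and the other features provide metadata.
- Data fields: the features listed above.
- Data splits: the dataset has no splits and is loaded as the train split by default.
### Dataset Creation
- Source data: 34,099 active GitHub repository names extracted from ArXiv papers, totaling 773G of compressed GitHub repositories.
- Data filtering and extraction: code was extracted from files with the "cpp", "cxx", "cc", "h", "hpp", and "hxx" extensions, yielding 1.4 million files.
- Potentially sensitive information: the dataset may contain sensitive information such as emails, IP addresses, and API/SSH keys.
### Dataset License
- License: openrail
### Citation Information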
@misc{arxiv_cplusplus_research_code, title={arxiv_cplusplus_research_code}, author={Matthew Kenney}, year={2023} }



