AlgorithmicResearchGroup/arxiv_cplusplus_research_code
Hugging Face | Updated 2024-09-04 | Indexed 2025-04-08
Download link:
https://hf-mirror.com/datasets/AlgorithmicResearchGroup/arxiv_cplusplus_research_code
Resource description:
---
license: openrail
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: file
    dtype: string
  - name: code
    dtype: string
  - name: file_length
    dtype: int64
  - name: avg_line_length
    dtype: float64
  - name: max_line_length
    dtype: int64
  - name: extension_type
    dtype: string
  splits:
  - name: train
    num_bytes: 21983781651.45426
    num_examples: 1634156
  download_size: 10635788503
  dataset_size: 21983781651.45426
task_categories:
- text-generation
language:
- en
pretty_name: arxiv_cplusplus_research_code
size_categories:
- 10B<n<100B
---
# Dataset card for AlgorithmicResearchGroup/arxiv_cplusplus_research_code
## Dataset Description
https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_cplusplus_research_code
### Dataset Summary
AlgorithmicResearchGroup/arxiv_cplusplus_research_code contains over 10.6GB of C++ source code files referenced strictly in ArXiv papers. It serves as a curated dataset for Code LLMs.
### How to use it
```python
from datasets import load_dataset

# Full dataset (10.6GB of data)
ds = load_dataset("AlgorithmicResearchGroup/arxiv_cplusplus_research_code", split="train")

# Dataset streaming (downloads the data only as needed)
ds = load_dataset("AlgorithmicResearchGroup/arxiv_cplusplus_research_code", streaming=True, split="train")
for sample in ds:
    print(sample["code"])
```
## Dataset Structure
### Data Instances
Each data instance corresponds to one file. The content of the file is in the `code` feature, and other features (`repo`, `file`, etc.) provide some metadata.
### Data Fields
- `repo` (string): code repository name.
- `file` (string): file path in the repository.
- `code` (string): code within the file.
- `file_length` (integer): number of characters in the file.
- `avg_line_length` (float): the average line length of the file.
- `max_line_length` (integer): the maximum line length of the file.
- `extension_type`: (string): file extension.
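
As a rough illustration of how the length fields relate to `code`, the sketch below recomputes them from a string. The helper name `file_stats` is hypothetical, and the exact definitions used to build the dataset (for example, whether newline characters count toward line length) are not stated in this card, so treat the formulas as assumptions.

```python
def file_stats(code: str) -> dict:
    """Compute length statistics for one source file held in a string.

    Assumption: line lengths exclude the newline character; the dataset's
    own definition may differ.
    """
    lines = code.splitlines() or [""]
    line_lengths = [len(line) for line in lines]
    return {
        "file_length": len(code),  # total characters, newlines included
        "avg_line_length": sum(line_lengths) / len(line_lengths),
        "max_line_length": max(line_lengths),
    }


example = "int main() {\n  return 0;\n}\n"
stats = file_stats(example)
print(stats)
```

Comparing these values with the stored fields on a few samples is a quick way to check which definition the dataset actually follows.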
### Data Splits
The dataset has no predefined splits; all data is loaded as the `train` split by default.
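
Because only a `train` split is provided, a held-out set has to be derived by the user. One possible approach, sketched here and not part of the dataset itself, is to assign each sample to a split by hashing its `repo` value, so that all files from one repository stay on the same side. The function name `assign_split` and the 10% test fraction are illustrative assumptions.

```python
import hashlib


def assign_split(repo: str, test_fraction: float = 0.1) -> str:
    """Deterministically map a repository name to "train" or "test".

    Hashing the repo name (rather than sampling files at random) keeps
    every file of a repository in the same split.
    """
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical repository names, used only to show the call.
for name in ["some-user/solver", "another-user/renderer"]:
    print(name, assign_split(name))
```

With a loaded dataset, the same function could be applied through a filter such as `ds.filter(lambda s: assign_split(s["repo"]) == "test")`.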
## Dataset Creation
### Source Data
#### Initial Data Collection and Normalization
The names of 34,099 active GitHub repositories were extracted from [ArXiv](https://arxiv.org/) papers published from the archive's inception through July 21st, 2023, totaling 773 GB of compressed GitHub repositories.
These repositories were then filtered, and the code from files with the extensions "cpp", "cxx", "cc", "h", "hpp", and "hxx" was extracted into 1.4 million files.
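
The extension filter described above can be sketched as follows. This is not the curators' actual pipeline: the helper names are hypothetical, and matching extensions case-sensitively against the list in this card is an assumption.

```python
from pathlib import Path

# Extensions listed in this card.
CPP_EXTENSIONS = {"cpp", "cxx", "cc", "h", "hpp", "hxx"}


def extension_type(path: str) -> str:
    """Return the file extension without the leading dot ("" if none)."""
    return Path(path).suffix.lstrip(".")


def is_cpp_file(path: str) -> bool:
    """True if the path ends in one of the listed extensions."""
    return extension_type(path) in CPP_EXTENSIONS


# Hypothetical repository paths, used only for illustration.
paths = ["src/solver.cpp", "include/solver.hpp", "README.md", "Makefile"]
kept = [p for p in paths if is_cpp_file(p)]
print(kept)
```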
#### Who are the source language producers?
The source (code) language producers are the GitHub users who created the unique repositories from which the code was drawn.
### Personal and Sensitive Information
The released dataset may contain sensitive information such as email addresses, IP addresses, and API/SSH keys that have previously been published to public repositories on GitHub.
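
Users who want to screen samples before training could start from a simple pattern scan like the one below. It is a minimal sketch, not a complete detector: the regular expressions are deliberately loose assumptions, they will produce false positives (version strings that look like IP addresses, for instance), and they do not cover API or SSH keys.

```python
import re

# Loose illustrative patterns; real screening needs stricter rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_sensitive(code: str) -> dict:
    """Return candidate email addresses and IPv4 addresses found in code."""
    return {
        "emails": EMAIL_RE.findall(code),
        "ipv4": IPV4_RE.findall(code),
    }


# Made-up snippet using reserved documentation values.
snippet = '// contact: jane.doe@example.org\nconst char* host = "192.0.2.10";\n'
hits = find_sensitive(snippet)
print(hits)
```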
## Additional Information
### Dataset Curators
Matthew Kenney, AlgorithmicResearchGroup, matt@algorithmicresearchgroup.com
### Citation Information
```
@misc{arxiv_cplusplus_research_code,
title={arxiv_cplusplus_research_code},
author={Matthew Kenney},
year={2023}
}
```
Provider: AlgorithmicResearchGroup



