ArtifactAI/arxiv_research_code

Name: ArtifactAI/arxiv_research_code
Creator: ArtifactAI
Published: 2023-07-26 19:13:22
License: bigscience-openrail-m

Hugging Face · Updated 2023-07-26 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ArtifactAI/arxiv_research_code

Resource description:

---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: file
    dtype: string
  - name: code
    dtype: string
  - name: file_length
    dtype: int64
  - name: avg_line_length
    dtype: float64
  - name: max_line_length
    dtype: int64
  - name: extension_type
    dtype: string
  splits:
  - name: train
    num_bytes: 63445188751
    num_examples: 4716175
  download_size: 21776760509
  dataset_size: 63445188751
license: bigscience-openrail-m
task_categories:
- text-generation
language:
- en
pretty_name: arxiv_research_code
size_categories:
- 10B<n<100B
---

# Dataset Card for "ArtifactAI/arxiv_research_code"

## Dataset Description

https://huggingface.co/datasets/ArtifactAI/arxiv_research_code

### Dataset Summary

ArtifactAI/arxiv_research_code contains over 21.8GB of source code files referenced strictly in ArXiv papers. The dataset serves as a curated dataset for Code LLMs.

### How to use it

```python
from datasets import load_dataset

# full dataset (21.8GB of data)
ds = load_dataset("ArtifactAI/arxiv_research_code", split="train")

# dataset streaming (will only download the data as needed)
ds = load_dataset("ArtifactAI/arxiv_research_code", streaming=True, split="train")
for sample in iter(ds):
    print(sample["code"])
```

## Dataset Structure

### Data Instances

Each data instance corresponds to one file. The content of the file is in the `code` feature, and other features (`repo`, `file`, etc.) provide some metadata.

### Data Fields

- `repo` (string): code repository name.
- `file` (string): file path in the repository.
- `code` (string): code within the file.
- `file_length` (integer): number of characters in the file.
- `avg_line_length` (float): the average line-length of the file.
- `max_line_length` (integer): the maximum line-length of the file.
- `extension_type` (string): file extension.

### Data Splits

The dataset has no splits and all data is loaded as train split by default.
## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

34,099 active GitHub repository names were extracted from [ArXiv](https://arxiv.org/) papers from its inception through July 21st, 2023, totaling 773G of compressed GitHub repositories. These repositories were then filtered, and the code from each file was extracted into 4.7 million files.

#### Who are the source language producers?

The source (code) language producers are users of GitHub who created unique repositories.

### Personal and Sensitive Information

The released dataset may contain sensitive information such as emails, IP addresses, and API/ssh keys that have previously been published to public repositories on GitHub.

## Additional Information

### Dataset Curators

Matthew Kenney, Artifact AI, matt@artifactai.com

### Citation Information

```
@misc{arxiv_research_code,
    title={arxiv_research_code},
    author={Matthew Kenney},
    year={2023}
}
```

Provided by:

ArtifactAI

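Summary of original information

Dataset overview

Name: arxiv_research_code

Description: The dataset contains over 21.8GB of source code files referenced strictly in ArXiv papers. It serves as a curated dataset for Code LLMs.

Features:

repo (string): code repository name.
file (string): file path in the repository.
code (string): code within the file.
file_length (integer): number of characters in the file.
avg_line_length (float): average line length of the file.
max_line_length (integer): maximum line length of the file.
extension_type (string): file extension.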
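
The three length features can be recomputed from the `code` string, which is handy for sanity-checking rows. The sketch below is an assumption about how they are defined (characters counted with `len`, lines split with `str.splitlines`, newline characters excluded from line lengths); the dataset's actual computation is not documented here, and `file_stats` is a hypothetical helper name.

```python
from typing import Dict, Union


def file_stats(code: str) -> Dict[str, Union[int, float]]:
    """Recompute length metadata for one file's contents.

    Assumption: line lengths exclude the trailing newline character.
    """
    lines = code.splitlines()
    line_lengths = [len(line) for line in lines]
    return {
        # Total number of characters, newlines included.
        "file_length": len(code),
        # Mean characters per line; 0.0 for an empty file.
        "avg_line_length": sum(line_lengths) / len(lines) if lines else 0.0,
        # Longest single line; 0 for an empty file.
        "max_line_length": max(line_lengths, default=0),
    }


if __name__ == "__main__":
    print(file_stats("import os\nprint(os.getcwd())\n"))
```

Small differences from the stored values would most likely come from how newlines or a trailing blank line are counted.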

Data splits:

train: 4,716,175 examples, 63,445,188,751 bytes in total.

Download size: 21,776,760,509 bytes

Dataset size: 63,445,188,751 bytes
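
The byte counts above are easier to reason about as derived figures. The sketch below only does arithmetic on the numbers listed here; reading the download size as the compressed on-disk footprint and the dataset size as the decoded footprint is an assumption about how Hugging Face reports these fields, and `size_summary` is a hypothetical helper.

```python
from typing import Dict

# Figures copied from the dataset metadata above.
DATASET_BYTES = 63_445_188_751
DOWNLOAD_BYTES = 21_776_760_509
NUM_EXAMPLES = 4_716_175


def size_summary(dataset_bytes: int, download_bytes: int, num_examples: int) -> Dict[str, float]:
    """Derive human-readable size figures from the raw byte counts."""
    return {
        # Decimal gigabytes (1 GB = 10**9 bytes).
        "dataset_gb": dataset_bytes / 1e9,
        "download_gb": download_bytes / 1e9,
        # How much larger the decoded data is than the download.
        "expansion_ratio": dataset_bytes / download_bytes,
        # Mean size of one file-level example.
        "avg_bytes_per_example": dataset_bytes / num_examples,
    }


if __name__ == "__main__":
    for key, value in size_summary(DATASET_BYTES, DOWNLOAD_BYTES, NUM_EXAMPLES).items():
        print(f"{key}: {value:.2f}")
```

Under that reading, the "21.8GB" in the description matches the download size, while the decoded data is roughly three times larger, so streaming is worth considering on machines with limited disk space.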

License: bigscience-openrail-m

Task category: text generation

Language: English

Size category: 10B<n<100B

Usage

    from datasets import load_dataset

    # Full dataset
    ds = load_dataset("ArtifactAI/arxiv_research_code", split="train")

    # Streaming (downloads data only as needed)
    ds = load_dataset("ArtifactAI/arxiv_research_code", streaming=True, split="train")
    for sample in iter(ds):
        print(sample["code"])
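
The streaming loop prints every file, which is rarely what you want with 4.7 million examples. A minimal sketch of narrowing the stream by file type follows. It works on any iterable of row dicts, so it can be tried on in-memory rows without downloading anything. `filter_by_extension` is a hypothetical helper, and the assumption that `extension_type` holds values such as `"py"` without a leading dot should be checked against real rows.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]


def filter_by_extension(
    samples: Iterable[Row], extension: str, limit: Optional[int] = None
) -> Iterator[Row]:
    """Lazily yield rows whose `extension_type` equals `extension`.

    Stops after `limit` matches; `limit=None` means no cap.
    """
    matches = (row for row in samples if row["extension_type"] == extension)
    return islice(matches, limit)


if __name__ == "__main__":
    # Made-up stand-in rows with the same keys as the dataset.
    rows = [
        {"repo": "user/a", "file": "train.py", "code": "print(1)", "extension_type": "py"},
        {"repo": "user/a", "file": "run.sh", "code": "echo hi", "extension_type": "sh"},
        {"repo": "user/b", "file": "model.py", "code": "x = 2", "extension_type": "py"},
    ]
    for row in filter_by_extension(rows, "py", limit=1):
        print(row["file"])
```

Because the helper is lazy, passing the streamed `ds` as `samples` should pull only as many rows as are needed to reach `limit`.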

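Dataset creation

Source data:

The names of 34,099 active GitHub repositories were extracted from ArXiv papers; the compressed repositories total 773G.
After filtering, the code from each file was extracted, yielding 4.7 million files.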
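
The collection step starts from repository names found in paper text. The sketch below shows one plausible way to pull `owner/name` pairs out of text with a regular expression. It is an illustration only, not the curator's pipeline, which is not described here; `extract_repo_names` and the pattern are assumptions.

```python
import re
from typing import List

# Matches github.com/<owner>/<name>; the character classes are an approximation.
GITHUB_RE = re.compile(r"github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)", re.IGNORECASE)


def extract_repo_names(text: str) -> List[str]:
    """Return sorted, de-duplicated `owner/name` strings found in `text`."""
    found = set()
    for owner, name in GITHUB_RE.findall(text):
        # Drop sentence punctuation captured at the end of a URL.
        name = name.rstrip(".")
        # Drop the clone-URL suffix if present.
        if name.endswith(".git"):
            name = name[:-4]
        if name:
            found.add(f"{owner}/{name}")
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "Code is available at https://github.com/huggingface/datasets. "
        "We build on github.com/pytorch/pytorch.git"
    )
    print(extract_repo_names(sample))
```

A real pipeline would also need to check that each repository still exists, which is presumably what "active" refers to above.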

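Potentially sensitive information:

The dataset may contain sensitive information such as emails, IP addresses, and API/ssh keys that were previously published in public repositories on GitHub.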
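
Given that warning, it may be worth scanning `code` strings before using them for training. The following is a minimal sketch that flags email addresses and IPv4 addresses with regular expressions. The patterns are simple approximations, `find_sensitive` is a hypothetical helper, and API or ssh keys are deliberately not covered because their formats vary too much for a short example.

```python
import re
from typing import Dict, List

# Approximate patterns; both can produce false positives and false negatives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_sensitive(code: str) -> Dict[str, List[str]]:
    """Return candidate emails and IPv4 addresses found in a code string."""
    emails = EMAIL_RE.findall(code)
    ipv4 = [
        candidate
        for candidate in IPV4_RE.findall(code)
        # Keep only dotted quads whose parts are all valid octets.
        if all(int(part) <= 255 for part in candidate.split("."))
    ]
    return {"emails": emails, "ipv4": ipv4}


if __name__ == "__main__":
    snippet = 'ADMIN = "dev@example.com"  # host 10.0.0.12'
    print(find_sensitive(snippet))
```

Four-part version strings such as `1.2.3.4` also match the IPv4 pattern, so hits are candidates for review rather than confirmed leaks.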
