
AlgorithmicResearchGroup/arxiv_python_research_code

Source: Hugging Face. Updated 2024-09-04. Indexed 2025-04-08.
Download link:
https://hf-mirror.com/datasets/AlgorithmicResearchGroup/arxiv_python_research_code
Resource description:
---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: file
    dtype: string
  - name: code
    dtype: string
  - name: file_length
    dtype: int64
  - name: avg_line_length
    dtype: float64
  - name: max_line_length
    dtype: int64
  - name: extension_type
    dtype: string
  splits:
  - name: train
    num_bytes: 12984199778
    num_examples: 1415924
  download_size: 4073853616
  dataset_size: 12984199778
license: bigcode-openrail-m
task_categories:
- text-generation
language:
- en
pretty_name: arxiv_python_research_code
size_categories:
- 1B<n<10B
---

# Dataset Card for "ArtifactAI/arxiv_python_research_code"

## Dataset Description

https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv_python_research_code

### Dataset Summary

AlgorithmicResearchGroup/arxiv_python_research_code contains over 4.13GB of source code files referenced strictly in ArXiv papers. The dataset serves as a curated dataset for Code LLMs.

### How to use it

```python
from datasets import load_dataset

# full dataset (4.13GB of data)
ds = load_dataset("AlgorithmicResearchGroup/arxiv_python_research_code", split="train")

# dataset streaming (will only download the data as needed)
ds = load_dataset("AlgorithmicResearchGroup/arxiv_python_research_code", streaming=True, split="train")
for sample in iter(ds):
    print(sample["code"])
```

## Dataset Structure

### Data Instances

Each data instance corresponds to one file. The content of the file is in the `code` feature, and other features (`repo`, `file`, etc.) provide some metadata.

### Data Fields

- `repo` (string): code repository name.
- `file` (string): file path in the repository.
- `code` (string): code within the file.
- `file_length` (integer): number of characters in the file.
- `avg_line_length` (float): the average line length of the file.
- `max_line_length` (integer): the maximum line length of the file.
- `extension_type` (string): file extension.

### Data Splits

The dataset has no splits and all data is loaded as train split by default.
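The length-related fields listed under Data Fields can be approximated directly from the `code` string. The following is a minimal sketch, not the curators' actual pipeline: it assumes lengths are counted in characters and that lines are split on newline characters. The helper names `file_stats` and `is_reasonable` and the threshold values are hypothetical and chosen only for illustration.

```python
def file_stats(code: str) -> dict:
    """Recompute length metadata from a `code` string.

    Assumption: lengths are measured in characters and lines are
    separated by newlines; the card does not spell out the exact rule.
    """
    lines = code.splitlines() or [""]  # treat an empty file as one empty line
    line_lengths = [len(line) for line in lines]
    return {
        "file_length": len(code),
        "avg_line_length": sum(line_lengths) / len(line_lengths),
        "max_line_length": max(line_lengths),
    }


def is_reasonable(sample: dict, max_line_limit: int = 1000,
                  avg_line_limit: float = 100.0) -> bool:
    """Hypothetical filter: keep files whose lines are not extremely long.

    The thresholds are illustrative defaults, not values from the card.
    """
    return (sample["max_line_length"] <= max_line_limit
            and sample["avg_line_length"] <= avg_line_limit)


if __name__ == "__main__":
    stats = file_stats("import os\nprint(os.getcwd())\n")
    print(stats)
    print(is_reasonable(stats))
```

A predicate like `is_reasonable` could be applied to each streamed sample to skip minified or machine-generated files before training.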

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

The names of 34,099 active GitHub repositories were extracted from [ArXiv](https://arxiv.org/) papers published from ArXiv's inception through July 21st, 2023, totaling 773 GB of compressed GitHub repositories.

These repositories were then filtered, and the code from each file with a '.py' extension was extracted, yielding 1.4 million files.

#### Who are the source language producers?

The source (code) language producers are GitHub users who created unique repositories.

### Personal and Sensitive Information

The released dataset may contain sensitive information such as emails, IP addresses, and API/ssh keys that have previously been published to public repositories on GitHub.

## Additional Information

### Dataset Curators

Matthew Kenney, AlgorithmicResearchGroup, matt@algorithmicresearchgroup.com

### Citation Information

```
@misc{arxiv_python_research_code,
    title={arxiv_python_research_code},
    author={Matthew Kenney},
    year={2023}
}
```
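
Because the card notes under Personal and Sensitive Information that files may contain emails, IP addresses, and keys, users may want to screen the `code` field before training. The following is a rough sketch under stated assumptions: the regular expressions are simple heuristics that will miss many cases and produce false positives, and the names `find_sensitive` and `redact` and the placeholder tokens are hypothetical. It is not a method described by the dataset curators.

```python
import re

# Heuristic patterns only; real secret scanning needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PRIVATE_KEY_RE = re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")


def find_sensitive(code: str) -> dict:
    """Return candidate sensitive strings found in a source file."""
    return {
        "emails": EMAIL_RE.findall(code),
        "ipv4": IPV4_RE.findall(code),
        "private_keys": PRIVATE_KEY_RE.findall(code),
    }


def redact(code: str) -> str:
    """Replace matched emails and IPv4 addresses with placeholder tokens."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    code = IPV4_RE.sub("<IP_ADDRESS>", code)
    return code


if __name__ == "__main__":
    snippet = 'contact = "jane@example.com"\nhost = "192.168.0.1"\n'
    print(find_sensitive(snippet))
    print(redact(snippet))
```

Files that contain a private-key header are probably better dropped than redacted, since the key body spans several lines that these patterns do not cover.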
Provider:
AlgorithmicResearchGroup