AlgorithmicResearchGroup/arxiv_research_code
Source: Hugging Face. Updated: 2024-09-04. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/AlgorithmicResearchGroup/arxiv_research_code
Resource description:
ArtifactAI/arxiv_research_code contains over 21.8GB of source code files referenced strictly in ArXiv papers and serves as a curated dataset for Code LLMs. Each data instance corresponds to one file: the file content is in the `code` feature, and the other features (`repo`, `file`, etc.) provide metadata, namely the repository name, file path, file length, average line length, maximum line length, and file extension type. The dataset has no splits; all data is loaded as the train split by default. The dataset was created by extracting GitHub repository names from ArXiv papers, then filtering and extracting the code files. It may contain sensitive information such as emails, IP addresses, and API/SSH keys.
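As a rough illustration of the per-file metadata and the sensitive-information caveat described above, the following sketch computes line statistics for one record and flags email addresses and IPv4-like strings in its `code` field. Only the feature names `repo`, `file`, and `code` come from the description; the sample record, the output keys (`length`, `avg_line_length`, `max_line_length`), the helper names, and the regular expressions are assumptions made for this example, and the dataset's own definitions of these statistics may differ. The patterns are deliberately simple and do not detect API or SSH keys.

```python
import re

# Simple illustrative patterns (assumptions, not the dataset's own filters).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def line_stats(code):
    """Compute file length and line-length statistics for a code string.

    The key names are hypothetical; the dataset may name or define
    these features differently.
    """
    lengths = [len(line) for line in code.splitlines()]
    return {
        "length": len(code),  # total characters, newlines included
        "avg_line_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_line_length": max(lengths, default=0),
    }


def find_sensitive(code):
    """Return email addresses and IPv4-like strings found in a code string."""
    return {
        "emails": EMAIL_RE.findall(code),
        "ipv4": IPV4_RE.findall(code),
    }


# Hypothetical record shaped like one dataset instance (one file per record).
record = {
    "repo": "example-user/example-repo",
    "file": "src/train.py",
    "code": "import os\nHOST = '10.0.0.1'\n# contact: dev@example.com\n",
}

stats = line_stats(record["code"])
flags = find_sensitive(record["code"])
print(record["repo"], record["file"], stats, flags)
```

In practice the records would presumably come from the Hugging Face `datasets` library (for example `load_dataset(...)` with `split="train"`, since all data is loaded as the train split); the hard-coded record here only keeps the sketch runnable offline.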
Provided by:
AlgorithmicResearchGroup



