Ling-Coder-Lite Source Code Data
Source: arXiv. Indexed: 2025-09-30
Download link:
https://huggingface.co/inclusionAI/Ling-Coder-lite
Description:
This dataset contains high-quality source code spanning 618 programming languages, filtered and curated specifically for training large language models (LLMs). It includes raw source code, code-related text, and synthetic question-answer data. All data passed through a standardized processing pipeline to ensure quality and adherence to the inclusion criteria. The dataset totals 110 billion code tokens and is intended to support code generation and code understanding tasks.
Provider:
inclusionAI



