
Ujjwal-Tyagi/Java-Code-Large

Source: Hugging Face | Updated: 2026-03-30 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Ujjwal-Tyagi/Java-Code-Large
Description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- code
- java
size_categories:
- 10M<n<100M
---

**Java-Code-Large**

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than **15 million** Java source files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis. By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.

**1. Introduction**

Large-scale code corpora have become fundamental resources for training and evaluating machine learning models for code-related tasks. While multilingual code datasets exist, there is increasing interest in language-specialized corpora to:

- Improve domain-specific performance
- Reduce cross-language noise
- Enable controlled experimental settings
- Support Java-specific tooling and research

Java-Code-Large addresses this need by providing a dedicated Java-only dataset at substantial scale.

**2. Dataset Composition**

- Programming Language: Java
- File Count: 15M+ Java files
- File Format: .jsonl
- Content Types:
  - Classes
  - Interfaces
  - Enums
  - Methods
  - Annotations
  - JavaDoc comments
  - Exception handling structures
  - Generics and concurrency constructs

The dataset consists of source code extracted from publicly accessible open-source repositories.

**3. Intended Research Applications**

3.1 Pretraining

- Training code foundation models from scratch
- Continued pretraining of existing LLMs
- Java-specialized language modeling

3.2 Fine-Tuning and Adaptation

- Code completion systems
- Automated refactoring tools
- IDE copilots
- Java-specific conversational assistants

3.3 Code Intelligence Tasks

- Code summarization
- Code-to-text generation
- Bug detection
- Vulnerability detection
- Clone detection
- Code similarity modeling
- Static and structural analysis

3.4 Software Engineering Research

- Empirical studies of Java programming patterns
- Tokenization and AST modeling experiments

Thanks to the open-source community for all the guidance and support!
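
Because the files are distributed as .jsonl, a first pass over the data usually means reading one JSON record per line. The following is a minimal sketch of that step, not the dataset's official loader. The card does not document the record schema, so the field name `content` is an assumption, and the helper names `iter_java_sources` and `count_type_keywords` are hypothetical. The keyword count is a rough whitespace-token heuristic, not a Java parser.

```python
import io
import json
from collections import Counter
from typing import Iterable, Iterator

# Assumption: each JSONL record stores the Java source under a "content" key.
# The dataset card does not document the schema; inspect one record and adjust.
CODE_FIELD = "content"


def iter_java_sources(lines: Iterable[str], field: str = CODE_FIELD) -> Iterator[str]:
    """Yield the Java source string from each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        source = record.get(field)
        if isinstance(source, str):
            yield source


def count_type_keywords(sources: Iterable[str]) -> Counter:
    """Roughly count class/interface/enum keywords using whitespace tokens."""
    counts: Counter = Counter()
    for source in sources:
        tokens = source.split()
        for keyword in ("class", "interface", "enum"):
            counts[keyword] += tokens.count(keyword)
    return counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for a .jsonl shard (hypothetical data).
    sample = io.StringIO(
        json.dumps({"content": "public class A { }"}) + "\n"
        + json.dumps({"content": "interface B { }\nenum C { X }"}) + "\n"
    )
    print(count_type_keywords(iter_java_sources(sample)))
```

`iter_java_sources` accepts any iterable of lines, so an open file handle for a downloaded shard can be passed in place of the in-memory sample. Streaming line by line avoids loading a multi-million-record file into memory at once.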
Provider:
Ujjwal-Tyagi