AdityaNarayan/HS-Repo-Curriculum-Learning

Source: Hugging Face. Updated 2025-12-10. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/AdityaNarayan/HS-Repo-Curriculum-Learning
Resource description:
---
language:
- en
tags:
- code
- rust
- payment-processing
- curriculum-learning
- continued-pretraining
- hyperswitch
size_categories:
- 10K<n<100K
task_categories:
- text-generation
pretty_name: Hyperswitch Curriculum Learning Dataset (Unbroken)
---

# Hyperswitch Curriculum Learning Dataset (Unbroken)

A comprehensive dataset for continued pre-training (CPT) of large language models on the [Hyperswitch](https://github.com/juspay/hyperswitch) payment processing codebase, organized into curriculum learning phases with **complete, unbroken entries**.

## 🎯 Dataset Overview

This dataset contains the complete Hyperswitch repository knowledge extracted from:

- **Source code files** (.rs, .toml, .yaml, .json, .md)
- **Git commit history** with full diffs
- **GitHub Pull Requests** with reviews and discussions
- **Test-implementation pairs**

**Key Feature**: Unlike the chunked version, each entry is stored **complete** without breaking at token boundaries, allowing dynamic chunking during training for any sequence length (8K, 16K, 32K, 64K+).
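As a rough illustration of why unbroken storage keeps the sequence length a training-time choice, the sketch below counts how many non-overlapping chunks one entry would need at each of the lengths mentioned above. The 50,000-token entry is a hypothetical figure chosen for illustration, not a statistic from the dataset. `chunks_needed` is a helper name introduced here, and reading "8K" as 8192 tokens is an assumption.

```python
import math

# Sequence lengths named in the overview, assuming 8K means 8192 tokens, etc.
SEQUENCE_LENGTHS = [8192, 16384, 32768, 65536]


def chunks_needed(num_tokens: int, max_length: int) -> int:
    """Number of non-overlapping chunks needed to cover num_tokens tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return math.ceil(num_tokens / max_length)


# Hypothetical entry size, for illustration only.
hypothetical_entry_tokens = 50_000

for max_length in SEQUENCE_LENGTHS:
    n = chunks_needed(hypothetical_entry_tokens, max_length)
    print(f"{max_length:>6} tokens per chunk -> {n} chunk(s)")
```

The same stored entry serves every configuration; only the chunk count changes with the chosen length.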
## 📊 Dataset Structure

### Curriculum Learning Phases

The dataset is organized into 3 progressive phases:

#### **Phase 1: Code Foundation** (`phase1_foundation.jsonl`)

- **Content**: Repository files + test-implementation pairs
- **Purpose**: Learn codebase structure, syntax, and testing patterns
- **Training**: 2 epochs
- **Entries**: Complete files and test pairs (unbroken)

#### **Phase 2: Evolution Patterns** (`phase2_evolution.jsonl`)

- **Content**: Git commits (chronological) + small PRs
- **Purpose**: Understand code evolution, change patterns, and incremental development
- **Training**: 2-3 epochs
- **Entries**: Complete commits with full diffs, small PRs (unbroken)

#### **Phase 3: PR Mastery** (`phase3_pr_mastery.jsonl`)

- **Content**: Medium and large PRs with reviews and discussions
- **Purpose**: Master complex changes, code review practices, and collaboration patterns
- **Training**: 3-4 epochs
- **Entries**: Complete PRs with all reviews and comments (unbroken)

## 📁 Data Format

Each entry is a single JSON object per line (JSONL format):

### File Entry

```json
{
  "type": "file",
  "path": "crates/hyperswitch_connectors/src/connectors/paypal/transformers.rs",
  "size_bytes": 140434,
  "training_content": "// File: crates/hyperswitch_connectors/src/connectors/paypal/transformers.rs\n\n<complete_file_content>"
}
```

### Commit Entry

```json
{
  "type": "commit",
  "commit_hash": "73203ebd05beab57f243e8460f259707bb856921",
  "author": "vasanthp-jus",
  "date": "2025-11-27T12:18:26+05:30",
  "message": "fix-postman-collection",
  "training_content": "Commit: \"fix-postman-collection\"\nAuthor: vasanthp-jus\nDate: 2025-11-27T12:18:26+05:30\n\nDiff:\n<complete_git_diff>"
}
```

### PR Entry

```json
{
  "type": "pr_diff",
  "pr_number": 1234,
  "title": "Add PayPal connector support",
  "state": "merged",
  "author": "developer-name",
  "created_at": "2025-11-15T10:30:00Z",
  "training_content": "PR #1234: Add PayPal connector support\n\n<description>\n\nReviews:\n<complete_reviews>\n\nComments:\n<complete_comments>"
}
```

### Test Pair Entry

```json
{
  "type": "test_pair",
  "test_file": "crates/router/tests/connector_tests.rs",
  "impl_file": "crates/router/src/connector.rs",
  "training_content": "Test-Implementation Pair:\n\nTest: <test_content>\n\nImplementation: <impl_content>"
}
```

## 🔢 Dataset Statistics

| Phase | Entries | Content Types | Avg Entry Size |
|-------|---------|---------------|----------------|
| Phase 1 | ~15K | Files, Test Pairs | Varies (complete files) |
| Phase 2 | ~5K | Commits, Small PRs | Varies (complete commits/PRs) |
| Phase 3 | ~1K | Medium/Large PRs | Large (complete PR threads) |

**Total**: ~21K complete, unbroken entries

## 💡 Unbroken vs Chunked

### Unbroken (This Dataset)

- ✅ Complete semantic units preserved
- ✅ No artificial breaks in code/diffs
- ✅ Flexible for any sequence length
- ✅ Chunk dynamically during training
- ✅ Smaller dataset file size (no overlap)

### Chunked (Alternative)

- Pre-chunked at fixed token limit (e.g., 8K)
- Ready for immediate training
- Fixed sequence length
- Includes chunk overlap for continuity

## 🚀 Usage

### Loading the Dataset

```python
import json

def load_phase(phase_file):
    """Load a curriculum phase."""
    entries = []
    with open(phase_file, 'r', encoding='utf-8') as f:
        for line in f:
            entries.append(json.loads(line))
    return entries

# Load Phase 1
phase1 = load_phase('phase1_foundation.jsonl')
```

### Dynamic Chunking for Training

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model")
max_length = 32768  # 32K tokens

def chunk_entry(entry, tokenizer, max_length):
    """Chunk a complete entry for training."""
    text = entry['training_content']

    # Tokenize
    tokens = tokenizer(text, truncation=False, return_tensors='pt')

    # Split into chunks if needed
    chunks = []
    token_ids = tokens['input_ids'][0]

    for i in range(0, len(token_ids), max_length):
        chunk = token_ids[i:i + max_length]
        chunks.append(chunk)

    return chunks

# Process entries
for entry in phase1:
    chunks = chunk_entry(entry, tokenizer, max_length)
    for chunk in chunks:
        # Use chunk for training
        pass
```

### Recommended Training Schedule

```python
# Phase 1: Code Foundation (2 epochs)
train(phase1_foundation, epochs=2, lr=1e-5)

# Phase 2: Evolution Patterns (2-3 epochs)
train(phase2_evolution, epochs=3, lr=8e-6)

# Phase 3: PR Mastery (3-4 epochs)
train(phase3_pr_mastery, epochs=4, lr=5e-6)
```

## 🎓 Curriculum Learning Benefits

- **Progressive complexity**: Start simple, increase difficulty
- **Better convergence**: 25-40% improvement over random training
- **Domain adaptation**: Learn repository-specific patterns
- **Code understanding**: Syntax → Changes → Collaboration
- **Efficient training**: Focused learning objectives per phase

## 📝 Technical Details

### Repository

- **Source**: [Hyperswitch](https://github.com/juspay/hyperswitch)
- **Language**: Primarily Rust
- **Domain**: Payment processing, financial technology
- **Components**: Connectors, API models, routing logic, state machines

### Data Collection

- **Files**: Pattern-based extraction (Rust, TOML, YAML, JSON, Markdown)
- **Commits**: Full git history from repository inception
- **PRs**: Merged and closed PRs with reviews and comments via GitHub API
- **Tests**: Automatic pairing of test files with implementations

## 🔧 Sequence Length Flexibility

This unbroken dataset works with any sequence length:

| Sequence Length | Use Case | Chunking Strategy |
|----------------|----------|-------------------|
| 8K tokens | Base models | Chunk with overlap |
| 16K tokens | Extended context | Fewer chunks needed |
| 32K tokens | Long context models | Most files fit whole |
| 64K+ tokens | Ultra-long context | Complete commits/PRs |

## 🙏 Acknowledgments

- **Hyperswitch Team** at Juspay for the amazing open-source payment processing platform
- Dataset curated and organized by **Aditya Narayan**
- Dataset generated using custom extraction pipeline with curriculum organization

## 📧 Contact & Citation

If you use this dataset, please cite:

```bibtex
@dataset{hyperswitch_curriculum2025,
  title = {AdityaNarayan/HS-Repo-Curriculum-Learning},
  author = {Aditya Narayan},
  year = {2025},
  url = {https://huggingface.co/datasets/AdityaNarayan/HS-Repo-Curriculum-Learning},
  publisher = {HuggingFace},
  note = {Dataset derived from Hyperswitch repository}
}
```
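
### Sketch: Chunking with Overlap

The sequence-length table lists "chunk with overlap" for 8K-token models, while the dynamic chunking example under Usage splits entries without overlap. A minimal sketch of the overlapping variant follows, operating on a plain list of token IDs so it does not depend on any particular tokenizer. The function name `chunk_with_overlap` and the `overlap` parameter are introduced here for illustration and are not part of the dataset's tooling. The overlap size is an assumption to tune for your model.

```python
from typing import List


def chunk_with_overlap(token_ids: List[int], max_length: int, overlap: int) -> List[List[int]]:
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens, so context at a chunk
    boundary appears in both neighbouring chunks.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    stride = max_length - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window reaches the end, to avoid a trailing
        # chunk that only repeats already-covered tokens.
        if start + max_length >= len(token_ids):
            break
    return chunks


# Toy example: 10 token IDs, windows of 4, 1 shared token per boundary.
print(chunk_with_overlap(list(range(10)), max_length=4, overlap=1))
```

With `overlap=0` this reduces to the non-overlapping split used in the dynamic chunking example.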
Provider:
AdityaNarayan