
otavio-lemos/oci-copilot-jr-dataset

Source: Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/otavio-lemos/oci-copilot-jr-dataset
Resource description:
---
annotations_creators:
- no-annotation
language:
- pt
language_creators:
- machine-generated
license:
- mit
multilingual:
- pt
pretty_name: OCI Copilot Jr Dataset
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- oracle-cloud-infrastructure
- oci
- fine-tuning
- copilot
- mlx
- apple-silicon
---

# Dataset Card: OCI Copilot Jr Dataset

## Overview

This dataset contains **13,196 examples** of training data for fine-tuning a Large Language Model to specialize in Oracle Cloud Infrastructure (OCI): the "OCI Copilot Jr".

The dataset was **synthetically generated** using prompt templates with OCI CLI commands and real-world enterprise scenarios in Brazilian Portuguese (PT-BR).

| Split | Examples | Percentage |
|-------|----------|------------|
| Train | 9,897 | 75% |
| Valid | 1,979 | 15% |
| Eval | 1,320 | 10% |
| **Total** | **13,196** | 100% |

## Dataset Structure

### Schema (Chat Format)

```json
{
  "messages": [
    {
      "role": "system",
      "content": "Você é um arquiteto e especialista experiente em OCI focado no domínio de {category}. Forneça orientações técnicas, profundas e definitivas."
    },
    {
      "role": "user",
      "content": "Para o ambiente {environment} do nosso projeto {project}, precisamos realizar: {task}. Quais as melhores estratégias e comandos no OCI considerando a restrição: {restriction}?"
    },
    {
      "role": "assistant",
      "content": "## {task} — OCI Step-by-Step\n\n**Cenário**: {company}, projeto {project}, ambiente {environment}\n\n[detailed technical response with OCI CLI commands, Terraform, and best practices]"
    }
  ]
}
```

### Categories (88 OCI Domains)

| Pillar | Categories |
|--------|------------|
| **Compute** | instances, custom-images, scaling |
| **Container** | instances, OKE |
| **Database** | autonomous, autonomous-json, exadata, exadata-cloud, MySQL, NoSQL, PostgreSQL |
| **DevOps** | artifacts, CI/CD, resource-manager, secrets |
| **FinOps** | cost-optimization, rightsizing, showback-chargeback, storage-tiering |
| **Governance** | audit-readiness, budgets-cost, compartments, compliance, landing-zone, policies-guardrails, resource-discovery, tagging |
| **Load Balancer** | load-balancer |
| **Migration** | aws-database, azure-compute, azure-database, azure-storage, data-transfer, gcp-compute, gcp-database, gcp-storage, onprem-compute, onprem-database, onprem-storage, onprem-vmware |
| **Networking** | connectivity, security, VCN |
| **Observability** | APM, logging, monitoring, stack-monitoring |
| **Platform** | backup-governance, SRE-operations |
| **Security** | cloud-guard, dynamic-groups, encryption, federation, IAM-basics, policies, posture-management, vault-keys, vault-secrets, WAF, zero-trust |
| **Serverless** | api-gateway, functions |
| **Storage** | block, file, object |
| **Terraform** | compute, container, database, devops, load-balancer, networking, observability, provider, security, serverless, state, storage |
| **Troubleshooting** | authentication, compute, connectivity, database, functions, OKE, performance, storage |

## Data Generation Pipeline

```mermaid
flowchart LR
    A["generate_v7_combined.py\n(88 cats × 150 ex)"] --> B["validate_jsonl.py"]
    B --> C["clean_dataset.py"]
    C --> D["dedupe_embedding.py\n(threshold 0.97)"]
    D --> E["build_dataset_fixed.py\n(75/15/10 split)"]
    E --> F["train.jsonl\nvalid.jsonl\neval.jsonl"]
```
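
The validation and split stages of the pipeline could look roughly like the sketch below. It is not the code from `validate_jsonl.py` or `build_dataset_fixed.py`, which this card does not show. The function names `is_valid_record` and `split_dataset`, the fixed seed, and the rule that each record holds exactly one system, user, and assistant message in that order are all assumptions inferred from the schema example.

```python
import json
import random

# Assumed role order, taken from the schema example in this card.
EXPECTED_ROLES = ("system", "user", "assistant")


def is_valid_record(line: str) -> bool:
    """Return True if a JSONL line matches the three-message chat schema."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


def split_dataset(records, seed=42, train=0.75, valid=0.15):
    """Shuffle deterministically, then cut into train/valid/eval lists.

    The eval split receives whatever remains after train and valid.
    The seed value is a hypothetical choice, not taken from the card.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

Applied to 13,196 records, these proportions give 9,897 / 1,979 / 1,320 examples, which matches the split table in the Overview.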
### Generation Process

1. **Template-based generation**: Uses prompt templates with varied:
   - Company names (realistic Brazilian enterprises)
   - Project names
   - Environments (greenfield, brownfield, production, staging)
   - Personas (SRE, Platform Engineer, FinOps Analyst, Architect)
   - Restrictions (budget-limited, no-downtime, rollback-15min, etc.)
   - Regions and compartments

2. **Quality Validation**:
   - JSONL schema validation
   - Content cleaning (removes generic templates, incorrect CLI)
   - Semantic deduplication using embeddings (threshold 0.97)

### Token Statistics

| Metric | Value |
|--------|-------|
| Average tokens/example | 883 |
| Min tokens | 410 |
| Max tokens | 934 |

## Fine-Tuning Results

After fine-tuning **Qwen 2.5 Coder 7B Instruct** (4-bit) with LoRA on this dataset, the model improved on every judged metric:

### External Judge Evaluation (mlx-community/Meta-Llama-3.1-8B-Instruct-4bit) - 200 samples

| Metric | Base Model | Fine-Tuned | Delta |
|--------|------------|------------|-------|
| technical_correctness | 3.00 | 3.73 | **+0.72** |
| depth | 3.06 | 3.82 | **+0.76** |
| structure | 3.50 | 4.63 | **+1.14** |
| hallucination | 3.62 | 4.46 | **+0.84** |
| clarity | 3.20 | 3.98 | **+0.77** |
| **Overall** | **3.27** | **4.12** | **+0.85** |

### Top Gains by Topic

1. **storage/object**: +3.60
2. **troubleshooting/performance**: +3.80
3. **observability/apm**: +3.40
4. **security/dynamic-groups**: +3.40
5. **database/postgresql**: +3.40

### Model Files

| Resource | URL |
|----------|-----|
| **Safetensors** | https://huggingface.co/otavio-lemos/oci-copilot-jr-safetensors |
| **GGUF** | https://huggingface.co/otavio-lemos/oci-copilot-jr-gguf |

## Use and Limitations

### Intended Use

This dataset is designed for:

- Fine-tuning LLMs for Oracle Cloud Infrastructure (OCI) operations
- Training technical assistants specialized in OCI CLI, Terraform, and best practices
- Building domain-specific RAG systems for cloud operations

### Limitations

- **Language**: Only Brazilian Portuguese (PT-BR)
- **Generated data**: Not human-annotated, may contain occasional inaccuracies
- **Knowledge cutoff**: Based on OCI documentation available up to April 2026
- **Scope**: Focus on operational tasks (not development/architecture planning)

## Citation

```bibtex
@dataset{lemos_2026_oci_copilot_jr,
  author = {Otavio Lemos},
  title = {OCI Copilot Jr Dataset},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/otavio-lemos/oci-copilot-jr-dataset}
}
```

## License

MIT License - See [LICENSE](https://github.com/otavio-lemos/olia-2-oci/blob/main/LICENSE)

---

*Dataset generated using MLX-Tune pipeline on Apple Silicon M3 Pro*
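
For readers wondering what the semantic deduplication step (threshold 0.97) in the generation pipeline might involve, here is a minimal sketch. The card does not say which embedding model or similarity measure `dedupe_embedding.py` uses, so the sketch makes several assumptions: embeddings are already computed as plain lists of floats, similarity is cosine, and an example is dropped when it scores at or above the threshold against any example already kept. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe_by_embedding(embeddings, threshold=0.97):
    """Return the indices of examples kept after greedy near-duplicate removal.

    An example is kept only if its similarity to every previously kept
    example is below the threshold (assumed rule, not confirmed by the card).
    """
    kept = []
    for index, vector in enumerate(embeddings):
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(index)
    return kept


# Toy usage: the second vector is nearly parallel to the first and is dropped.
toy_embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = dedupe_by_embedding(toy_embeddings)
```

This greedy pass compares each candidate against all kept examples, so it is quadratic in the worst case. At the roughly 13,200 generated examples implied by the pipeline diagram, a vectorized or approximate nearest-neighbor approach would likely be preferable in practice.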
Provider:
otavio-lemos