
WithinUsAI/Royal_Ghost_Coder_10M

Hugging Face | Updated 2026-01-21 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/WithinUsAI/Royal_Ghost_Coder_10M
Resource description:
---
pretty_name: Royal Ghost Coder 10M
configs:
- config_name: default
  data_files:
  - split: train
    path: "royal_ghost_titan_data.jsonl"
tags:
- code
- instruction-tuning
- synthetic
- agentic
task_categories:
- text-generation
language:
- en
size_categories:
- 10M<n<100M
---

![Royal Ghost Coder 10M Banner](royal_ghost_coder_10m_banner.png)

# Royal Ghost Coder 10M

A large-scale, **synthetic instruction-tuning** corpus designed to train code-capable, agentic models on **structured “instruction → input → output”** workflows at high volume. The dataset ships as a single JSONL file and is auto-converted to Parquet by Hugging Face for faster streaming.

## Dataset Summary

- **Repository:** `gss1147/Royal_Ghost_Coder_10M`
- **Rows:** 10,000,000 (train split)
- **Primary file:** `royal_ghost_titan_data.jsonl`
- **Format:** JSON Lines (one JSON object per line)
- **Schema:** `id, idx, role, instruction, input, output, score`

## Supported Tasks

- Instruction tuning for code generation / refactoring / debugging patterns
- Lightweight agent-style planning and “tool-like” action phrasing
- Dataset-driven evaluation and filtering via the `score` field

## Data Structure

Each record is a single training example in a common instruction-tuning format.
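Because the primary file is plain JSON Lines, records can be inspected with the Python standard library alone. The sketch below is illustrative and not part of the dataset's own tooling: `iter_records` and `EXPECTED_KEYS` are hypothetical names, the key set is copied from the schema listed in the Dataset Summary, and the sample record is made up. It parses one object per line and rejects any record that lacks a schema key.

```python
import io
import json

# Key set copied from the schema in the Dataset Summary (assumed complete).
EXPECTED_KEYS = {"id", "idx", "role", "instruction", "input", "output", "score"}


def iter_records(lines):
    """Yield one dict per non-empty JSONL line, checking the schema keys."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        yield record


# Tiny in-memory stand-in for royal_ghost_titan_data.jsonl (made-up values).
sample = io.StringIO(
    '{"id": "a", "idx": 0, "role": "r", "instruction": "i", '
    '"input": "", "output": "o", "score": 0.5}\n'
)
records = list(iter_records(sample))
print(records[0]["idx"], records[0]["score"])
```

For the real file, an open text handle can be passed in place of `sample`, since the function only needs an iterable of lines.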
### Fields

- `id` (string): UUID-style identifier
- `idx` (int): Row index
- `role` (string): Persona / role label (e.g., an agent identity)
- `instruction` (string): The task request (prompt)
- `input` (string): Optional context / constraints / scenario text
- `output` (string): The intended completion (often code or code-like text)
- `score` (float): A normalized quality indicator in `[0, 1]` (useful for filtering)

### Example (conceptual)

```json
{
  "id": "6da52f71-a953-4675-862f-2cd8539b55f1",
  "idx": 0,
  "role": "titan_architect",
  "instruction": "Optimize the Quantum_Bridge for singular perfection.",
  "input": "Legacy sector 20 unstable.",
  "output": "def Optimize_Quantum_Bridge_0(self): return self.evolve(entropy=0.2674)",
  "score": 0.788814
}
```

## How to Use

### Loading with 🤗 Datasets

```python
from datasets import load_dataset

ds = load_dataset("gss1147/Royal_Ghost_Coder_10M", split="train")
print(ds[0])
```

### Converting to chat format (optional)

```python
def to_messages(ex):
    # Fold the optional `input` into the user turn as extra context.
    user = ex["instruction"]
    if ex.get("input"):
        user = f"{user}\n\nContext:\n{ex['input']}"
    return {
        "messages": [
            # The `role` field becomes the system persona.
            {"role": "system", "content": f"You are {ex.get('role', 'an expert coding assistant')}."},
            {"role": "user", "content": user},
            {"role": "assistant", "content": ex["output"]},
        ],
        "score": ex.get("score", None),
        "id": ex.get("id", None),
    }

# Drop the original columns so only `messages`, `score`, and `id` remain.
chat_ds = ds.map(to_messages, remove_columns=ds.column_names)
```

### Quality filtering

```python
# Keep rows with no score or with a score of at least 0.85.
filtered = ds.filter(lambda x: x["score"] is None or x["score"] >= 0.85)
```

## Intended Use

This dataset is primarily intended for:

- Training or adapting small-to-mid size models for instruction-following code generation.
- Building “persona + instruction” pipelines where `role` steers responses.
- Large-scale experiments on filtering, curricula, or “quality-aware” fine-tuning via `score`.

## Limitations and Considerations

- **Verification:** The dataset is not a source of verified real-world facts.
  Treat outputs as training text, not ground truth.
- **Safety:** If you deploy a model fine-tuned on this dataset, apply standard safety, security, and evaluation practices.

## License

No explicit license is declared in this dataset card. Before broad redistribution or commercial use, add a license in the YAML front matter (for example: `apache-2.0`, `mit`, or `cc-by-4.0`) consistent with your intended permissions.

## Citation

If you use this dataset in academic work, cite the repository:

```bibtex
@dataset{gss1147_royal_ghost_coder_10m,
  title = {Royal Ghost Coder 10M},
  author = {gss1147},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/gss1147/Royal_Ghost_Coder_10M}}
}
```

![1bd3b27a-02ec-428d-a9de-ed72441ad936](https://cdn-uploads.huggingface.co/production/uploads/6758f77450b6c087c2c281e1/4QBZvscn2HjAM0q9CkJPM.png)
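As a closing illustration of the `score`-based curricula mentioned under Intended Use, the sketch below groups records into score buckets and visits them from lowest to highest. It is one possible approach, not a method the dataset card prescribes: `bucket_by_score` is a hypothetical helper, the 0.85 edge mirrors the threshold in the quality-filtering snippet, the 0.5 edge is an arbitrary assumption, and the demo records are made up.

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_by_score(records, edges=(0.5, 0.85)):
    """Group records into len(edges) + 1 buckets by their `score`.

    Bucket 0 holds scores below edges[0]; the last bucket holds scores
    at or above edges[-1]. Records without a score go under key None.
    """
    buckets = defaultdict(list)
    for rec in records:
        score = rec.get("score")
        key = None if score is None else bisect_right(edges, score)
        buckets[key].append(rec)
    return dict(buckets)


# Made-up records carrying only the fields this sketch needs.
demo = [
    {"idx": 0, "score": 0.30},
    {"idx": 1, "score": 0.85},
    {"idx": 2, "score": 0.90},
    {"idx": 3, "score": None},
]
buckets = bucket_by_score(demo)

# A simple curriculum: visit scored buckets from lowest to highest.
scored_keys = sorted(k for k in buckets if k is not None)
curriculum = [rec["idx"] for key in scored_keys for rec in buckets[key]]
print(curriculum)
```

Unscored records are kept apart under `None` so that the caller decides whether to drop them or schedule them separately.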
Provider:
WithinUsAI