ankushthakurr09/whiteswan_agentic_A1

Hugging Face | Updated 2026-03-25 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ankushthakurr09/whiteswan_agentic_A1
Resource description:
---
language:
- en
license: cc-by-4.0
tags:
- agentic-ai
- tool-use
- function-calling
- gemma-3
- sft
- supervised-fine-tuning
- whiteswan
- multi-turn
- reasoning
size_categories:
- 100K<n<1M
pretty_name: WhiteSwan Agentic SFT Dataset (129k)
---

# 🦢 WhiteSwan Agentic SFT Dataset (129k)

A high-quality, large-scale dataset designed and curated for **Supervised Fine-Tuning (SFT)** of Large Language Models (LLMs) in the 3-9B parameter class. The dataset focuses on giving models **"agentic" capabilities**, including advanced tool use, deep multi-turn reasoning, function calling, and structured code execution.

- **Total Samples:** 129,008
- **Splits:** Train (~119k), Validation (~10k)
- **Format:** Gemma 3 Chat Format (with tool definitions)
- **Shuffle Seed:** `42` (Globally Shuffled for Training)
- **License:** CC-BY-4.0

---

## 🌟 Key Features

- **Quality Control:** Extracted from over 130,000 raw samples with a 0-100 heuristic scoring system that favors deep reasoning and long explanatory passages.
- **Deep Agentic Focus:** Stratified sampling balances web search, code execution, database queries, and raw planning.
- **Gemma 3 Native:** Pre-formatted for the Gemma 3 chat format, including folded system prompts and standard `<tool_result>` wrappers.
- **Zero-Leakage & Highly Deduplicated:** MinHash fingerprinting and semantic matching guarantee an absence of identical or highly similar conversational paths.
- **Anti-Catastrophic Forgetting:** Globally shuffled using a cross-dataset random hash to prevent sequential curriculum bias during training.

---

## 📊 Dataset Composition

<div align="center">
<img src="whiteswan_composition.png" alt="WhiteSwan Agentic Dataset Composition" width="800">
</div>

The `WhiteSwan Agentic SFT Dataset` is a distillation of 7 state-of-the-art open-source datasets. We extracted exactly `129,008` unique conversations drawn from all sources, achieving zero leakage, clean malware scans, and thorough deduplication.

| Source | Samples | Percentage | License | Primary Focus |
|--------|---------|------------|---------|---------------|
| **Agent-FLAN** | `24,617` | 19.1% | Apache-2.0 | Multi-step agent workflows, ReAct paradigms |
| **xLAM-60k** | `24,042` | 18.6% | Apache-2.0 | Advanced function calling & direct API execution |
| **ToolMind** | `20,904` | 16.2% | Apache-2.0 | Tool reasoning & graph-based complex planning |
| **OpenCodeInstruct** | `20,000` | 15.5% | **CC-BY-4.0** | Complex code execution & algorithmic problem solving |
| **OpenThoughts** | `19,923` | 15.4% | Apache-2.0 | Deep Chain-of-Thought (CoT) & mathematical reasoning |
| **OSS-Instruct** | `10,000` | 7.8% | Apache-2.0 | Open-source code instruction & syntax generation |
| **OpenHermes 2.5** | `9,524` | 7.4% | Apache-2.0 | General instruction (prevents catastrophic forgetting) |

---

## 🏗️ Extraction & Processing Pipeline

We built a parallel curation pipeline designed to penalize low-effort data and reward rich, complex multi-turn trajectories.

### 1. Memory-Lazy Processing & Scoring

For parsing multi-gigabyte sources, we used streaming iterators (`ijson`, `PyArrow`).

- **Deep Scoring Heuristics**: Custom algorithms graded each sequence from 0-100, heavily favoring deep reasoning blocks (indicated by `<think>` tags), step-by-step logic, long explanations (`> 2500` characters), and diverse tool use (such as chaining web searches into code execution into DB queries).

### 2. Aggressive Cumulative Deduplication

- **Zero-Leakage Threshold**: We enforced strict cross-dataset uniqueness. Dataset *n* was evaluated against MinHash fingerprints from datasets *1* through *n-1*. Samples with identical fingerprints were excluded to prevent subtle dataset poisoning.
- **Fuzzy Deduplication**: Applied a stringent 0.85 Jaccard similarity threshold for soft matches.
- **Exact Hash Elimination**: MD5 hash matching removed identical conversations.

### 3. Preprocessing & Normalization

Models tend to hallucinate when trained on incorrectly structured data. A global pipeline processed all `130,000` initial extractions to enforce a consistent structure across the `user`, `model`, and `tool` roles:

- **Initial extractions:** 130,000
- **Final dataset:** 129,008 (99.2% retention)

#### Structural Corrections

1. **System Prompt Integration**: Gemma limits the available roles. We folded over 20,500 stray `system` messages into the first `user` turn as a prefix.
2. **Tool Result Formatting**: Standardized `<tool_result>` wrappers were applied to all observation turns to ensure uniform parsing.
3. **Missing Content Purge**: Identified and removed 48,841 empty or `None` generation strings.
4. **Language Filtering**: Used `langdetect` to identify and drop 701 non-English conversations, while preserving inline code blocks to prevent false positives.
5. **Context Constraints**:
   - Purged 476 ultra-short outputs (`<100` chars).
   - Truncated or removed 239 outlier samples exceeding the 32,768-token limit.
6. **Unicode Normalization**: Removed broken invisible control bytes and malformed `<think>` tag variants across the full context window.

### 4. Stratified Sampling

Maintained strict category balances across tool types to ensure robust generalization:

- **Web Search/API Calls**: 35%
- **Code Execution**: 25%
- **Database/File Operations**: 20%
- **Reasoning/Planning**: 20%

### 5. Global Shuffling (Catastrophic Forgetting Prevention)

**Problem:** Samples presented sequentially by source can cause a model to develop curriculum bias and catastrophic forgetting.

**Solution:** All samples were loaded into memory and shuffled across source datasets with a fixed random seed (`Seed = 42`) before the train and validation splits were generated.

---

## 📁 Dataset Structure

### File Partitions (Parquet)

The dataset is partitioned into `13` `.parquet` files (approximately 10,000 samples each) hosted in the `data/` directory to optimize streaming, downloading, and dataset viewer interactions:

- **Train Split:** 12 files (`train-00000-of-00012.parquet` through `train-00011-of-00012.parquet`)
- **Validation Split:** 1 file (`validation-00000-of-00001.parquet`)

Each sample is packaged into a JSON dictionary following the Gemma 3 chat format. Included are fully defined tool schemas (when applicable), original source attribution, and categorical tags.

### Standard Function Calling / Single-Turn Code Pattern

```json
{
  "messages": [
    {
      "role": "user",
      "content": "You are tasked with implementing a Python class named `Employee` that calculates yearly bonuses..."
    },
    {
      "role": "model",
      "content": "```python\nclass Employee:\n    def __init__(self, name, age, salary):\n        self.name = name\n        self.age = age\n        self.salary = salary\n\n    def calculate_bonus(self):\n        return self.salary * 0.1\n```\nThis class encapsulates the requested data attributes and computes a standard 10% bonus."
    }
  ],
  "tools": [
    {
      "name": "get_specific_country",
      "description": "Get country fuel prices required for employee travel reimbursement parameters.",
      "parameters": {
        "type": "object",
        "properties": {
          "country": {
            "type": "string",
            "description": "The ISO country code."
          }
        },
        "required": ["country"]
      }
    }
  ],
  "source": "opencodeinstruct",
  "category": "coding_trajectory"
}
```

### Complex Multi-Turn Tool Execution

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Can you retrieve the total page views for our main website this month?"
    },
    {
      "role": "tool",
      "content": "{\"page_views\": 15000}"
    },
    {
      "role": "model",
      "content": "This month's total page views for your main website are **15,000**. Is there a specific region you want traffic data for?"
    },
    {
      "role": "user",
      "content": "Could you share the site ID and session details for Europe?"
    },
    {
      "role": "model",
      "content": "To retrieve the detailed site information and session mappings for Europe, I'll need the **site ID** and **session ID**."
    }
  ],
  "tools": [
    {
      "name": "get_site_analytics",
      "description": "Fetch website analytics filtered by region or timeframe.",
      "parameters": {
        "type": "object",
        "properties": {
          "region": {"type": "string"}
        }
      }
    }
  ],
  "source": "toolmind",
  "category": "toolmind_graphsyn_backfill"
}
```

---

## 🚀 Quickstart Usage

You can load this dataset using the Hugging Face `datasets` library. Since it's optimized for Gemma 3, you can map the entries directly to standard conversational SFT trainers (e.g., TRL, Axolotl, LLaMA-Factory).

```python
from datasets import load_dataset

# Load the dataset directly from Hugging Face Hub
dataset = load_dataset("ankushthakurr09/whiteswan_agentic_A1")

# Access the train and validation splits
print(dataset['train'])
print(dataset['validation'])

# Inspect a multi-turn sequence from the training set
print(dataset['train'][0]['messages'])

# Access tool schemas
if 'tools' in dataset['train'][0]:
    print(dataset['train'][0]['tools'])
```

---

## ⚖️ Ethics & Limitations

- **Language Bias:** This dataset has been strictly filtered for English. Models fine-tuned entirely on this dataset may experience degradation in multilingual capabilities.
- **Context Length:** Samples exceeding 32,768 tokens have been truncated or removed, making this dataset better suited to mid-context processing than to ultra-long contexts.
- **Safety Constraints:** While sourced from safe, open-source sets, agentic models may still generate executable code or shell scripts that should be run in a sandboxed environment during inference.

---

**Built with ❤️ for the open-source agentic ML community.**
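
The deduplication stage described in the pipeline section can be illustrated with a minimal sketch. This is not the authors' pipeline code: it assumes exact duplicates are detected with an MD5 digest of the canonical JSON of `messages`, and it computes exact Jaccard similarity over word 3-gram shingles rather than a MinHash estimate, which is simpler but slower on large corpora. The function names, the shingle size, and the `>=` comparison against 0.85 are illustrative assumptions.

```python
import hashlib
import json


def exact_key(messages):
    """MD5 digest of the canonical JSON form of a conversation."""
    canonical = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def shingles(messages, n=3):
    """Set of lowercase word n-grams over all message contents (n is an assumption)."""
    words = " ".join(m.get("content") or "" for m in messages).lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(conversations, threshold=0.85):
    """Keep the first occurrence of each conversation.

    A conversation is dropped if its MD5 key was already seen (exact match)
    or if its shingle set reaches the Jaccard threshold against any kept
    conversation (fuzzy match). The pairwise scan is quadratic; a MinHash
    index would replace it at scale.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for messages in conversations:
        key = exact_key(messages)
        if key in seen_hashes:
            continue
        current = shingles(messages)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue
        seen_hashes.add(key)
        kept.append(messages)
        kept_shingles.append(current)
    return kept
```

Calling `deduplicate(list_of_message_lists)` returns the surviving conversations in their original order. Processing the source datasets one after another through the same call mirrors the cumulative scheme in which dataset *n* is checked against datasets *1* through *n-1*.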
Provided by:
ankushthakurr09