ankushthakurr09/whiteswan_agentic_A1
Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ankushthakurr09/whiteswan_agentic_A1
Resource description:
---
language:
- en
license: cc-by-4.0
tags:
- agentic-ai
- tool-use
- function-calling
- gemma-3
- sft
- supervised-fine-tuning
- whiteswan
- multi-turn
- reasoning
size_categories:
- 100K<n<1M
pretty_name: WhiteSwan Agentic SFT Dataset (129k)
---
# 🦢 WhiteSwan Agentic SFT Dataset (129k)
A large-scale, quality-filtered dataset for **Supervised Fine-Tuning (SFT)** of Large Language Models (LLMs) in the 3-9B parameter class. It focuses on **"agentic" capabilities**: tool use, multi-turn reasoning, function calling, and structured code execution.
**Total Samples:** 129,008
**Splits:** Train (~119k), Validation (~10k)
**Format:** Gemma 3 Chat Format (with tool definitions)
**Shuffle Seed:** `42` (Globally Shuffled for Training)
**License:** CC-BY-4.0
---
## 🌟 Key Features
- **Pristine Quality Control:** Selected from over 130,000 raw samples using a 0-100 heuristic score that favors deep reasoning and long explanatory responses.
- **Deep Agentic Focus:** Stratified sampling balances web search, code execution, database queries, and planning.
- **Gemma 3 Native:** Pre-formatted for the Gemma 3 chat format, with system prompts folded into the first user turn and standard `<tool_result>` wrappers.
- **Zero-Leakage & Highly Deduplicated:** MinHash fingerprinting and semantic matching remove identical and highly similar conversations.
- **Anti-Catastrophic Forgetting:** Globally shuffled across source datasets with a fixed seed to prevent sequential curriculum bias during training.
---
## 📊 Dataset Composition
<div align="center">
<img src="whiteswan_composition.png" alt="WhiteSwan Agentic Dataset Composition" width="800">
</div>
The `WhiteSwan Agentic SFT Dataset` is distilled from 7 state-of-the-art open-source datasets. The final set contains `129,008` unique conversations, mapped to a uniform format across all sources, with zero cross-dataset leakage, clean malware scans, and thorough deduplication.
| Source | Samples | Percentage | License | Primary Focus |
|--------|---------|------------|---------|---------------|
| **Agent-FLAN** | `24,617` | 19.1% | Apache-2.0 | Multi-step agent workflows, ReAct paradigms |
| **xLAM-60k** | `24,042` | 18.6% | Apache-2.0 | Advanced function calling & direct API execution |
| **ToolMind** | `20,904` | 16.2% | Apache-2.0 | Tool reasoning & graph-based complex planning |
| **OpenCodeInstruct** | `20,000` | 15.5% | **CC-BY-4.0** | Complex code execution & algorithmic problem solving |
| **OpenThoughts** | `19,923` | 15.4% | Apache-2.0 | Deep Chain-of-Thought (CoT) & mathematical reasoning |
| **OSS-Instruct** | `10,000` | 7.8% | Apache-2.0 | Open-source code instruction & syntax generation |
| **OpenHermes 2.5** | `9,524` | 7.4% | Apache-2.0 | General instruction (prevents catastrophic forgetting) |
---
## 🏗️ Extraction & Processing Pipeline
We built a parallel curation pipeline that penalizes low-effort data and rewards rich, complex multi-turn trajectories.
### 1. Memory-Lazy Processing & Scoring
To parse multi-gigabyte sources without loading them fully into memory, we used streaming iterators (`ijson`, `PyArrow`).
- **Deep Scoring Heuristics**: Custom heuristics scored each conversation from 0 to 100, favoring reasoning blocks (marked by `<think>` tags), step-by-step logic, long explanatory responses (`> 2500` characters), and diverse tool use (for example, chaining a web search into code execution into a database query).
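The exact scoring rules are not published in this card, so the following is only a minimal sketch of what a 0-100 heuristic along these lines could look like. The function name `score_conversation`, the `tool_names_used` argument, and all point weights are assumptions made for illustration; only the `<think>` tag signal and the 2,500-character threshold come from the description above.

```python
def score_conversation(messages, tool_names_used=()):
    """Hypothetical 0-100 quality score; the weights are illustrative assumptions."""
    model_text = "".join(m["content"] for m in messages if m["role"] == "model")
    score = 0
    # Reasoning blocks marked with <think> tags (signal named in the card).
    if "<think>" in model_text:
        score += 40
    # Long explanatory responses (> 2500 characters, threshold from the card).
    if len(model_text) > 2500:
        score += 30
    # Tool diversity: 10 points per distinct tool, capped at 30 (assumed weights).
    score += min(len(set(tool_names_used)) * 10, 30)
    return min(score, 100)
```

A real scorer would likely weigh more signals; the point here is only that each criterion contributes a bounded number of points to a capped total.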
### 2. Aggressive Cumulative Deduplication
- **Zero-Leakage Threshold**: We enforced strict cross-dataset uniqueness. Dataset *n* was checked against the MinHash fingerprints of datasets *1* through *n-1*, and any sample with a matching fingerprint was dropped so that no conversation leaks from one source into another.
- **Fuzzy Deduplication**: Near-duplicates were removed at a Jaccard similarity threshold of 0.85.
- **Exact Hash Elimination**: MD5 hash matching removed identical conversations.
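The three checks above can be sketched with the standard library alone. This is not the authors' pipeline: the shingle size, the number of hash functions (`NUM_PERM`), the use of salted BLAKE2b as the hash family, and the function names are all assumptions. Only the 0.85 threshold and the MD5 exact match come from the card.

```python
import hashlib

NUM_PERM = 64      # number of hash functions (assumed)
THRESHOLD = 0.85   # Jaccard threshold stated in the card

def shingles(text, k=5):
    """Character k-grams of lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash(text):
    """MinHash signature: for each salt, the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(texts):
    """Keep first occurrences; drop exact (MD5) and near (MinHash) duplicates."""
    seen_md5, kept_sigs, kept = set(), [], []
    for text in texts:
        digest = hashlib.md5(text.encode()).hexdigest()
        if digest in seen_md5:
            continue
        sig = minhash(text)
        if any(estimated_jaccard(sig, other) >= THRESHOLD for other in kept_sigs):
            continue
        seen_md5.add(digest)
        kept_sigs.append(sig)
        kept.append(text)
    return kept
```

Processing the source datasets in order through one `deduplicate` pass gives the cumulative behaviour described above, since each new sample is compared against everything already kept. The pairwise comparison is quadratic; at this dataset's scale a locality-sensitive hashing index would be the usual replacement.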
### 3. Preprocessing & Normalization
Inconsistently structured training data can lead a model to produce malformed output. A global normalization pass over all `130,000` initial extractions enforced a consistent structure across the `user`, `model`, and `tool` roles:
- **Initial extractions:** 130,000
- **Final dataset:** 129,008 (99.2% retention)
#### Structural Corrections
1. **System Prompt Integration**: Gemma's chat format has no dedicated `system` role, so we folded over 20,500 `system` messages into the first `user` turn as a prefix.
2. **Tool Result Formatting**: All observation turns use standardized `<tool_result>` wrappers to ensure uniform parsing.
3. **Missing Content Purge**: Removed 48,841 empty or `None` generation strings.
4. **Language Filtering**: `langdetect` filtering removed 701 non-English conversations while preserving inline code blocks to prevent false positives.
5. **Context Constraints**:
   - Removed 476 ultra-short outputs (`<100` chars).
   - Truncated or removed 239 outlier samples exceeding the 32,768-token limit.
6. **Unicode Normalization**: Removed broken invisible control characters and normalized malformed `<think>` tag variants.
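As an illustration of corrections 1 and 2, here is a minimal sketch of folding system prompts and wrapping tool observations. The function name `normalize_messages`, the blank-line separator, and the exact newline layout inside the `<tool_result>` wrapper are assumptions; the card does not specify them.

```python
def normalize_messages(messages):
    """Fold system prompts into the first user turn and wrap tool outputs."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    normalized = []
    for m in messages:
        if m["role"] == "system":
            continue  # re-attached to the first user turn below
        content = m["content"]
        # Wrap tool observations once; already-wrapped content is left alone.
        if m["role"] == "tool" and not content.startswith("<tool_result>"):
            content = f"<tool_result>\n{content}\n</tool_result>"
        normalized.append({"role": m["role"], "content": content})
    if system_parts:
        prefix = "\n\n".join(system_parts)
        # Note: if there is no user turn at all, the system text is dropped.
        for m in normalized:
            if m["role"] == "user":
                m["content"] = f"{prefix}\n\n{m['content']}"
                break
    return normalized
```

The function returns new dictionaries rather than mutating its input, so the raw extraction can be kept for auditing.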
### 4. Stratified Sampling
We maintained fixed category proportions across tool types to support generalization:
- **Web Search/API Calls**: 35%
- **Code Execution**: 25%
- **Database/File Operations**: 20%
- **Reasoning/Planning**: 20%
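One way to hit these proportions is per-category quota sampling, sketched below. The `tool_type` field and the category keys are hypothetical labels introduced for this example (the published records carry a `category` field with different values), and `stratified_sample` is not a function from the authors' pipeline.

```python
import random
from collections import defaultdict

# Proportions stated in the card; the keys are assumed labels.
TARGET_MIX = {
    "web_search_api": 0.35,
    "code_execution": 0.25,
    "database_file": 0.20,
    "reasoning_planning": 0.20,
}

def stratified_sample(samples, total, seed=42):
    """Draw about `total` samples so each tool type matches TARGET_MIX."""
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["tool_type"]].append(sample)
    rng = random.Random(seed)
    picked = []
    for tool_type, share in TARGET_MIX.items():
        quota = round(total * share)
        pool = by_type[tool_type]
        # If a category is short of its quota, take everything it has.
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked
```

When a category cannot fill its quota, the result is smaller than `total`; redistributing the shortfall to other categories is a design choice this sketch leaves out.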
### 5. Global Shuffling (Catastrophic Forgetting Prevention)
**Problem:** Training on samples ordered by source dataset exposes the model to one source at a time, which introduces curriculum bias and can cause catastrophic forgetting of earlier sources.
**Solution:** All samples were loaded into memory and shuffled across source datasets with a fixed random seed (`Seed = 42`) before the train and validation splits were generated.
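A seeded shuffle followed by a split can be sketched as follows. This uses Python's `random.Random` rather than whatever hash-based layout the authors used, and `shuffle_and_split` and `validation_size` are names chosen for this example.

```python
import random

def shuffle_and_split(samples, validation_size, seed=42):
    """Shuffle all sources together, then carve off a validation split."""
    shuffled = list(samples)  # copy so the caller's order is untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[validation_size:]
    validation = shuffled[:validation_size]
    return train, validation
```

Shuffling before splitting means both splits draw from every source, and the fixed seed makes the split reproducible.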
---
## 📁 Dataset Structure
### File Partitions (Parquet)
The dataset is partitioned into `13` `.parquet` files (approximately 10,000 samples each) hosted in the `data/` directory to optimize streaming, downloading, and dataset viewer interactions:
- **Train Split:** 12 files (`train-00000-of-00012.parquet` through `train-00011-of-00012.parquet`)
- **Validation Split:** 1 file (`validation-00000-of-00001.parquet`)
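Because the shard names follow a fixed pattern, a subset of files can be selected by name. The helper below is a small sketch that rebuilds the paths listed above; `shard_names` is a name invented for this example, and the commented `load_dataset` call assumes network access to the Hub.

```python
def shard_names(split, num_shards):
    """Rebuild the Parquet shard paths for one split, as listed above."""
    return [
        f"data/{split}-{i:05d}-of-{num_shards:05d}.parquet"
        for i in range(num_shards)
    ]

data_files = {
    "train": shard_names("train", 12),
    "validation": shard_names("validation", 1),
}

# Example: load only the first two training shards (requires network access).
# from datasets import load_dataset
# subset = load_dataset(
#     "ankushthakurr09/whiteswan_agentic_A1",
#     data_files={"train": data_files["train"][:2]},
# )
```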
Each sample is packaged into a JSON dictionary following the precise Gemma 3 chat format. Included are fully defined tool schemas (when applicable), original source attribution, and categorical tags.
### Standard Function Calling / Single-Turn Code Pattern
```json
{
  "messages": [
    {
      "role": "user",
      "content": "You are tasked with implementing a Python class named `Employee` that calculates yearly bonuses..."
    },
    {
      "role": "model",
      "content": "```python\nclass Employee:\n    def __init__(self, name, age, salary):\n        self.name = name\n        self.age = age\n        self.salary = salary\n\n    def calculate_bonus(self):\n        return self.salary * 0.1\n```\nThis class encapsulates the requested data attributes and computes a standard 10% bonus."
    }
  ],
  "tools": [
    {
      "name": "get_specific_country",
      "description": "Get country fuel prices required for employee travel reimbursement parameters.",
      "parameters": {
        "type": "object",
        "properties": {
          "country": {
            "type": "string",
            "description": "The ISO country code."
          }
        },
        "required": ["country"]
      }
    }
  ],
  "source": "opencodeinstruct",
  "category": "coding_trajectory"
}
```
### Complex Multi-Turn Tool Execution
```json
{
  "messages": [
    {
      "role": "user",
      "content": "Can you retrieve the total page views for our main website this month?"
    },
    {
      "role": "tool",
      "content": "{\"page_views\": 15000}"
    },
    {
      "role": "model",
      "content": "This month's total page views for your main website are **15,000**. Is there a specific region you want traffic data for?"
    },
    {
      "role": "user",
      "content": "Could you share the site ID and session details for Europe?"
    },
    {
      "role": "model",
      "content": "To retrieve the detailed site information and session mappings for Europe, I'll need the **site ID** and **session ID**."
    }
  ],
  "tools": [
    {
      "name": "get_site_analytics",
      "description": "Fetch website analytics filtered by region or timeframe.",
      "parameters": {
        "type": "object",
        "properties": {
          "region": {"type": "string"}
        }
      }
    }
  ],
  "source": "toolmind",
  "category": "toolmind_graphsyn_backfill"
}
```
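Before training, it can be worth checking that each record matches the shape shown in the two examples above. The validator below is a sketch under that assumption: `validate_sample` and `ALLOWED_ROLES` are names introduced here, and the rules (three allowed roles, non-empty string content, a leading `user` turn, `source` and `category` present) are inferred from the examples rather than taken from a published schema.

```python
ALLOWED_ROLES = {"user", "model", "tool"}

def validate_sample(sample):
    """Return a list of problems found in one record (empty list = looks fine)."""
    problems = []
    messages = sample.get("messages") or []
    if not messages:
        problems.append("no messages")
    for i, m in enumerate(messages):
        if m.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {m.get('role')!r}")
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    if messages and messages[0].get("role") != "user":
        problems.append("first message is not a user turn")
    for key in ("source", "category"):
        if not sample.get(key):
            problems.append(f"missing {key}")
    return problems
```

Returning a list of messages instead of raising on the first failure makes it easy to tally the kinds of problems across a whole split.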
---
## 🚀 Quickstart Usage
You can easily load this dataset using the Hugging Face `datasets` library. Since it's optimized for Gemma 3, you can map the entries directly to standard conversational SFT trainers (e.g., TRL, Axolotl, LLaMA-Factory).
```python
from datasets import load_dataset

# Load the dataset directly from Hugging Face Hub
dataset = load_dataset("ankushthakurr09/whiteswan_agentic_A1")

# Access the train and validation splits
print(dataset['train'])
print(dataset['validation'])

# Inspect a multi-turn sequence from the training set
print(dataset['train'][0]['messages'])

# Access tool schemas
if 'tools' in dataset['train'][0]:
    print(dataset['train'][0]['tools'])
```
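The records use Gemma's `model` role name, while some trainers and chat templates expect `assistant`. Whether a rename is needed depends on the trainer and template you use, so treat the following as an optional sketch; `ROLE_MAP` and `to_assistant_roles` are names introduced for this example.

```python
# Assumed target naming; adjust or drop this mapping for your trainer.
ROLE_MAP = {"model": "assistant"}

def to_assistant_roles(example):
    """Rename the `model` role to `assistant`, leaving other roles as-is."""
    return {
        "messages": [
            {"role": ROLE_MAP.get(m["role"], m["role"]), "content": m["content"]}
            for m in example["messages"]
        ]
    }

# Usage with the loaded dataset (requires the `dataset` object from above):
# dataset = dataset.map(to_assistant_roles)
```

The function returns only the `messages` column, so `Dataset.map` overwrites that column and leaves `tools`, `source`, and `category` unchanged.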
---
## ⚖️ Ethics & Limitations
- **Language Bias:** This dataset has been strictly filtered for English. Models fine-tuned entirely on this dataset may experience degradation in multilingual capabilities.
- **Context Length:** Samples exceeding 32,768 tokens have been truncated or removed, making this dataset better suited to mid-length contexts than to ultra-long context training.
- **Safety Constraints:** Although the data is sourced from safe, open-source sets, models fine-tuned on it may still generate executable code or shell scripts, which should be run in a sandboxed environment during inference.
---
**Built with ❤️ for the open-source agentic ML community.**
Provider:
ankushthakurr09



