SJY23/PiKa-SFT-30k
收藏Hugging Face2026-04-09 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/SJY23/PiKa-SFT-30k
下载链接
链接失效反馈官方服务:
资源简介:
---
pretty_name: PiKa Dataset
language:
- en
size_categories:
- 10K<n<100K
tags:
- synthetic
- alignment
- post-training
- sft
- llm
task_categories:
- text-generation
configs:
- config_name: default
data_files:
- split: train
path: PiKa-SFT-30k.json
---
# PiKa Dataset
Official dataset for:
**PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch**
PiKa is a 30K GPT-4o-generated expert-level dataset for post-training alignment.
## Data Format
Each example contains:
- `instruction`
- `chosen`
## Results
### Table 1
Prompt difficulty comparison on AlpacaEval 2. We compare PiKa variants with different difficulty levels and show that the expert setting delivers the strongest alignment performance.
| Dataset | Difficulty | AlpacaEval 2 LC (%) | WR (%) |
| --- | ---: | ---: | ---: |
| MAGPIE-Pro | 2.65 | 15.42 | 16.89 |
| PiKa-Series (10K Subset), w/o Persona-Guide | 3.11 | 13.84 | 15.53 |
| PiKa-Series (10K Subset), Low-Diff | 2.91 | 21.86 | 14.95 |
| PiKa-Series (10K Subset), Mid-Diff | 3.64 | 24.36 | 17.84 |
| **PiKa-Series (10K Subset), Expert (Default)** | **7.39** | **31.01** | **30.32** |
### Table 2
Performance comparison of instruction-tuned models based on Llama-3-8B-Base using PiKa-generated versus baseline datasets. PiKa achieves superior performance while requiring 10x less training data than state-of-the-art MAGPIE methods.
| Alignment Setup (Base LLM = Llama-3-8B-Base) | #Convs | AlpacaEval 2 LC (%) | Arena-Hard WR (%) |
| --- | ---: | ---: | ---: |
| Llama-3-8B-Instruct (Official) | >10M | 28.36 | 24.5 |
| Self-Instruct (Llama-3) (Wang et al., 2023) | 100K | 8.86 | 3.3 |
| ShareGPT (Chiang et al., 2023) | 112K | 6.98 | 6.9 |
| Ultrachat (Ding et al., 2023) | 208K | 6.70 | 3.6 |
| OpenHermes 1 (Teknium, 2023a) | 243K | 8.69 | 5.3 |
| Tulu V2 Mix (Ivison et al., 2023) | 326K | 10.95 | 6.3 |
| WildChat (Zhao et al., 2024) | 652K | 14.75 | 11.7 |
| OpenHermes 2.5 (Teknium, 2023b) | 1M | 12.40 | 7.7 |
| MAGPIE-Air-300K-Filtered (Xu et al., 2025) | 300K | 25.24 | 20.7 |
| MAGPIE-Pro-300K-Filtered (Xu et al., 2025) | 300K | 24.06 | 23.9 |
| **PiKa (Ours)** | **30K** | **32.82** | **33.5** |
### Table 3
Performance comparison on additional downstream objective tasks from the Open LLM Leaderboard. The goal of this evaluation is to assess whether alignment with PiKa preserves performance on objective tasks rather than optimizing only for alignment benchmarks. All models are supervised fine-tuned on Llama-3-8B-Base. Numbers in parentheses indicate the number of few-shot examples.
| Alignment Setup | MMLU (5) | ARC (25) | HellaSwag (10) | TruthfulQA (0) | WinoGrande (5) | GSM8K (5) | Average |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Llama-3-8B-Instruct | 67.82 | 61.52 | 78.67 | 52.47 | 72.14 | 71.72 | 67.39 |
| ShareGPT | 66.03 | 58.45 | 81.50 | 52.34 | 74.03 | 48.67 | 63.50 |
| OpenHermes 1 | 65.42 | 62.29 | 82.15 | 50.85 | 75.61 | 47.16 | 63.58 |
| OpenHermes 2.5 | 65.70 | 61.86 | 82.53 | 51.35 | 76.09 | 67.02 | 67.09 |
| Tulu V2 Mix | 66.34 | 59.22 | 82.80 | 47.99 | 76.16 | 58.07 | 65.10 |
| WildChat | 65.95 | 59.22 | 81.39 | 53.18 | 75.30 | 48.75 | 63.97 |
| UltraChat | 65.23 | 62.12 | 81.68 | 52.76 | 75.53 | 50.57 | 64.65 |
| MAGPIE-Air-300K-Filtered | 64.45 | 61.01 | 79.90 | 53.48 | 72.38 | 52.24 | 63.58 |
| MAGPIE-Pro-300K-Filtered | 64.25 | 60.41 | 80.52 | 52.46 | 73.32 | 47.92 | 63.15 |
| PiKa | 62.85 | 59.98 | 80.02 | 52.48 | 73.01 | 52.84 | 63.53 |
## Citation
If you use this dataset, please cite our paper:
```bibtex
@misc{yin2025pikaexpertlevelsyntheticdatasets,
title={PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch},
author={Shangjian Yin and Shining Liang and Wenbiao Ding and Yuli Qian and Zhouxing Shi and Hongzhi Li and Yutao Xie},
year={2025},
eprint={2510.06670},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.06670},
}
```
提供机构:
SJY23



