merileijona/quantum-circuits-21k
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/merileijona/quantum-circuits-21k
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: augmented_circuits_v2.json
- config_name: augmented_v2
  data_files:
  - split: train
    path: augmented_circuits_v2.json
- config_name: master_v2
  data_files:
  - split: train
    path: master_circuits_v2_final.json
license: mit
task_categories:
- text-generation
language:
- en
tags:
- quantum-computing
- qasm
- openqasm
- quantum-circuits
- synthetic
- code-generation
- quantum-machine-learning
size_categories:
- 10K<n<100K
---
# Quantum Circuits Dataset — v2 (21K)
A synthetic dataset of validated natural language → OpenQASM 2.0 circuit pairs for training quantum circuit generation models. **To our knowledge, this is the largest publicly available dataset of validated NL→QASM pairs designed specifically for generative model training.**
Used to train the [QuantumGPT-124M](https://huggingface.co/merileijona/quantumgpt-124m) model series.
---
## Quick Start
```python
from datasets import load_dataset
# v2 training set (21K samples, recommended)
ds = load_dataset("merileijona/quantum-circuits-21k")
# or equivalently:
ds = load_dataset("merileijona/quantum-circuits-21k", "augmented_v2")
# v2 base circuits only (1,928 unique circuits)
ds = load_dataset("merileijona/quantum-circuits-21k", "master_v2")
# v1 original dataset (8K samples) — separate repo
ds = load_dataset("merileijona/quantum-circuits-8k")
```
---
## Dataset Configs
| Config | File | Samples | Description |
|---|---|---|---|
| `default` / `augmented_v2` | `augmented_circuits_v2.json` | 21,208 | **v2 training corpus** — use this for training |
| `master_v2` | `master_circuits_v2_final.json` | 1,928 | v2 unique base circuits with full metadata |
---
## Dataset Versions
### v2 — quantum-circuits-21k (February 2026)
**21,208 samples | 1,928 unique circuits | 92 categories | 100% QASM-valid**
Expanded from v1 through:
- Increased variant counts per category from 5–10 to 20–40, deepening coverage across 16 subcategory families:
- **Quantum simulation** — Ising model, Heisenberg chains, Trotterised evolution
- **Hardware-native gate sets** — IBM (SX-RZ-CNOT), Rigetti (Rx-Rz-CZ), IonQ (native XX)
- **Noise mitigation** — Pauli twirling, dynamical decoupling, zero-noise extrapolation
- **Clifford+T circuits** — T-count optimisation, Clifford group circuits
- **Connectivity-aware** — SWAP networks, heavy-hex routing, bridge gates
- **Large-qubit algorithms** — 5–6 qubit QFT, Grover, phase estimation
- **Quantum walks, amplitude amplification, block encoding**
Generation used a hardened system prompt with an explicit `qelib1.inc` gate allowlist, chunked batch generation (≤15 circuits per call), and inline Qiskit validation that rejected every syntactically invalid circuit at generation time.
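The allowlist check and the chunking step can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: `ALLOWED_GATES` is an assumed subset of the gates defined in Qiskit's bundled `qelib1.inc`, and `disallowed_gates` and `chunked` are hypothetical helper names. The check only inspects the leading identifier of each statement, so it complements a full parser rather than replacing one.

```python
import re

# Assumed subset of qelib1.inc gates; the author's real allowlist is not published here.
ALLOWED_GATES = {
    "u1", "u2", "u3", "id", "x", "y", "z", "h", "s", "sdg", "t", "tdg",
    "rx", "ry", "rz", "cx", "cy", "cz", "ch", "crz", "cu1", "cu3",
    "ccx", "swap", "cswap",
}
# Statement keywords that are not gate applications.
NON_GATE_KEYWORDS = {"OPENQASM", "include", "qreg", "creg", "measure", "barrier", "reset"}


def disallowed_gates(qasm):
    """Return the set of leading identifiers that are neither keywords nor allowed gates."""
    without_comments = re.sub(r"//[^\n]*", "", qasm)
    rejected = set()
    for statement in without_comments.split(";"):
        match = re.match(r"[A-Za-z_][A-Za-z0-9_]*", statement.strip())
        if match is None:
            continue  # empty statement or one that does not start with an identifier
        name = match.group(0)
        if name not in NON_GATE_KEYWORDS and name not in ALLOWED_GATES:
            rejected.add(name)
    return rejected


def chunked(items, size=15):
    """Yield consecutive batches of at most `size` items (15 mirrors the per-call cap)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


bell = 'OPENQASM 2.0;\ninclude "qelib1.inc";\nqreg q[2];\ncreg c[2];\nh q[0];\ncx q[0],q[1];\nmeasure q -> c;\n'
print(disallowed_gates(bell))                # set()
print(disallowed_gates(bell + "foo q[0];"))  # {'foo'}
```

A circuit that passes this pre-check would still need to go through a real OpenQASM 2.0 parser, which is the validation step the dataset itself reports.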
### v1 — quantum-circuits-8k (February 2026)
The original v1 dataset (8,129 samples, 739 unique circuits) is hosted separately at
[merileijona/quantum-circuits-8k](https://huggingface.co/datasets/merileijona/quantum-circuits-8k).
---
## Schema
### augmented_v2 (training format)
```python
{
    "description": "Create a Bell state using two qubits",  # natural language prompt
    "circuit_qasm": "OPENQASM 2.0;\ninclude \"qelib1.inc\";\n...",  # target QASM circuit
    "category": "bell_state_phi_plus",  # circuit category
    "source": "grok_generated",
    "original_hash": "7a2c70fd...",  # SHA-256 of base circuit
    "variation": "paraphrase_3"  # original | paraphrase_1..10
}
```
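Because every paraphrase carries the `original_hash` of its base circuit, variants of one circuit can be kept together when carving out a validation split. The sample counts are consistent with this grouping: 21,208 = 1,928 × 11, which matches one `original` plus ten paraphrases per base circuit. The sketch below assumes `samples` is a list of dicts in the schema above. `split_by_base_circuit`, `val_fraction`, and `seed` are hypothetical names, and the card does not say the author split the data this way.

```python
import random
from collections import defaultdict


def split_by_base_circuit(samples, val_fraction=0.1, seed=0):
    """Split samples so that all variants of a base circuit land on the same side."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["original_hash"]].append(sample)

    hashes = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(hashes)
    n_val = max(1, int(len(hashes) * val_fraction))
    val_hashes = set(hashes[:n_val])

    train = [s for h in hashes if h not in val_hashes for s in groups[h]]
    val = [s for h in hashes if h in val_hashes for s in groups[h]]
    return train, val


# Toy data: 10 base circuits with 3 variants each.
toy = [
    {"original_hash": f"hash{i}", "variation": v, "description": f"circuit {i} ({v})"}
    for i in range(10)
    for v in ("original", "paraphrase_1", "paraphrase_2")
]
train, val = split_by_base_circuit(toy)
print(len(train), len(val))  # 27 3
```

Splitting on the hash rather than on individual rows avoids scoring a model on a paraphrase of a circuit it has already seen during training.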
### master_v2 (research / analysis)
```python
{
    "description": "Create a Bell state using two qubits",
    "qasm": "OPENQASM 2.0;\ninclude \"qelib1.inc\";\n...",
    "category": "bell_state_phi_plus",
    "subcategory": "entanglement",
    "qubits": 2,
    "hash": "7a2c70fd...",
    "source": "grok_generated"
}
```
---
## Training Format
```python
# Standard causal LM format used by QuantumGPT
formatted = f"<|user|>{sample['description']}<|end|>\n<|assistant|>{sample['circuit_qasm']}<|end|>"
```
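At inference time the same template has to be cut off after the assistant tag, and the circuit has to be recovered from the completion. The helpers below are a sketch under that assumption. `format_sample`, `build_prompt`, and `extract_qasm` are hypothetical names, and the card does not document how QuantumGPT completions are actually post-processed.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def format_sample(sample):
    """Render one training example in the chat template shown above."""
    return f"{USER}{sample['description']}{END}\n{ASSISTANT}{sample['circuit_qasm']}{END}"


def build_prompt(description):
    """Render only the prefix the model should complete."""
    return f"{USER}{description}{END}\n{ASSISTANT}"


def extract_qasm(generated):
    """Return the text between the assistant tag and the next end tag."""
    _, _, tail = generated.partition(ASSISTANT)
    qasm, _, _ = tail.partition(END)
    return qasm.strip()


sample = {
    "description": "Create a Bell state using two qubits",
    "circuit_qasm": 'OPENQASM 2.0;\ninclude "qelib1.inc";\nqreg q[2];\nh q[0];\ncx q[0],q[1];',
}
text = format_sample(sample)
print(extract_qasm(text) == sample["circuit_qasm"])  # True
```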
---
## Circuit Categories (92 total)
**Single-qubit gates (14):** H, X, Y, Z, S, T, Sdg, Tdg, RX, RY, RZ, U1, U2, U3
**Two-qubit operations (11):** Bell states (Φ+, Φ−, Ψ+, Ψ−), CNOT, CZ, SWAP, iSWAP, controlled rotations
**Three-qubit operations (6):** GHZ states, W states, Toffoli, Fredkin
**Quantum algorithms (15):** Deutsch-Jozsa, Grover (1–3 iterations), QFT (2–4 qubits), phase estimation
**Variational circuits (15):** VQE ansätze, hardware-efficient ansätze, QAOA, brickwork patterns, UCCSD
**Error correction (6):** bit-flip code, phase-flip code, Steane 7-qubit, Shor 9-qubit
**Arithmetic (8):** adders, incrementers, decrementers, comparators
**Special states (6):** Dicke states, graph states, cluster states
**New in v2 (32 new subcategory families):** quantum simulation, hardware-native, noise mitigation, Clifford+T, connectivity-aware, large-qubit algorithms, quantum walks, amplitude techniques, block encoding, and more
---
## Quality Metrics
| Metric | v1 | v2 |
|---|---|---|
| QASM syntax validity | 100% | 100% |
| Duplicate rate | 0% | 0% |
| Description diversity | 99.8% unique | 99.9% unique |
| Category coverage | 92/92 | 92/92 |
| Qubit range | 1–9 | 1–9 |
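The card does not define how the duplicate rate and description diversity are computed. One plausible reading, sketched below as an assumption, treats a duplicate as a repeated (description, circuit) pair and diversity as the share of distinct descriptions. `duplicate_rate` and `description_diversity` are hypothetical helper names.

```python
def duplicate_rate(samples):
    """Fraction of rows that repeat an earlier (description, circuit_qasm) pair."""
    pairs = [(s["description"], s["circuit_qasm"]) for s in samples]
    return 1 - len(set(pairs)) / len(pairs)


def description_diversity(samples):
    """Fraction of rows whose description text is distinct."""
    return len({s["description"] for s in samples}) / len(samples)


toy_rows = [
    {"description": "a", "circuit_qasm": "x"},
    {"description": "a", "circuit_qasm": "y"},
    {"description": "b", "circuit_qasm": "y"},
    {"description": "b", "circuit_qasm": "y"},  # exact repeat of the previous row
]
print(duplicate_rate(toy_rows))         # 0.25
print(description_diversity(toy_rows))  # 0.5
```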
---
## Benchmark Results
Results for QuantumGPT-124M (GPT-2 architecture, 123.8M parameters, trained from scratch) trained on each dataset version and evaluated on the QuantumGPT Benchmark v1.0 (100 prompts, 50 in-distribution / 50 out-of-distribution, pass@5, seed=42):
| Model | Training Data | pass@1 syntax | pass@5 syntax | Semantic valid |
|---|---|---|---|---|
| QuantumGPT-124M-v1 | quantum-circuits-8k | 67.2% | 91.0% | 48.0% |
| QuantumGPT-124M-v2 | quantum-circuits-21k | **95.8%** | **100.0%** | **61.0%** |
Improvement is statistically significant (Fisher exact, p=0.0016). Benchmark prompt suite hash: `ee2da8a57e683af2464eb7a4eada0898`.
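The card does not say which column the Fisher exact test was applied to. One reading that reproduces the reported value is a one-sided test on pass@5 syntax, assuming the percentages correspond to 91/100 and 100/100 prompts. Those counts are an inference from the table, not something the source states. The sketch below computes the hypergeometric tail with the standard library only; `fisher_exact_greater` is a hypothetical helper name.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) with row and column totals held fixed,
    where X is the top-left cell count.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / comb(total, col1)


# Assumed counts: v2 passes 100 of 100 prompts, v1 passes 91 of 100.
p_value = fisher_exact_greater(100, 0, 91, 9)
print(f"{p_value:.4f}")  # 0.0016
```

Under the same assumed counts a two-sided test would give a larger p-value, so the match with 0.0016 suggests, but does not prove, that a one-sided test was reported.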
---
## Limitations
1. **Synthetic data** — all circuits generated by LLM (xAI Grok), not from real quantum programs
2. **OpenQASM 2.0 only** — not QASM 3.0 or hardware-native formats (though v2 includes hardware-native categories)
3. **Small circuit scale** — optimised for 1–9 qubit systems
4. **Syntactic but not semantic guarantee** — 100% QASM-parseable, but unitary correctness not verified at dataset level
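To illustrate what a semantic check would add on top of parsing (limitation 4), the sketch below simulates the two-gate Bell circuit `h q[0]; cx q[0],q[1];` with a tiny pure-Python statevector and compares the result to the expected amplitudes. It is not part of the dataset's validation pipeline. `apply_h` and `apply_cx` are hypothetical helpers, and qubit `k` is assumed to map to bit `k` of the basis-state index.

```python
import math


def apply_h(state, qubit):
    """Apply a Hadamard gate to `qubit` of a statevector stored as a list."""
    scale = 1 / math.sqrt(2)
    new_state = [0j] * len(state)
    for index, amplitude in enumerate(state):
        bit = (index >> qubit) & 1
        zero_index = index & ~(1 << qubit)
        one_index = index | (1 << qubit)
        # H|0> = (|0> + |1>)/sqrt(2), H|1> = (|0> - |1>)/sqrt(2)
        new_state[zero_index] += scale * amplitude
        new_state[one_index] += scale * amplitude * (-1 if bit else 1)
    return new_state


def apply_cx(state, control, target):
    """Apply a CNOT: flip `target` on every basis state where `control` is 1."""
    new_state = [0j] * len(state)
    for index, amplitude in enumerate(state):
        flip = (index >> control) & 1
        destination = index ^ (1 << target) if flip else index
        new_state[destination] += amplitude
    return new_state


state = [1 + 0j, 0j, 0j, 0j]  # |00>
state = apply_cx(apply_h(state, 0), 0, 1)

expected = [1 / math.sqrt(2), 0, 0, 1 / math.sqrt(2)]  # (|00> + |11>)/sqrt(2)
matches = all(abs(got - want) < 1e-12 for got, want in zip(state, expected))
print(matches)  # True
```

A circuit can parse cleanly and still fail a comparison like this one, which is the gap the limitation describes.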
---
## Related Models and Datasets
- [merileijona/quantumgpt-124m](https://huggingface.co/merileijona/quantumgpt-124m) — trained on v1 (quantum-circuits-8k)
- [merileijona/quantum-circuits-8k](https://huggingface.co/datasets/merileijona/quantum-circuits-8k) — original v1 dataset
---
## Citation
```bibtex
@misc{quantum-circuits-21k,
author = {Merilehto, Juhani},
title = {Quantum Circuits Dataset: Validated NL→OpenQASM 2.0 Pairs for Generative Model Training},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/merileijona/quantum-circuits-21k},
note = {v1: 8,129 samples; v2: 21,208 samples across 92 circuit categories}
}
```
---
## License
MIT License
## Acknowledgments
- Circuit generation: xAI Grok API
- Syntax validation: Qiskit OpenQASM 2.0 parser
- Training framework: nanoGPT / nanochat (Andrej Karpathy)
- Affiliation: University of Vaasa; University of Turku
---
*From the uncertainty of data, the Machine Spirit guides us. May the Omnissiah bless all quantum computations.* ⚛️
Provider: merileijona