DanielRegaladoCardoso/chart-reasoning-mix-v1
Hugging Face | Updated 2026-04-16 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- chart-generation
- data-visualization
- storytelling-with-data
- chart-spec
- sql
- llm-distillation
pretty_name: Chart Reasoning Mix v1
size_categories:
- 10K<n<100K
---
# Chart Reasoning Mix v1
Training data for fine-tuning compact LLMs (Phi-3 Mini, Qwen 2.5 3B)
to map **(natural-language question + SQL result schema) to a
storytelling-grade chart specification**.
> Part of the [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.
| Total | Sources | Storytelling fields |
|-------|---------|---------------------|
| **35,167 rows** | 2 (nvBench real + OpenAI synth) | chart_type, encoding, title, sort, color_strategy, rationale |
## Part of the SQL Agent LLMOps project
| Dataset | Model | Role |
|---------|-------|------|
| [`DanielRegaladoCardoso/text-to-sql-mix-v2`](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Qwen 2.5 Coder 7B | NL question to SQL |
| **[`DanielRegaladoCardoso/chart-reasoning-mix-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1)** | **Phi-3 Mini 3.8B** | **(question + result) to chart spec** |
| [`DanielRegaladoCardoso/svg-chart-render-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1) | DeepSeek Coder 1.3B | chart spec to SVG |
## Schema
Each row contains:
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Stable hash-based identifier |
| `instruction` | string | Natural-language question |
| `data_profile` | string (JSON) | SQL result column schema: column names and types (per Known limitations, no actual row data) |
| `chart_spec` | string (JSON) | Target chart specification (see below) |
| `source` | string | `nvbench` or `synth-openai-gpt41nano` |
| `difficulty` | string | `easy`, `medium`, `hard`, or `unknown` |
### chart_spec structure
```json
{
"chart_type": "bar|line|scatter|donut|histogram|boxplot|area|heatmap|sankey|funnel",
"encoding": {"x": "col", "y": "col", "color": "col|null", "size": "col|null", "facet": "col|null"},
"title": "Insight-driven title (not just the topic)",
"sort": {"by": "col", "order": "asc|desc|natural"},
"color_strategy": "highlight|categorical|sequential|diverging",
"rationale": "One sentence explaining why this chart type was chosen"
}
```
Note: `data_profile` and `chart_spec` are stored as JSON strings in the
parquet. Parse with `json.loads(row["chart_spec"])` after loading.
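Because both fields arrive as strings, a light sanity check after parsing can catch malformed rows early. The following is a minimal sketch, not part of the dataset's tooling: the allowed values are copied from the structure above, `check_chart_spec` is a hypothetical helper name, and the example row is hand-written for illustration. The card does not say whether every row populates every field, so the checks may need loosening.

```python
import json

# Allowed values copied from the chart_spec structure documented above.
CHART_TYPES = {"bar", "line", "scatter", "donut", "histogram", "boxplot",
               "area", "heatmap", "sankey", "funnel"}
COLOR_STRATEGIES = {"highlight", "categorical", "sequential", "diverging"}
SORT_ORDERS = {"asc", "desc", "natural"}


def check_chart_spec(raw: str) -> list[str]:
    """Parse a chart_spec JSON string; return a list of problems (empty if none)."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(spec, dict):
        return ["chart_spec is not a JSON object"]

    problems = []
    if spec.get("chart_type") not in CHART_TYPES:
        problems.append(f"unexpected chart_type: {spec.get('chart_type')!r}")
    if not isinstance(spec.get("encoding"), dict):
        problems.append("encoding is missing or not an object")
    sort = spec.get("sort")
    if isinstance(sort, dict) and sort.get("order") not in SORT_ORDERS:
        problems.append(f"unexpected sort order: {sort.get('order')!r}")
    if spec.get("color_strategy") not in COLOR_STRATEGIES:
        problems.append(f"unexpected color_strategy: {spec.get('color_strategy')!r}")
    return problems


# Hypothetical row content, written by hand for illustration only.
example = json.dumps({
    "chart_type": "bar",
    "encoding": {"x": "region", "y": "revenue", "color": None, "size": None, "facet": None},
    "title": "West region leads revenue",
    "sort": {"by": "revenue", "order": "desc"},
    "color_strategy": "highlight",
    "rationale": "Bars compare a numeric value across a few categories.",
})
print(check_chart_spec(example))  # prints []
```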
## Splits
| Split | Rows |
|-------|------|
| train | 33,408 |
| validation | 879 |
| test | 880 |
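As a quick arithmetic check on the table above, the three splits sum to the 35,167-row total and work out to roughly a 95 / 2.5 / 2.5 division. The snippet uses only the numbers stated in this card.

```python
# Split sizes as listed in the table above.
splits = {"train": 33_408, "validation": 879, "test": 880}

total = sum(splits.values())
print(total)  # 35167

# Share of each split, formatted to one decimal place.
for name, rows in splits.items():
    print(f"{name}: {rows / total:.1%}")
```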
## Source attribution
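This dataset combines the following sources. The row counts in the table sum to 33,408, which equals the train split size rather than the 35,167-row total: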
| Source | Tag in `source` | Rows | License | Link | Notes |
|--------|-----------------|------|---------|------|-------|
| nvBench (Tsinghua DB Group) | `nvbench` | 24,201 | MIT | [GitHub](https://github.com/TsinghuaDatabaseGroup/nvBench) | Gold-standard NL-to-visualization benchmark. 7,247 base entries with up to 5 NL paraphrases each. Chart types: bar, line, scatter, donut. Titles backfilled from NL questions. |
| OpenAI gpt-4.1-nano synthesis | `synth-openai-gpt41nano` | 9,207 | Apache-2.0 | [text-to-sql-mix-v2](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Chart specs synthesized from SQL mix v2 questions via OpenAI Batch API. System prompt distills Tufte/Knaflic/Few storytelling principles. Includes insight-driven titles and rationale. |
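The `source` tag makes it straightforward to tally or isolate rows by origin. Below is a minimal sketch that works on any iterable of row dictionaries; the helper names and the three sample rows are hypothetical and stand in for real data.

```python
from collections import Counter


def count_by_source(rows):
    """Tally rows per value of the `source` field."""
    return Counter(row["source"] for row in rows)


def synth_only(rows):
    """Keep only the rows tagged as OpenAI-synthesized."""
    return [row for row in rows if row["source"] == "synth-openai-gpt41nano"]


# Hypothetical rows for illustration; real rows carry the full schema.
rows = [
    {"id": "a", "source": "nvbench"},
    {"id": "b", "source": "nvbench"},
    {"id": "c", "source": "synth-openai-gpt41nano"},
]
print(count_by_source(rows))
print([row["id"] for row in synth_only(rows)])
```

With the `datasets` library, the same selection can be expressed on a loaded split as `ds["train"].filter(lambda r: r["source"] == "synth-openai-gpt41nano")`.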
## Storytelling principles
The synthesis system prompt distills data-visualization best practices from:
- **Edward Tufte** -- data-ink ratio, integrity, small multiples
- **Cole Nussbaumer Knaflic** -- clutter elimination, action-driven titles
- **Stephen Few** -- perceptual encoding, dashboard hygiene
Models trained on this dataset learn:
- Correct chart type selection (from 33k examples)
- Axis encoding (which column maps to x/y/color)
- Insight-driven titles ("Sales grew 47% in Q4" not "Sales by month")
- Smart sorting (value-desc for rankings, natural for time)
- Color strategy (highlight key finding, gray background)
- Rationale (model can explain its choice)
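One way to measure the first two items in the list above (chart type selection and axis encoding) is an exact-match comparison between a predicted spec and the gold `chart_spec`. This is a sketch under assumptions, not an evaluation protocol defined by the dataset: `score_prediction` is a hypothetical helper, and exact string equality on `x` and `y` is only one possible criterion.

```python
import json


def score_prediction(pred_raw: str, gold_raw: str) -> dict[str, bool]:
    """Compare a predicted chart_spec string against the gold one, field by field."""
    gold = json.loads(gold_raw)
    try:
        pred = json.loads(pred_raw)
    except json.JSONDecodeError:
        pred = None
    if not isinstance(pred, dict):
        # Unparseable or non-object output counts as a miss on every field.
        return {"valid_json": False, "chart_type": False, "x": False, "y": False}

    pred_enc = pred.get("encoding") if isinstance(pred.get("encoding"), dict) else {}
    gold_enc = gold.get("encoding") if isinstance(gold.get("encoding"), dict) else {}
    return {
        "valid_json": True,
        "chart_type": pred.get("chart_type") == gold.get("chart_type"),
        "x": pred_enc.get("x") == gold_enc.get("x"),
        "y": pred_enc.get("y") == gold_enc.get("y"),
    }


# Hypothetical gold and predicted specs for illustration.
gold = json.dumps({"chart_type": "line", "encoding": {"x": "month", "y": "sales"}})
pred = json.dumps({"chart_type": "bar", "encoding": {"x": "month", "y": "sales"}})
print(score_prediction(pred, gold))
```

Free-text fields such as `title` and `rationale` do not lend themselves to exact matching and would need a different kind of evaluation.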
## Pipeline
Build script: [`training/data_pipelines/build_chart_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_chart_mix.py)
Stages: `nvbench` (load + convert) -> `synth-prepare` (sample SQL mix, build batch JSONL) ->
`synth-submit` (OpenAI Batch API) -> `synth-fetch` (download results) -> `combine-push` (merge, dedup, split, push).
Title enrichment: [`training/data_pipelines/enrich_chart_titles.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/enrich_chart_titles.py)
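The `combine-push` stage is described as merge, dedup, split, push. The sketch below illustrates the first three steps in plain Python. It is not the linked build script: deduplicating on `id`, the seed, and the 2.5% validation and test fractions are all assumptions chosen to resemble the published split sizes.

```python
import random


def combine_dedup_split(sources, seed=42, val_frac=0.025, test_frac=0.025):
    """Merge row lists, drop repeated ids, shuffle deterministically, and split.

    All parameters are illustrative assumptions, not the pipeline's real settings.
    """
    seen = set()
    merged = []
    for rows in sources:
        for row in rows:
            if row["id"] not in seen:  # keep the first occurrence of each id
                seen.add(row["id"])
                merged.append(row)

    random.Random(seed).shuffle(merged)  # seeded, so the split is reproducible
    n_val = round(len(merged) * val_frac)
    n_test = round(len(merged) * test_frac)
    return {
        "validation": merged[:n_val],
        "test": merged[n_val:n_val + n_test],
        "train": merged[n_val + n_test:],
    }


# Hypothetical inputs: 30 nvbench-like rows and 10 synth-like rows plus one duplicate id.
nvbench_rows = [{"id": f"n{i}", "source": "nvbench"} for i in range(30)]
synth_rows = [{"id": f"s{i}", "source": "synth-openai-gpt41nano"} for i in range(10)]
synth_rows.append({"id": "n0", "source": "synth-openai-gpt41nano"})

result = combine_dedup_split([nvbench_rows, synth_rows])
print({name: len(rows) for name, rows in result.items()})
```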
## Usage
```python
from datasets import load_dataset
import json
ds = load_dataset("DanielRegaladoCardoso/chart-reasoning-mix-v1")
ex = ds["train"][0]
spec = json.loads(ex["chart_spec"])
print(ex["instruction"])
print(spec["chart_type"], spec["title"])
print(spec["rationale"])
```
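For fine-tuning, each row usually has to be turned into a prompt and a target string. The following is one possible formatting, offered as a sketch: the prompt template, the `to_training_pair` helper, and the layout of the sample `data_profile` are hypothetical and are not specified by this card.

```python
import json


def to_training_pair(row: dict) -> dict:
    """Build a prompt/completion pair from one dataset row (hypothetical template)."""
    prompt = (
        "Question:\n" + row["instruction"] + "\n\n"
        "Result schema:\n" + row["data_profile"] + "\n\n"
        "Chart spec:\n"
    )
    # Re-serialize the target compactly so the model sees one consistent JSON style.
    completion = json.dumps(json.loads(row["chart_spec"]), separators=(",", ":"))
    return {"prompt": prompt, "completion": completion}


# Hypothetical row; the data_profile layout shown here is an assumption.
row = {
    "instruction": "Which region had the highest revenue?",
    "data_profile": json.dumps({"columns": [{"name": "region", "type": "text"},
                                            {"name": "revenue", "type": "number"}]}),
    "chart_spec": json.dumps({"chart_type": "bar",
                              "encoding": {"x": "region", "y": "revenue"}}),
}
pair = to_training_pair(row)
print(pair["prompt"] + pair["completion"])
```

On a loaded split, the same function could be applied row by row with `ds["train"].map(to_training_pair)`.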
## Known limitations
- nvBench titles are derived from the NL question (descriptive, not insight-driven).
Only the synth rows (9,207, about 28% of the 33,408 rows listed under Source attribution) have true storytelling titles.
- `data_profile` does not include actual row data -- only column names and types.
The model cannot reason about specific values.
- Difficulty labels are heuristic, not human-judged.
## Citation
```bibtex
@dataset{regalado2026chartmix,
author = {Regalado Cardoso, Daniel},
title = {Chart Reasoning Mix v1},
year = {2026},
url = {https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1}
}
```
Plus the original nvBench citation:
```bibtex
@inproceedings{luo2021nvbench,
title = {Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks},
author = {Luo, Yuyu and Tang, Nan and Li, Guoliang and Tang, Jiawei and Chai, Chengliang and Qin, Xuedi},
booktitle = {SIGMOD},
year = {2021}
}
```
## License
Pipeline and curation: **Apache-2.0**. Row content inherits upstream licenses
(MIT for nvBench, Apache-2.0 for synth). See source attribution table.
---
_Built by Daniel Regalado Cardoso -- MSBA, University of Miami -- April 2026._
Provider:
DanielRegaladoCardoso



