DanielRegaladoCardoso/chart-reasoning-mix-v1

Name: DanielRegaladoCardoso/chart-reasoning-mix-v1
Creator: DanielRegaladoCardoso
Published: 2026-04-16 19:21:26
License: not stated on the listing page (the dataset card declares apache-2.0)

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1


Resource overview:

---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- chart-generation
- data-visualization
- storytelling-with-data
- chart-spec
- sql
- llm-distillation
pretty_name: Chart Reasoning Mix v1
size_categories:
- 10K<n<100K
---

# Chart Reasoning Mix v1

Training data for fine-tuning compact LLMs (Phi-3 Mini, Qwen 2.5 3B) to map **(natural-language question + SQL result schema) to a storytelling-grade chart specification**.

> Part of the [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.

| Total | Sources | Storytelling fields |
|-------|---------|---------------------|
| **35,167 rows** | 2 (nvBench real + OpenAI synth) | chart_type, encoding, title, sort, color_strategy, rationale |

## Part of the SQL Agent LLMOps project

| Dataset | Model | Role |
|---------|-------|------|
| [`DanielRegaladoCardoso/text-to-sql-mix-v2`](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Qwen 2.5 Coder 7B | NL question to SQL |
| **[`DanielRegaladoCardoso/chart-reasoning-mix-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1)** | **Phi-3 Mini 3.8B** | **(question + result) to chart spec** |
| [`DanielRegaladoCardoso/svg-chart-render-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1) | DeepSeek Coder 1.3B | chart spec to SVG |

## Schema

Each row contains:

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Stable hash-based identifier |
| `instruction` | string | Natural-language question |
| `data_profile` | string (JSON) | SQL result column schema: name, type, sample rows |
| `chart_spec` | string (JSON) | Target chart specification (see below) |
| `source` | string | `nvbench` or `synth-openai-gpt41nano` |
| `difficulty` | string | `easy`, `medium`, `hard`, or `unknown` |

### chart_spec structure

```json
{
  "chart_type": "bar|line|scatter|donut|histogram|boxplot|area|heatmap|sankey|funnel",
  "encoding": {"x": "col", "y": "col", "color": "col|null", "size": "col|null", "facet": "col|null"},
  "title": "Insight-driven title (not just the topic)",
  "sort": {"by": "col", "order": "asc|desc|natural"},
  "color_strategy": "highlight|categorical|sequential|diverging",
  "rationale": "One sentence explaining why this chart type was chosen"
}
```

Note: `data_profile` and `chart_spec` are stored as JSON strings in the parquet. Parse with `json.loads(row["chart_spec"])` after loading.

## Splits

| Split | Rows |
|-------|------|
| train | 33,408 |
| validation | 879 |
| test | 880 |

## Source attribution

This dataset combines the following sources:

| Source | Tag in `source` | Rows | License | Link | Notes |
|--------|-----------------|------|---------|------|-------|
| nvBench (Tsinghua DB Group) | `nvbench` | 24,201 | MIT | [GitHub](https://github.com/TsinghuaDatabaseGroup/nvBench) | Gold-standard NL-to-visualization benchmark. 7,247 base entries with up to 5 NL paraphrases each. Chart types: bar, line, scatter, donut. Titles backfilled from NL questions. |
| OpenAI gpt-4.1-nano synthesis | `synth-openai-gpt41nano` | 9,207 | Apache-2.0 | [text-to-sql-mix-v2](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Chart specs synthesized from SQL mix v2 questions via OpenAI Batch API. System prompt distills Tufte/Knaflic/Few storytelling principles. Includes insight-driven titles and rationale. |

## Storytelling principles

The synthesis system prompt distills data-visualization best practices from:

- **Edward Tufte** -- data-ink ratio, integrity, small multiples
- **Cole Nussbaumer Knaflic** -- clutter elimination, action-driven titles
- **Stephen Few** -- perceptual encoding, dashboard hygiene

Models trained on this dataset learn:

- Correct chart type selection (from 33k examples)
- Axis encoding (which column maps to x/y/color)
- Insight-driven titles ("Sales grew 47% in Q4" not "Sales by month")
- Smart sorting (value-desc for rankings, natural for time)
- Color strategy (highlight key finding, gray background)
- Rationale (model can explain its choice)

## Pipeline

Build script: [`training/data_pipelines/build_chart_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_chart_mix.py)

Stages: `nvbench` (load + convert) -> `synth-prepare` (sample SQL mix, build batch JSONL) -> `synth-submit` (OpenAI Batch API) -> `synth-fetch` (download results) -> `combine-push` (merge, dedup, split, push).

Title enrichment: [`training/data_pipelines/enrich_chart_titles.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/enrich_chart_titles.py)

## Usage

```python
from datasets import load_dataset
import json

ds = load_dataset("DanielRegaladoCardoso/chart-reasoning-mix-v1")
ex = ds["train"][0]
spec = json.loads(ex["chart_spec"])
print(ex["instruction"])
print(spec["chart_type"], spec["title"])
print(spec["rationale"])
```

## Known limitations

- nvBench titles are derived from the NL question (descriptive, not insight-driven). Only synth rows (28%) have true storytelling titles.
- `data_profile` does not include actual row data -- only column names and types. The model cannot reason about specific values.
- Difficulty labels are heuristic, not human-judged.
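
Since `chart_spec` is stored as a JSON string and only rows tagged `synth-openai-gpt41nano` are documented as carrying storytelling titles, a quick sanity check before training may be useful. The sketch below is not part of the project's pipeline: the helper names `validate_chart_spec` and `has_storytelling_title` and the hand-written `example_row` are hypothetical. The allowed values are copied from the chart_spec structure documented in this card, and it is an assumption that every real row follows that structure exactly, so the validator reports problems rather than raising.

```python
import json

# Allowed values copied from the "chart_spec structure" section of this card.
CHART_TYPES = {"bar", "line", "scatter", "donut", "histogram",
               "boxplot", "area", "heatmap", "sankey", "funnel"}
SORT_ORDERS = {"asc", "desc", "natural"}
COLOR_STRATEGIES = {"highlight", "categorical", "sequential", "diverging"}
REQUIRED_KEYS = {"chart_type", "encoding", "title", "sort",
                 "color_strategy", "rationale"}


def validate_chart_spec(raw: str) -> list[str]:
    """Return the problems found in a chart_spec JSON string (empty list if none)."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(spec, dict):
        return ["chart_spec is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(spec))]
    if "chart_type" in spec and spec["chart_type"] not in CHART_TYPES:
        problems.append(f"unknown chart_type: {spec['chart_type']!r}")
    if "color_strategy" in spec and spec["color_strategy"] not in COLOR_STRATEGIES:
        problems.append(f"unknown color_strategy: {spec['color_strategy']!r}")
    sort = spec.get("sort")
    if isinstance(sort, dict) and sort.get("order") not in SORT_ORDERS:
        problems.append(f"unknown sort order: {sort.get('order')!r}")
    return problems


def has_storytelling_title(row: dict) -> bool:
    """True for synthesized rows, which this card documents as having insight-driven titles."""
    return row["source"] == "synth-openai-gpt41nano"


# Hypothetical row written by hand for illustration; it is not taken from the dataset.
example_row = {
    "source": "synth-openai-gpt41nano",
    "chart_spec": json.dumps({
        "chart_type": "bar",
        "encoding": {"x": "region", "y": "revenue", "color": None,
                     "size": None, "facet": None},
        "title": "One region accounts for most of the revenue",
        "sort": {"by": "revenue", "order": "desc"},
        "color_strategy": "highlight",
        "rationale": "Bars make ranked category comparisons easy to read.",
    }),
}

print(validate_chart_spec(example_row["chart_spec"]))
print(has_storytelling_title(example_row))
```

With the dataset loaded as in the Usage section, `ds["train"].filter(has_storytelling_title)` should keep only the synthesized rows, and `validate_chart_spec(row["chart_spec"])` can be run over each row to spot specs that deviate from the documented structure.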
## Citation

```bibtex
@dataset{regalado2026chartmix,
  author = {Regalado Cardoso, Daniel},
  title  = {Chart Reasoning Mix v1},
  year   = {2026},
  url    = {https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1}
}
```

Plus the original nvBench citation:

```bibtex
@inproceedings{luo2021nvbench,
  title     = {Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks},
  author    = {Luo, Yuyu and Tang, Nan and Li, Guoliang and Tang, Jiawei and Chai, Chengliang and Qin, Xuedi},
  booktitle = {SIGMOD},
  year      = {2021}
}
```

## License

Pipeline and curation: **Apache-2.0**. Row content inherits upstream licenses (MIT for nvBench, Apache-2.0 for synth). See source attribution table.

---

_Built by Daniel Regalado Cardoso -- MSBA, University of Miami -- April 2026._

Provider:

DanielRegaladoCardoso
