DanielRegaladoCardoso/svg-chart-render-v1
Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/DanielRegaladoCardoso/svg-chart-render-v1
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- svg
- chart-rendering
- data-visualization
- code-generation
- chart-spec-to-svg
pretty_name: SVG Chart Render Mix v1
size_categories:
- 10K<n<100K
---
# SVG Chart Render Mix v1
Training data for fine-tuning a small code model (DeepSeek Coder 1.3B)
to map **(chart specification JSON) to inline SVG code**.
> Part of the [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.

| Total | Sources | Input | Output |
|-------|---------|-------|--------|
| ~25,000 rows | 2 | structured JSON chart spec | rendered SVG string |
## Part of the SQL Agent LLMOps project
| Dataset | Model | Role |
|---------|-------|------|
| [`DanielRegaladoCardoso/text-to-sql-mix-v2`](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Qwen 2.5 Coder 7B | NL question to SQL |
| [`DanielRegaladoCardoso/chart-reasoning-mix-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1) | Phi-3 Mini 3.8B | (question + result) to chart spec |
| **[`DanielRegaladoCardoso/svg-chart-render-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1)** | **DeepSeek Coder 1.3B** | **chart spec to SVG** |
## Schema
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Stable hash-based identifier |
| `chart_spec` | string (JSON) | Input: chart type, data points, encoding, title, axis labels |
| `svg_code` | string | Target output: full `<svg>...</svg>` inline SVG |
| `source` | string | Origin tag (see source attribution) |
| `metadata` | string (JSON) | chart_type, num_points, svg_size_bytes |

Note: `chart_spec` and `metadata` are stored as JSON strings in the Parquet files.
Parse them with `json.loads(row["chart_spec"])` after loading.
### chart_spec structure
```json
{
  "chart_type": "bar|line|scatter|donut|histogram|area",
  "data": [{"x": "value", "y": "value", "color": "value|null"}],
  "encoding": {"x": "col_name", "y": "col_name", "color": "col_name|null"},
  "title": "string or null",
  "x_label": "string or null",
  "y_label": "string or null"
}
```
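
A minimal sketch of checking a decoded `chart_spec` against the structure above. The helper name `check_chart_spec` and the choice to report missing keys as problems are assumptions made for illustration; they are not part of the dataset's build tooling.

```python
import json

# Chart types listed in the chart_spec structure above.
CHART_TYPES = {"bar", "line", "scatter", "donut", "histogram", "area"}
# Top-level keys shown in the chart_spec structure above.
EXPECTED_KEYS = {"chart_type", "data", "encoding", "title", "x_label", "y_label"}


def check_chart_spec(raw: str) -> list[str]:
    """Return the problems found in a chart_spec JSON string (hypothetical helper)."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(spec, dict):
        return ["spec is not a JSON object"]

    problems = []
    missing = EXPECTED_KEYS - spec.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    chart_type = spec.get("chart_type")
    if not isinstance(chart_type, str) or chart_type not in CHART_TYPES:
        problems.append(f"unexpected chart_type: {chart_type!r}")
    if not isinstance(spec.get("data"), list):
        problems.append("data is not a list")
    return problems


# Illustrative spec; the values are made up and not taken from the dataset.
example = json.dumps({
    "chart_type": "bar",
    "data": [{"x": "A", "y": 3, "color": None}],
    "encoding": {"x": "category", "y": "count", "color": None},
    "title": None,
    "x_label": "category",
    "y_label": "count",
})
print(check_chart_spec(example))  # []
```

An empty list means the spec has every expected key, a known chart type, and a list of data points.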
## Splits
| Split | Rows |
|-------|------|
| train | ~23,750 |
| validation | ~625 |
| test | ~625 |
## Source attribution
| Source | Tag in `source` | Rows | License | Link | Method | Notes |
|--------|-----------------|------|---------|------|--------|-------|
| nvBench chart renders | `synth-matplotlib` | ~10,500 | Apache-2.0 (pipeline) + MIT (nvBench data) | [nvBench GitHub](https://github.com/TsinghuaDatabaseGroup/nvBench) | Programmatic | Each of 7,247 nvBench chart configs rendered via matplotlib's SVG backend. Up to 3 title-augmented variants per entry. Chart types: bar, line, scatter, donut. Perfect (spec, svg) pairs. |
| svgen-500k filtered | `svgen500k-*` | ~15,000 | Per upstream row | [umuthopeyildirim/svgen-500k](https://huggingface.co/datasets/umuthopeyildirim/svgen-500k) | Filtered | Streamed 216k SVGs, heuristic-filtered to chart-shaped structures (multi-element: has `<rect>`, `<line>`, `<g>`). Rejects single-path icons. Provides general SVG syntax fluency. |
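
The chart-shape filter described for the svgen-500k rows could look roughly like the sketch below. The exact rule used by the build script is not stated here, so this version assumes the strictest reading (the SVG must parse and contain at least one `<rect>`, one `<line>`, and one `<g>`); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed reading of the filter: every one of these tags must be present.
REQUIRED_TAGS = {"rect", "line", "g"}


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.w3.org/2000/svg}rect'."""
    return tag.rsplit("}", 1)[-1]


def looks_chart_shaped(svg_code: str) -> bool:
    """Hypothetical heuristic: well-formed SVG that contains every required tag."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False  # malformed markup is rejected outright
    counts = Counter(local_name(el.tag) for el in root.iter())
    return all(counts[tag] > 0 for tag in REQUIRED_TAGS)


chart = (
    '<svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<rect width="10" height="20"/><line x1="0" y1="0" x2="10" y2="0"/>'
    "</g></svg>"
)
icon = '<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0L10 10"/></svg>'

print(looks_chart_shaped(chart))  # True
print(looks_chart_shaped(icon))   # False
```

A single-path icon fails the check because it has none of the required elements, which matches the stated intent of rejecting icons.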
## Pipeline
Build script: [`training/data_pipelines/build_svg_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_svg_mix.py)
Three stages:
1. **synth-charts** -- load nvBench chart configs, replay each in matplotlib
(Agg/SVG backend), augment with NL paraphrases as candidate titles.
2. **svgen** -- stream `umuthopeyildirim/svgen-500k`, keep only chart-shaped SVGs.
3. **combine-push** -- dedup by SVG hash, 95/2.5/2.5 split, push to HF with card.
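
The third stage can be sketched as follows. This is an illustration under assumptions, not the code in `build_svg_mix.py`: the hash function (SHA-256), the shuffle seed, and the helper name `dedup_and_split` are all chosen here for the example.

```python
import hashlib
import random


def dedup_and_split(rows, seed=0):
    """Drop rows with duplicate SVG content, then split 95/2.5/2.5 (hypothetical helper)."""
    seen, unique = set(), []
    for row in rows:
        digest = hashlib.sha256(row["svg_code"].encode("utf-8")).hexdigest()
        if digest not in seen:  # keep the first row for each distinct SVG
            seen.add(digest)
            unique.append(row)

    random.Random(seed).shuffle(unique)  # seeded so the split is reproducible
    n_val = n_test = round(len(unique) * 0.025)
    n_train = len(unique) - n_val - n_test
    return {
        "train": unique[:n_train],
        "validation": unique[n_train:n_train + n_val],
        "test": unique[n_train + n_val:],
    }


# 400 toy rows, of which only 200 have distinct SVG content.
rows = [{"svg_code": f"<svg id='{i % 200}'/>"} for i in range(400)]
splits = dedup_and_split(rows)
print({name: len(part) for name, part in splits.items()})
# {'train': 190, 'validation': 5, 'test': 5}
```

Applied to 25,000 unique rows, the same arithmetic gives 23,750 / 625 / 625, which agrees with the Splits table.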
## Usage
```python
import json

from datasets import load_dataset

ds = load_dataset("DanielRegaladoCardoso/svg-chart-render-v1")
ex = ds["train"][0]

# chart_spec is stored as a JSON string, so decode it before use.
spec = json.loads(ex["chart_spec"])
print(spec["chart_type"])
print(ex["svg_code"][:200])  # first 200 characters of the target SVG
```
### SFT format
```python
def to_sft(row):
    """Convert one dataset row into a chat-style SFT example."""
    return {
        "messages": [
            {"role": "system", "content": "You render chart specifications as inline SVG."},
            # chart_spec is already a JSON string, so it is appended as-is.
            {"role": "user", "content": "Render this chart spec as SVG:\n\n" + row["chart_spec"]},
            {"role": "assistant", "content": row["svg_code"]},
        ]
    }
```
## Known limitations
- SVGs from `synth-matplotlib` carry matplotlib's stylistic defaults (font,
axes, ticks), so a model fine-tuned on them will tend to produce matplotlib-flavored SVGs.
- `svgen500k-*` rows have no structured `chart_spec` -- only freeform
`title` and `_freeform_description` are populated.
- SVGs are capped at 50 KB to keep training tractable.
- Only six chart types are covered (bar, line, scatter, donut, histogram, area).
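
Given these limitations, it may be useful to select only rows that carry a structured spec, or to re-check the size cap on new data. The sketch below assumes that "50 KB" means 50 * 1024 bytes of UTF-8 text; the helper names and the `svgen500k-example` tag are hypothetical.

```python
# Assumption: "50 KB" is read as 50 * 1024 bytes of UTF-8 encoded text.
MAX_SVG_BYTES = 50 * 1024


def has_structured_spec(row: dict) -> bool:
    """True for rows rendered from nvBench, which carry a full chart_spec."""
    return row["source"] == "synth-matplotlib"


def within_size_cap(row: dict) -> bool:
    """True when the SVG string fits the assumed byte cap."""
    return len(row["svg_code"].encode("utf-8")) <= MAX_SVG_BYTES


# Toy rows; "svgen500k-example" is a made-up tag that follows the svgen500k-* pattern.
sample = [
    {"source": "synth-matplotlib", "svg_code": "<svg></svg>"},
    {"source": "svgen500k-example", "svg_code": "<svg><g/></svg>"},
]
structured = [r for r in sample if has_structured_spec(r) and within_size_cap(r)]
print(len(structured))  # 1
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `ds["train"].filter(has_structured_spec)`.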
## Citation
```bibtex
@dataset{regalado2026svgmix,
author = {Regalado Cardoso, Daniel},
title = {SVG Chart Render Mix v1},
year = {2026},
url = {https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1}
}
```
## License
Pipeline and curation: **Apache-2.0**. nvBench data is MIT-licensed.
svgen-500k rows carry per-row `license` fields from their upstream sources.
See the source attribution table.
---
_Built by Daniel Regalado Cardoso -- MSBA, University of Miami -- April 2026._
Provider:
DanielRegaladoCardoso



