
DanielRegaladoCardoso/svg-chart-render-v1

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/DanielRegaladoCardoso/svg-chart-render-v1
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- svg
- chart-rendering
- data-visualization
- code-generation
- chart-spec-to-svg
pretty_name: SVG Chart Render Mix v1
size_categories:
- 10K<n<100K
---

# SVG Chart Render Mix v1

Training data for fine-tuning a small code model (DeepSeek Coder 1.3B) to map **(chart specification JSON) to inline SVG code**.

> Part of the [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.

| Total | Sources | Input | Output |
|-------|---------|-------|--------|
| ~25,000 rows | 2 | structured JSON chart spec | rendered SVG string |

## Part of the SQL Agent LLMOps project

| Dataset | Model | Role |
|---------|-------|------|
| [`DanielRegaladoCardoso/text-to-sql-mix-v2`](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Qwen 2.5 Coder 7B | NL question to SQL |
| [`DanielRegaladoCardoso/chart-reasoning-mix-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1) | Phi-3 Mini 3.8B | (question + result) to chart spec |
| **[`DanielRegaladoCardoso/svg-chart-render-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1)** | **DeepSeek Coder 1.3B** | **chart spec to SVG** |

## Schema

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Stable hash-based identifier |
| `chart_spec` | string (JSON) | Input: chart type, data points, encoding, title, axis labels |
| `svg_code` | string | Target output: full `<svg>...</svg>` inline SVG |
| `source` | string | Origin tag (see source attribution) |
| `metadata` | string (JSON) | chart_type, num_points, svg_size_bytes |

Note: `chart_spec` and `metadata` are stored as JSON strings in the parquet. Parse with `json.loads(row["chart_spec"])` after loading.
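
Since both `chart_spec` and `metadata` are stored as JSON strings, it can help to decode them together. The following is a minimal sketch, assuming each row arrives as a plain dict with the fields from the schema table; `parse_row` and the example row are hypothetical and not part of the dataset or its pipeline.

```python
import json
from typing import Any


def parse_row(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `row` with the JSON-string fields decoded."""
    parsed = dict(row)
    for field in ("chart_spec", "metadata"):
        value = row.get(field)
        # Only decode strings; leave missing or already-decoded values alone.
        if isinstance(value, str):
            parsed[field] = json.loads(value)
    return parsed


# Hypothetical row, hand-written to match the schema table (not real data).
example = {
    "id": "abc123",
    "chart_spec": (
        '{"chart_type": "bar", '
        '"data": [{"x": "A", "y": 3, "color": null}], '
        '"encoding": {"x": "name", "y": "count", "color": null}, '
        '"title": null, "x_label": null, "y_label": null}'
    ),
    "svg_code": "<svg></svg>",
    "source": "synth-matplotlib",
    "metadata": '{"chart_type": "bar", "num_points": 1, "svg_size_bytes": 11}',
}

row = parse_row(example)
print(row["chart_spec"]["chart_type"], row["metadata"]["num_points"])
```

The helper copies the row rather than mutating it, so the original string form stays available for building training prompts.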

### chart_spec structure

```json
{
  "chart_type": "bar|line|scatter|donut|histogram|area",
  "data": [{"x": "value", "y": "value", "color": "value|null"}],
  "encoding": {"x": "col_name", "y": "col_name", "color": "col_name|null"},
  "title": "string or null",
  "x_label": "string or null",
  "y_label": "string or null"
}
```

## Splits

| Split | Rows |
|-------|------|
| train | ~23,750 |
| validation | ~625 |
| test | ~625 |

## Source attribution

| Source | Tag in `source` | Rows | License | Link | Method | Notes |
|--------|-----------------|------|---------|------|--------|-------|
| nvBench chart renders | `synth-matplotlib` | ~10,500 | Apache-2.0 (pipeline) + MIT (nvBench data) | [nvBench GitHub](https://github.com/TsinghuaDatabaseGroup/nvBench) | Programmatic | Each of 7,247 nvBench chart configs rendered via matplotlib's SVG backend. Up to 3 title-augmented variants per entry. Chart types: bar, line, scatter, donut. Perfect (spec, svg) pairs. |
| svgen-500k filtered | `svgen500k-*` | ~15,000 | Per upstream row | [umuthopeyildirim/svgen-500k](https://huggingface.co/datasets/umuthopeyildirim/svgen-500k) | Filtered | Streamed 216k SVGs, heuristic-filtered to chart-shaped structures (multi-element: has `<rect>`, `<line>`, `<g>`). Rejects single-path icons. Provides general SVG syntax fluency. |

## Pipeline

Build script: [`training/data_pipelines/build_svg_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_svg_mix.py)

Three stages:

1. **synth-charts** -- load nvBench chart configs, replay each in matplotlib (Agg/SVG backend), augment with NL paraphrases as candidate titles.
2. **svgen** -- stream `umuthopeyildirim/svgen-500k`, keep only chart-shaped SVGs.
3. **combine-push** -- dedup by SVG hash, 95/2.5/2.5 split, push to HF with card.
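
The filtering and combining stages can be sketched roughly as follows. This is not the code from `build_svg_mix.py`: the exact heuristic (requiring all three tag kinds), the SHA-256 hash, the shuffle seed, and the function names `looks_chart_shaped`, `svg_hash`, and `dedup_and_split` are all assumptions made for illustration.

```python
import hashlib
import random
import re

# Assumption: "chart-shaped" means the SVG contains all three tag kinds named
# in the source table. The real filter may use different or looser rules.
CHART_TAGS = ("rect", "line", "g")


def looks_chart_shaped(svg: str) -> bool:
    """Heuristic check that an SVG has <rect>, <line>, and <g> elements."""
    # The trailing character class avoids matching e.g. <linearGradient>.
    return all(re.search(rf"<{tag}[\s>/]", svg) for tag in CHART_TAGS)


def svg_hash(svg: str) -> str:
    """Hash the whitespace-stripped SVG text (SHA-256 is an assumption)."""
    return hashlib.sha256(svg.strip().encode("utf-8")).hexdigest()


def dedup_and_split(rows, seed=0):
    """Drop rows with duplicate SVG hashes, then split 95/2.5/2.5."""
    seen = set()
    unique = []
    for row in rows:
        digest = svg_hash(row["svg_code"])
        if digest not in seen:
            seen.add(digest)
            unique.append(row)

    random.Random(seed).shuffle(unique)
    # Integer arithmetic for the 2.5% shares avoids float rounding surprises.
    n_val = len(unique) * 25 // 1000
    n_test = len(unique) * 25 // 1000
    n_train = len(unique) - n_val - n_test
    return {
        "train": unique[:n_train],
        "validation": unique[n_train:n_train + n_val],
        "test": unique[n_train + n_val:],
    }


# Hand-written toy inputs, not rows from the dataset.
chart = '<svg><g><rect width="1" height="1"/><line x1="0" y1="0" x2="1" y2="1"/></g></svg>'
icon = '<svg><path d="M0 0L1 1"/></svg>'
print(looks_chart_shaped(chart), looks_chart_shaped(icon))

rows = [{"svg_code": f"<svg><g id='{i % 400}'/></svg>"} for i in range(440)]
splits = dedup_and_split(rows)
print({name: len(part) for name, part in splits.items()})
```

Deduplicating before splitting, as the stage description implies, keeps identical SVGs from landing in both the training and evaluation splits.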

## Usage

```python
from datasets import load_dataset
import json

ds = load_dataset("DanielRegaladoCardoso/svg-chart-render-v1")
ex = ds["train"][0]
spec = json.loads(ex["chart_spec"])
print(spec["chart_type"])
print(ex["svg_code"][:200])
```

### SFT format

```python
import json

def to_sft(row):
    return {
        "messages": [
            {"role": "system", "content": "You render chart specifications as inline SVG."},
            {"role": "user", "content": "Render this chart spec as SVG:\n\n" + row["chart_spec"]},
            {"role": "assistant", "content": row["svg_code"]},
        ]
    }
```

## Known limitations

- SVGs from `synth-matplotlib` carry matplotlib's stylistic defaults (font, axes, ticks). The model will produce matplotlib-flavored SVGs.
- `svgen500k-*` rows have no structured `chart_spec` -- only freeform `title` and `_freeform_description` are populated.
- SVGs are capped at 50 KB to keep training tractable.
- Only 6 chart types covered (bar, line, scatter, donut, histogram, area).

## Citation

```bibtex
@dataset{regalado2026svgmix,
  author = {Regalado Cardoso, Daniel},
  title = {SVG Chart Render Mix v1},
  year = {2026},
  url = {https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1}
}
```

## License

Pipeline and curation: **Apache-2.0**. nvBench data is MIT-licensed. svgen-500k rows carry per-row `license` fields from their upstream sources. See the source attribution table.

---

_Built by Daniel Regalado Cardoso -- MSBA, University of Miami -- April 2026._
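
Because `svgen500k-*` rows carry no structured `chart_spec`, work that depends on the spec fields may want to select only the `synth-matplotlib` rows first. The sketch below assumes rows are plain dicts and that the `source` tag is exactly `synth-matplotlib` for rendered charts, as the attribution table states; `is_structured`, `chart_type_counts`, and the toy rows (including the `svgen500k-example` suffix) are hypothetical.

```python
import json
from collections import Counter


def is_structured(row: dict) -> bool:
    """Keep rows whose `source` tag marks them as matplotlib renders."""
    return row["source"] == "synth-matplotlib"


def chart_type_counts(rows) -> Counter:
    """Count chart types among structured rows, read from `chart_spec`."""
    counts = Counter()
    for row in rows:
        if is_structured(row):
            spec = json.loads(row["chart_spec"])
            counts[spec.get("chart_type")] += 1
    return counts


# Hypothetical rows for illustration only; the svgen tag suffix is made up.
rows = [
    {"source": "synth-matplotlib", "chart_spec": '{"chart_type": "bar"}'},
    {"source": "synth-matplotlib", "chart_spec": '{"chart_type": "line"}'},
    {"source": "svgen500k-example", "chart_spec": '{"title": "icon set"}'},
]
print(chart_type_counts(rows))
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `ds["train"].filter(is_structured)`.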
Provided by:
DanielRegaladoCardoso