henry1477/pcbslm-static-v2-unsloth-vlm-gui
Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/henry1477/pcbslm-static-v2-unsloth-vlm-gui
---
license: other
task_categories:
- image-text-to-text
- visual-question-answering
language:
- en
pretty_name: PCBSLM static-v2 Unsloth VLM
tags:
- unsloth
- gemma-4
- multimodal
- pcb
- electronics
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/vlm_train.jsonl
  - split: validation
    path: data/vlm_val.jsonl
  - split: test
    path: data/vlm_test.jsonl
---
# PCBSLM static-v2 Unsloth VLM
Portable multimodal Unsloth dataset for PCB layout/document-grounded training.
The JSONL splits use Unsloth/Gemma-style chat messages:
```json
{
  "messages": [
    {"role": "user", "content": [
      {"type": "image", "image": "https://huggingface.co/datasets/henry1477/pcbslm-static-v2-unsloth-vlm-gui/resolve/main/assets/raw_docs/.../images/page.png"},
      {"type": "text", "text": "instruction..."}
    ]},
    {"role": "assistant", "content": [
      {"type": "text", "text": "{...json answer...}"}
    ]}
  ]
}
```
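To make the message layout concrete, here is a minimal sketch that walks one row and separates image references from the user and assistant text. The helper name `split_message_content` and the sample row (including its image URL and answer) are placeholders for illustration, not values taken from the dataset.

```python
import json


def split_message_content(row):
    """Return (image_refs, user_text, assistant_text) for one chat-format row."""
    images, user_text, assistant_text = [], [], []
    for message in row["messages"]:
        for part in message["content"]:
            if part["type"] == "image":
                images.append(part["image"])
            elif part["type"] == "text":
                # Route text by speaker role.
                target = user_text if message["role"] == "user" else assistant_text
                target.append(part["text"])
    return images, "\n".join(user_text), "\n".join(assistant_text)


# Placeholder row in the documented shape; URL and answer are made up.
sample_line = json.dumps({
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "https://example.invalid/page.png"},
            {"type": "text", "text": "Describe the board."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "{\"summary\": \"placeholder\"}"},
        ]},
    ]
})

images, prompt, answer = split_message_content(json.loads(sample_line))
print(images, prompt, answer)
```

Since the assistant text in the example above looks like a JSON string, a caller may want to pass `answer` through `json.loads`; whether every row parses cleanly is an assumption to verify against the data.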
## Files
- `data/vlm_train.jsonl`: 1455 multimodal examples
- `data/vlm_val.jsonl`: 100 multimodal examples
- `data/vlm_test.jsonl`: 135 multimodal examples
- `assets/raw_docs/`: source documents, rendered pages, and figure crops referenced by examples/metadata
- `assets/board_images/`: board render images referenced by examples
- `metadata_bundle.tar.gz`: document/evidence/rule metadata with repo-relative asset paths
- `asset_manifest.jsonl`: image refs with asset kind, local existence, and dimensions
- `quality_dropped_rows.jsonl`: deterministic filter drops and reasons
- `reports/quality_audit.md`: duplicate, citation, image mix, and docpack quality audit
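
As a quick integrity check after cloning the repository, the split sizes listed above can be compared against the files on disk. This is a minimal sketch: `count_jsonl_rows` and `check_split_sizes` are hypothetical helper names, and it assumes each non-empty line in a JSONL file is one example.

```python
from pathlib import Path

# Row counts as stated in the file list above.
EXPECTED_ROWS = {
    "data/vlm_train.jsonl": 1455,
    "data/vlm_val.jsonl": 100,
    "data/vlm_test.jsonl": 135,
}


def count_jsonl_rows(path):
    """Count non-empty lines in a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_split_sizes(repo_root, expected=EXPECTED_ROWS):
    """Return {relative_path: (found, expected)} for splits whose count differs."""
    mismatches = {}
    for rel_path, want in expected.items():
        found = count_jsonl_rows(Path(repo_root) / rel_path)
        if found != want:
            mismatches[rel_path] = (found, want)
    return mismatches
```

Calling `check_split_sizes("path/to/local/clone")` returns an empty dict when all three splits match the documented counts.
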
Each row includes `quality_score`, `quality_weight`, `quality_reasons`, and `mixture_bucket`. Use
`quality_weight` or `mixture_bucket` in your sampler to keep document-grounded rows at roughly
35-45% of each training batch without duplicating examples.
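
One way to hit that share is to assign per-row sampling weights by bucket rather than copying rows. The sketch below is an illustration under stated assumptions: the bucket label `"doc_grounded"` is hypothetical (check the actual `mixture_bucket` values in the data), the 0.40 target is just the midpoint of the range above, and `bucket_balanced_weights` is not part of any library.

```python
import random


def bucket_balanced_weights(rows, doc_bucket, doc_share=0.40):
    """Per-row weights so rows in `doc_bucket` carry `doc_share` of the total mass."""
    doc_count = sum(1 for row in rows if row["mixture_bucket"] == doc_bucket)
    other_count = len(rows) - doc_count
    if doc_count == 0 or other_count == 0:
        raise ValueError("need at least one row inside and outside the bucket")
    doc_weight = doc_share / doc_count
    other_weight = (1.0 - doc_share) / other_count
    return [
        doc_weight if row["mixture_bucket"] == doc_bucket else other_weight
        for row in rows
    ]


# Toy rows with a hypothetical bucket label: 2 document-grounded, 8 other.
rows = [{"id": i, "mixture_bucket": "doc_grounded" if i < 2 else "other"}
        for i in range(10)]
weights = bucket_balanced_weights(rows, doc_bucket="doc_grounded")

# Draw one batch; the expected document-grounded share is doc_share.
rng = random.Random(0)
batch = rng.choices(rows, weights=weights, k=8)
```

The same weight list could be handed to a weighted sampler in a training framework. Multiplying each weight by the row's `quality_weight` and renormalizing within each bucket would be a natural extension, though how the author intends the two fields to combine is not specified here.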
## Unsloth Smoke Test
The dataset was smoke-tested locally with `unsloth/gemma-4-E2B-it` using:
```bash
python scripts/smoke_train_unsloth_vlm.py \
  --dataset-repo-id henry1477/pcbslm-static-v2-unsloth-vlm-gui \
  --model-name unsloth/gemma-4-E2B-it \
  --limit 4 \
  --max-steps 2 \
  --max-images 1 \
  --max-seq-length 512 \
  --resize 256
```
Image entries are HTTPS URLs to files in this dataset repo. The repository includes the referenced document and board assets.
HF repo: `henry1477/pcbslm-static-v2-unsloth-vlm-gui`
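
Because image entries are `resolve/main` URLs into this repo, they can be mapped back to repo-relative paths when working from a local clone instead of fetching over HTTPS. The following is a minimal sketch; `repo_relative_path` is a hypothetical helper, and the file name in the usage line is a placeholder rather than a real asset.

```python
from urllib.parse import unquote, urlparse

REPO_ID = "henry1477/pcbslm-static-v2-unsloth-vlm-gui"


def repo_relative_path(image_url, repo_id=REPO_ID, revision="main"):
    """Strip the /datasets/<repo>/resolve/<revision>/ prefix from an image URL."""
    prefix = f"/datasets/{repo_id}/resolve/{revision}/"
    path = urlparse(image_url).path
    if not path.startswith(prefix):
        raise ValueError(f"URL does not point into {repo_id}@{revision}: {image_url}")
    return unquote(path[len(prefix):])


# Placeholder file name, used only to show the mapping.
url = ("https://huggingface.co/datasets/henry1477/pcbslm-static-v2-unsloth-vlm-gui"
       "/resolve/main/assets/board_images/example.png")
print(repo_relative_path(url))
```

The host is deliberately not checked, so the same function also accepts mirror URLs that keep the `/datasets/<repo>/resolve/<revision>/` path layout.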
Provider: henry1477



