
henry1477/pcbslm-static-v2-unsloth-vlm-gui

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/henry1477/pcbslm-static-v2-unsloth-vlm-gui
Description:
---
license: other
task_categories:
- image-text-to-text
- visual-question-answering
language:
- en
pretty_name: PCBSLM static-v2 Unsloth VLM
tags:
- unsloth
- gemma-4
- multimodal
- pcb
- electronics
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/vlm_train.jsonl
  - split: validation
    path: data/vlm_val.jsonl
  - split: test
    path: data/vlm_test.jsonl
---

# PCBSLM static-v2 Unsloth VLM

Portable multimodal Unsloth dataset for PCB layout/document-grounded training.

The JSONL splits use Unsloth/Gemma-style chat messages:

```json
{
  "messages": [
    {"role": "user", "content": [
      {"type": "image", "image": "https://huggingface.co/datasets/henry1477/pcbslm-static-v2-unsloth-vlm-gui/resolve/main/assets/raw_docs/.../images/page.png"},
      {"type": "text", "text": "instruction..."}
    ]},
    {"role": "assistant", "content": [
      {"type": "text", "text": "{...json answer...}"}
    ]}
  ]
}
```

## Files

- `data/vlm_train.jsonl`: 1455 multimodal examples
- `data/vlm_val.jsonl`: 100 multimodal examples
- `data/vlm_test.jsonl`: 135 multimodal examples
- `assets/raw_docs/`: source documents, rendered pages, and figure crops referenced by examples/metadata
- `assets/board_images/`: board render images referenced by examples
- `metadata_bundle.tar.gz`: document/evidence/rule metadata with repo-relative asset paths
- `asset_manifest.jsonl`: image refs with asset kind, local existence, and dimensions
- `quality_dropped_rows.jsonl`: deterministic filter drops and reasons
- `reports/quality_audit.md`: duplicate, citation, image mix, and docpack quality audit

Rows include `quality_score`, `quality_weight`, `quality_reasons`, and `mixture_bucket`. Use `quality_weight` or `mixture_bucket` in your sampler to keep document-grounded rows near 35-45% of training batches without duplicating examples.
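The sampling advice above could be sketched roughly as follows. This is only an illustration under stated assumptions: the card does not list the values `mixture_bucket` can take, so the bucket name `doc_grounded` is hypothetical, and the 0.4 target is simply the midpoint of the 35-45% range. The weighted draw uses Efraimidis-Spirakis keys, which is one possible technique and not something the dataset prescribes.

```python
import random


def weighted_sample(rows, k, rng):
    """Pick k distinct rows, favouring higher `quality_weight`.

    Efraimidis-Spirakis keys: each row gets u ** (1 / w) and the k
    largest keys win, so no row can be drawn twice.
    """
    keyed = [
        (rng.random() ** (1.0 / max(row["quality_weight"], 1e-9)), index)
        for index, row in enumerate(rows)
    ]
    keyed.sort(reverse=True)
    return [rows[index] for _, index in keyed[:k]]


def sample_batch(rows, batch_size, rng, doc_bucket="doc_grounded", doc_fraction=0.4):
    """Build one batch with about `doc_fraction` document-grounded rows.

    `doc_bucket` is a HYPOTHETICAL bucket name; inspect the real
    `mixture_bucket` values in the JSONL files before relying on it.
    """
    doc_rows = [r for r in rows if r["mixture_bucket"] == doc_bucket]
    other_rows = [r for r in rows if r["mixture_bucket"] != doc_bucket]
    n_doc = min(round(batch_size * doc_fraction), len(doc_rows))
    n_other = min(batch_size - n_doc, len(other_rows))
    return weighted_sample(doc_rows, n_doc, rng) + weighted_sample(other_rows, n_other, rng)
```

Passing an explicit `random.Random(seed)` keeps the draw reproducible. If either bucket holds fewer rows than requested, the batch comes back smaller rather than padded with repeats.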
## Unsloth Smoke Test

This was verified locally with `unsloth/gemma-4-E2B-it` using:

```bash
python scripts/smoke_train_unsloth_vlm.py \
  --dataset-repo-id henry1477/pcbslm-static-v2-unsloth-vlm-gui \
  --model-name unsloth/gemma-4-E2B-it \
  --limit 4 \
  --max-steps 2 \
  --max-images 1 \
  --max-seq-length 512 \
  --resize 256
```

Image entries are HTTPS URLs to files in this dataset repo. The repository includes the referenced document and board assets.

HF repo: `henry1477/pcbslm-static-v2-unsloth-vlm-gui`
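Because image entries are HTTPS URLs into this repo, it can be handy to map them back to repo-relative paths, for example to read from a local clone instead of fetching over the network. The sketch below assumes every URL uses the `resolve/main` prefix shown in the message example. The helper names and the file name `example.png` are hypothetical.

```python
REPO_ID = "henry1477/pcbslm-static-v2-unsloth-vlm-gui"
URL_PREFIX = f"https://huggingface.co/datasets/{REPO_ID}/resolve/main/"


def image_refs(row):
    """Return the image URLs found in one chat-format row."""
    return [
        part["image"]
        for message in row["messages"]
        for part in message["content"]
        if part.get("type") == "image"
    ]


def repo_relative(url):
    """Strip the resolve/main prefix; return None for URLs outside this repo."""
    if url.startswith(URL_PREFIX):
        return url[len(URL_PREFIX):]
    return None


# Hypothetical row: the file name below is made up for illustration.
example_row = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": URL_PREFIX + "assets/board_images/example.png"},
            {"type": "text", "text": "instruction..."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "{...json answer...}"},
        ]},
    ]
}

for url in image_refs(example_row):
    print(repo_relative(url))
```

Joining the returned path onto the root of a local checkout gives a file path. Checking it against `asset_manifest.jsonl` is a reasonable sanity step, though that file's exact schema is not documented here.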

Provider:
henry1477