
uv-scripts/object-detection

Source: Hugging Face. Updated 2026-03-06; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/uv-scripts/object-detection
Description:
---
viewer: false
tags: [uv-script, object-detection]
---

# Object Detection Dataset Scripts

5 scripts to convert, validate, inspect, diff, and sample object detection datasets on the Hub. Supports 6 bbox formats, with no setup required.

This repository is inspired by [panlabel](https://github.com/strickvl/panlabel).

## Quick Start

Convert bounding box formats without cloning anything:

```bash
# Convert COCO-style bboxes to YOLO normalized format
uv run convert-hf-dataset.py merve/coco-dataset merve/coco-yolo \
  --from coco_xywh --to yolo --max-samples 100
```

That's it! The script will:

- Load the dataset from the Hub
- Convert all bounding boxes in-place
- Push the result to a new dataset repo

View results at: `https://huggingface.co/datasets/merve/coco-yolo`

## Scripts

| Script | Description |
|--------|-------------|
| `convert-hf-dataset.py` | Convert between 6 bbox formats and push to Hub |
| `validate-hf-dataset.py` | Check annotations for errors (invalid bboxes, duplicates, bounds) |
| `stats-hf-dataset.py` | Compute statistics (counts, label histogram, area, co-occurrence) |
| `diff-hf-datasets.py` | Compare two datasets semantically (IoU-based annotation matching) |
| `sample-hf-dataset.py` | Create subsets (random or stratified) and push to Hub |

## Supported Bbox Formats

All scripts support these 6 bounding box formats, matching the [panlabel](https://github.com/strickvl/panlabel) Rust CLI:

| Format | Encoding | Coordinate Space |
|--------|----------|------------------|
| `coco_xywh` | `[x, y, width, height]` | Pixels |
| `xyxy` | `[xmin, ymin, xmax, ymax]` | Pixels |
| `voc` | `[xmin, ymin, xmax, ymax]` | Pixels (alias for `xyxy`) |
| `yolo` | `[center_x, center_y, width, height]` | Normalized 0–1 |
| `tfod` | `[xmin, ymin, xmax, ymax]` | Normalized 0–1 |
| `label_studio` | `[x, y, width, height]` | Percentage 0–100 |

Conversions go through XYXY pixel-space as the intermediate representation, so any format can be converted to any other format.
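As a rough illustration of the XYXY pivot described above, here is a minimal Python sketch. The function names `to_xyxy`, `from_xyxy`, and `convert_bbox` are hypothetical and are not taken from the scripts. The sketch also assumes that `x, y` in `coco_xywh` and `label_studio` denote the top-left corner, which the table does not state explicitly.

```python
from typing import List, Sequence


def to_xyxy(bbox: Sequence[float], fmt: str, img_w: float, img_h: float) -> List[float]:
    """Convert a box in `fmt` to pixel-space [xmin, ymin, xmax, ymax]."""
    a, b, c, d = (float(v) for v in bbox)
    if fmt in ("xyxy", "voc"):  # already pixel-space corners
        return [a, b, c, d]
    if fmt == "coco_xywh":  # assumed: top-left corner plus size, in pixels
        return [a, b, a + c, b + d]
    if fmt == "yolo":  # normalized center plus size
        return [(a - c / 2) * img_w, (b - d / 2) * img_h,
                (a + c / 2) * img_w, (b + d / 2) * img_h]
    if fmt == "tfod":  # normalized corners
        return [a * img_w, b * img_h, c * img_w, d * img_h]
    if fmt == "label_studio":  # assumed: top-left corner plus size, in percent
        return [a / 100 * img_w, b / 100 * img_h,
                (a + c) / 100 * img_w, (b + d) / 100 * img_h]
    raise ValueError(f"unknown bbox format: {fmt}")


def from_xyxy(box: Sequence[float], fmt: str, img_w: float, img_h: float) -> List[float]:
    """Convert pixel-space [xmin, ymin, xmax, ymax] to the target format."""
    xmin, ymin, xmax, ymax = box
    if fmt in ("xyxy", "voc"):
        return [xmin, ymin, xmax, ymax]
    if fmt == "coco_xywh":
        return [xmin, ymin, xmax - xmin, ymax - ymin]
    if fmt == "yolo":
        return [(xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h,
                (xmax - xmin) / img_w, (ymax - ymin) / img_h]
    if fmt == "tfod":
        return [xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h]
    if fmt == "label_studio":
        return [xmin / img_w * 100, ymin / img_h * 100,
                (xmax - xmin) / img_w * 100, (ymax - ymin) / img_h * 100]
    raise ValueError(f"unknown bbox format: {fmt}")


def convert_bbox(bbox, src: str, dst: str, img_w: float, img_h: float) -> List[float]:
    """Any-to-any conversion by pivoting through pixel-space XYXY."""
    return from_xyxy(to_xyxy(bbox, src, img_w, img_h), dst, img_w, img_h)


# A 200x100 px box at (100, 50) in a 400x200 px image
print(convert_bbox([100, 50, 200, 100], "coco_xywh", "yolo", 400, 200))
# [0.5, 0.5, 0.5, 0.5]
```

Because every format only needs one function into XYXY and one out of it, supporting a new format in a design like this means adding two branches rather than one per format pair.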
## Common Options

All scripts accept flexible column mapping. Datasets can store annotations as flat columns or nested under an `objects` dict; both layouts are handled automatically.

| Option | Description |
|--------|-------------|
| `--bbox-column` | Column containing bboxes (default: `bbox`) |
| `--category-column` | Column containing category labels (default: `category`) |
| `--width-column` | Column for image width (default: `width`) |
| `--height-column` | Column for image height (default: `height`) |
| `--split` | Dataset split (default: `train`) |
| `--max-samples` | Limit number of samples (useful for testing) |
| `--hf-token` | HF API token (or set `HF_TOKEN` env var) |
| `--private` | Make output dataset private |

Every script supports `--help` to see all available options:

```bash
uv run convert-hf-dataset.py --help
```

## Convert (`convert-hf-dataset.py`)

Convert bounding boxes between any of the 6 supported formats:

```bash
# COCO -> XYXY (via the voc alias)
uv run convert-hf-dataset.py merve/license-plates merve/license-plates-voc \
  --from coco_xywh --to voc

# COCO -> YOLO
uv run convert-hf-dataset.py merve/license-plates merve/license-plates-yolo \
  --from coco_xywh --to yolo

# TFOD (normalized xyxy) -> COCO
uv run convert-hf-dataset.py merve/license-plates-tfod merve/license-plates-coco \
  --from tfod --to coco_xywh

# Label Studio (percentage xywh) -> XYXY
uv run convert-hf-dataset.py merve/ls-dataset merve/ls-xyxy \
  --from label_studio --to xyxy

# Test on 10 samples first
uv run convert-hf-dataset.py merve/dataset merve/converted \
  --from xyxy --to yolo --max-samples 10

# Shuffle before converting a subset
uv run convert-hf-dataset.py merve/dataset merve/converted \
  --from coco_xywh --to tfod --max-samples 500 --shuffle
```

| Option | Description |
|--------|-------------|
| `--from` | Source bbox format (required) |
| `--to` | Target bbox format (required) |
| `--batch-size` | Batch size for map (default: 1000) |
| `--create-pr` | Push as PR instead of direct commit |
| `--shuffle` | Shuffle dataset before processing |
| `--seed` | Random seed for shuffling (default: 42) |

## Validate (`validate-hf-dataset.py`)

Check annotations for common issues:

```bash
# Basic validation
uv run validate-hf-dataset.py merve/coco-dataset

# Validate YOLO-format dataset
uv run validate-hf-dataset.py merve/yolo-dataset --bbox-format yolo

# Validate TFOD-format dataset
uv run validate-hf-dataset.py merve/tfod-dataset --bbox-format tfod

# Strict mode (warnings become errors)
uv run validate-hf-dataset.py merve/dataset --strict

# JSON report
uv run validate-hf-dataset.py merve/dataset --report json

# Stream large datasets without full download
uv run validate-hf-dataset.py merve/huge-dataset --streaming --max-samples 5000

# Push validation report to Hub
uv run validate-hf-dataset.py merve/dataset --output-dataset merve/validation-report
```

**Issue Codes:**

| Code | Level | Description |
|------|-------|-------------|
| E001 | Error | Bbox/category count mismatch |
| E002 | Error | Invalid bbox (missing values) |
| E003 | Error | Non-finite coordinates (NaN/Inf) |
| E004 | Error | xmin > xmax |
| E005 | Error | ymin > ymax |
| W001 | Warning | No annotations in example |
| W002 | Warning | Zero or negative area |
| W003 | Warning | Bbox before image origin |
| W004 | Warning | Bbox beyond image bounds |
| W005 | Warning | Empty category label |
| W006 | Warning | Duplicate file name |

## Stats (`stats-hf-dataset.py`)

Compute rich statistics for a dataset:

```bash
# Basic stats
uv run stats-hf-dataset.py merve/coco-dataset

# Top 20 label histogram, JSON output
uv run stats-hf-dataset.py merve/dataset --top 20 --report json

# Stats for TFOD-format dataset
uv run stats-hf-dataset.py merve/dataset --bbox-format tfod

# Stream large datasets
uv run stats-hf-dataset.py merve/huge-dataset --streaming --max-samples 10000

# Push stats report to Hub
uv run stats-hf-dataset.py merve/dataset --output-dataset merve/stats-report
```

Reports include: summary counts, label distribution, annotation density, bbox area/aspect ratio distributions, per-category area stats, category co-occurrence pairs, and image resolution distribution.

## Diff (`diff-hf-datasets.py`)

Compare two datasets semantically using IoU-based annotation matching:

```bash
# Basic diff
uv run diff-hf-datasets.py merve/dataset-v1 merve/dataset-v2

# Stricter matching
uv run diff-hf-datasets.py merve/old merve/new --iou-threshold 0.7

# Per-annotation change details
uv run diff-hf-datasets.py merve/old merve/new --detail

# JSON report
uv run diff-hf-datasets.py merve/old merve/new --report json
```

Reports include: shared/unique images, shared/unique categories, matched/added/removed/modified annotations.

## Sample (`sample-hf-dataset.py`)

Create random or stratified subsets:

```bash
# Random 500 samples
uv run sample-hf-dataset.py merve/dataset merve/subset -n 500

# 10% fraction
uv run sample-hf-dataset.py merve/dataset merve/subset --fraction 0.1

# Stratified sampling (preserves class distribution)
uv run sample-hf-dataset.py merve/dataset merve/subset \
  -n 200 --strategy stratified

# Filter by categories
uv run sample-hf-dataset.py merve/dataset merve/subset \
  -n 100 --categories "cat,dog,bird"

# Reproducible sampling
uv run sample-hf-dataset.py merve/dataset merve/subset \
  -n 500 --seed 42
```

| Option | Description |
|--------|-------------|
| `-n` | Number of samples to select |
| `--fraction` | Fraction of dataset (0.0–1.0) |
| `--strategy` | `random` (default) or `stratified` |
| `--categories` | Comma-separated list of categories to filter by |
| `--category-mode` | `images` (default) or `annotations` |

## Run Locally

```bash
# Clone and run
git clone https://huggingface.co/datasets/uv-scripts/panlabel
cd panlabel
uv run convert-hf-dataset.py input-dataset output-dataset --from coco_xywh --to yolo

# Or run directly from URL
uv run https://huggingface.co/datasets/uv-scripts/panlabel/raw/main/convert-hf-dataset.py \
  input-dataset output-dataset --from coco_xywh --to yolo
```

Works with any Hugging Face dataset containing object detection annotations in COCO, YOLO, VOC, TFOD, or Label Studio format.
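To make the per-box issue codes from the Validate section more concrete, the following is a minimal Python sketch of how such checks might look for a single pixel-space XYXY box. The function name `check_xyxy_bbox` is hypothetical, and the exact conditions (for example, treating a non-positive width or height as W002) are assumptions; the real `validate-hf-dataset.py` may define them differently. Dataset-level codes such as E001, W001, W005, and W006 are not covered.

```python
import math
from typing import List, Optional, Sequence


def check_xyxy_bbox(bbox: Optional[Sequence[float]], img_w: float, img_h: float) -> List[str]:
    """Return assumed issue codes for one [xmin, ymin, xmax, ymax] box in pixels."""
    # E002: the box is missing, has the wrong length, or contains missing values
    if bbox is None or len(bbox) != 4 or any(v is None for v in bbox):
        return ["E002"]
    # E003: NaN or infinite coordinates make every later comparison meaningless
    if not all(math.isfinite(v) for v in bbox):
        return ["E003"]

    xmin, ymin, xmax, ymax = bbox
    issues: List[str] = []
    if xmin > xmax:
        issues.append("E004")
    if ymin > ymax:
        issues.append("E005")
    # W002 (assumed rule): a non-positive width or height means zero or negative area
    if (xmax - xmin) <= 0 or (ymax - ymin) <= 0:
        issues.append("W002")
    # W003: the box starts before the image origin (0, 0)
    if xmin < 0 or ymin < 0:
        issues.append("W003")
    # W004: the box extends past the image width or height
    if xmax > img_w or ymax > img_h:
        issues.append("W004")
    return issues


print(check_xyxy_bbox([10, 10, 50, 50], 100, 100))   # []
print(check_xyxy_bbox([60, 10, 50, 50], 100, 100))   # ['E004', 'W002']
print(check_xyxy_bbox([-5, 10, 50, 120], 100, 100))  # ['W003', 'W004']
```

Boxes in other formats could be checked the same way after first converting them to pixel-space XYXY.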

Provider:
uv-scripts