uv-scripts/object-detection
Source: Hugging Face. Updated 2026-03-06; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/uv-scripts/object-detection
Resource description:
---
viewer: false
tags: [uv-script, object-detection]
---
# Object Detection Dataset Scripts
5 scripts to convert, validate, inspect, diff, and sample object detection datasets on the Hub. Supports 6 bbox formats — no setup required.
This repository is inspired by [panlabel](https://github.com/strickvl/panlabel).
## Quick Start
Convert bounding box formats without cloning anything:
```bash
# Convert COCO-style bboxes to YOLO normalized format
uv run convert-hf-dataset.py merve/coco-dataset merve/coco-yolo \
--from coco_xywh --to yolo --max-samples 100
```
That's it! The script will:
- Load the dataset from the Hub
- Convert all bounding boxes in place
- Push the result to a new dataset repo

When it finishes, view the results at `https://huggingface.co/datasets/merve/coco-yolo`.
## Scripts
| Script | Description |
|--------|-------------|
| `convert-hf-dataset.py` | Convert between 6 bbox formats and push to Hub |
| `validate-hf-dataset.py` | Check annotations for errors (invalid bboxes, duplicates, bounds) |
| `stats-hf-dataset.py` | Compute statistics (counts, label histogram, area, co-occurrence) |
| `diff-hf-datasets.py` | Compare two datasets semantically (IoU-based annotation matching) |
| `sample-hf-dataset.py` | Create subsets (random or stratified) and push to Hub |
## Supported Bbox Formats
All scripts support these 6 bounding box formats, matching the [panlabel](https://github.com/strickvl/panlabel) Rust CLI:
| Format | Encoding | Coordinate Space |
|--------|----------|------------------|
| `coco_xywh` | `[x, y, width, height]` | Pixels |
| `xyxy` | `[xmin, ymin, xmax, ymax]` | Pixels |
| `voc` | `[xmin, ymin, xmax, ymax]` | Pixels (alias for `xyxy`) |
| `yolo` | `[center_x, center_y, width, height]` | Normalized 0–1 |
| `tfod` | `[xmin, ymin, xmax, ymax]` | Normalized 0–1 |
| `label_studio` | `[x, y, width, height]` | Percentage 0–100 |
Conversions go through XYXY pixel-space as the intermediate representation, so any format can be converted to any other format.
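
The two-step idea above (source format to pixel-space XYXY, then XYXY to the target format) can be sketched in a few lines of Python. This is an illustrative sketch based only on the format table, not the scripts' actual code; the function names `to_xyxy`, `from_xyxy`, and `convert_bbox` are hypothetical.

```python
# Hypothetical helpers illustrating conversion through a pixel-space XYXY pivot.
# Encodings follow the format table above; this is not the scripts' real code.

def to_xyxy(bbox, fmt, img_w, img_h):
    """Convert one bbox in `fmt` to pixel-space [xmin, ymin, xmax, ymax]."""
    a, b, c, d = bbox
    if fmt in ("xyxy", "voc"):
        return [a, b, c, d]
    if fmt == "coco_xywh":  # pixel x, y, width, height
        return [a, b, a + c, b + d]
    if fmt == "yolo":  # normalized center_x, center_y, width, height
        return [(a - c / 2) * img_w, (b - d / 2) * img_h,
                (a + c / 2) * img_w, (b + d / 2) * img_h]
    if fmt == "tfod":  # normalized xmin, ymin, xmax, ymax
        return [a * img_w, b * img_h, c * img_w, d * img_h]
    if fmt == "label_studio":  # percentage x, y, width, height
        return [a / 100 * img_w, b / 100 * img_h,
                (a + c) / 100 * img_w, (b + d) / 100 * img_h]
    raise ValueError(f"unknown format: {fmt}")


def from_xyxy(xyxy, fmt, img_w, img_h):
    """Convert pixel-space [xmin, ymin, xmax, ymax] to `fmt`."""
    xmin, ymin, xmax, ymax = xyxy
    w, h = xmax - xmin, ymax - ymin
    if fmt in ("xyxy", "voc"):
        return [xmin, ymin, xmax, ymax]
    if fmt == "coco_xywh":
        return [xmin, ymin, w, h]
    if fmt == "yolo":
        return [(xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h,
                w / img_w, h / img_h]
    if fmt == "tfod":
        return [xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h]
    if fmt == "label_studio":
        return [xmin / img_w * 100, ymin / img_h * 100,
                w / img_w * 100, h / img_h * 100]
    raise ValueError(f"unknown format: {fmt}")


def convert_bbox(bbox, src, dst, img_w, img_h):
    """Any-to-any conversion: source -> XYXY pixels -> target."""
    return from_xyxy(to_xyxy(bbox, src, img_w, img_h), dst, img_w, img_h)


# A 200x100 px box at (100, 50) in a 400x200 image, COCO -> YOLO.
print(convert_bbox([100, 50, 200, 100], "coco_xywh", "yolo", 400, 200))
# -> [0.5, 0.5, 0.5, 0.5]
```

Because every format only needs one function to the pivot and one back, supporting a new format means adding two branches rather than one converter per format pair. It also shows why the normalized and percentage formats need the image width and height.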
## Common Options
All scripts accept flexible column mapping. Datasets can store annotations as flat columns or nested under an `objects` dict — both layouts are handled automatically.
| Option | Description |
|--------|-------------|
| `--bbox-column` | Column containing bboxes (default: `bbox`) |
| `--category-column` | Column containing category labels (default: `category`) |
| `--width-column` | Column for image width (default: `width`) |
| `--height-column` | Column for image height (default: `height`) |
| `--split` | Dataset split (default: `train`) |
| `--max-samples` | Limit number of samples (useful for testing) |
| `--hf-token` | HF API token (or set `HF_TOKEN` env var) |
| `--private` | Make output dataset private |
Every script supports `--help` to see all available options:
```bash
uv run convert-hf-dataset.py --help
```
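
To make the flat and nested annotation layouts mentioned above concrete, here is a minimal sketch of how a reader function might handle both. The helper name `get_annotations` and its lookup logic are assumptions for illustration; the scripts' real detection logic may differ.

```python
# Hypothetical helper: read bboxes and categories from either layout.
# Flat:   {"bbox": [...], "category": [...]}
# Nested: {"objects": {"bbox": [...], "category": [...]}}

def get_annotations(example, bbox_column="bbox", category_column="category"):
    """Return (bboxes, categories) from a flat or `objects`-nested example."""
    objects = example.get("objects")
    # Prefer the nested dict when it actually holds the bbox column.
    if isinstance(objects, dict) and bbox_column in objects:
        container = objects
    else:
        container = example
    bboxes = container.get(bbox_column) or []
    categories = container.get(category_column) or []
    return list(bboxes), list(categories)


flat = {"bbox": [[1, 2, 3, 4]], "category": ["cat"], "width": 10, "height": 10}
nested = {"objects": {"bbox": [[1, 2, 3, 4]], "category": ["cat"]},
          "width": 10, "height": 10}
print(get_annotations(flat) == get_annotations(nested))  # -> True
```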
## Convert (`convert-hf-dataset.py`)
Convert bounding boxes between any of the 6 supported formats:
```bash
# COCO -> XYXY
uv run convert-hf-dataset.py merve/license-plates merve/license-plates-voc \
--from coco_xywh --to voc
# COCO -> YOLO
uv run convert-hf-dataset.py merve/license-plates merve/license-plates-yolo \
--from coco_xywh --to yolo
# TFOD (normalized xyxy) -> COCO
uv run convert-hf-dataset.py merve/license-plates-tfod merve/license-plates-coco \
--from tfod --to coco_xywh
# Label Studio (percentage xywh) -> XYXY
uv run convert-hf-dataset.py merve/ls-dataset merve/ls-xyxy \
--from label_studio --to xyxy
# Test on 10 samples first
uv run convert-hf-dataset.py merve/dataset merve/converted \
--from xyxy --to yolo --max-samples 10
# Shuffle before converting a subset
uv run convert-hf-dataset.py merve/dataset merve/converted \
--from coco_xywh --to tfod --max-samples 500 --shuffle
```
| Option | Description |
|--------|-------------|
| `--from` | Source bbox format (required) |
| `--to` | Target bbox format (required) |
| `--batch-size` | Batch size for map (default: 1000) |
| `--create-pr` | Push as PR instead of direct commit |
| `--shuffle` | Shuffle dataset before processing |
| `--seed` | Random seed for shuffling (default: 42) |
## Validate (`validate-hf-dataset.py`)
Check annotations for common issues:
```bash
# Basic validation
uv run validate-hf-dataset.py merve/coco-dataset
# Validate YOLO-format dataset
uv run validate-hf-dataset.py merve/yolo-dataset --bbox-format yolo
# Validate TFOD-format dataset
uv run validate-hf-dataset.py merve/tfod-dataset --bbox-format tfod
# Strict mode (warnings become errors)
uv run validate-hf-dataset.py merve/dataset --strict
# JSON report
uv run validate-hf-dataset.py merve/dataset --report json
# Stream large datasets without full download
uv run validate-hf-dataset.py merve/huge-dataset --streaming --max-samples 5000
# Push validation report to Hub
uv run validate-hf-dataset.py merve/dataset --output-dataset merve/validation-report
```
**Issue Codes:**
| Code | Level | Description |
|------|-------|-------------|
| E001 | Error | Bbox/category count mismatch |
| E002 | Error | Invalid bbox (missing values) |
| E003 | Error | Non-finite coordinates (NaN/Inf) |
| E004 | Error | xmin > xmax |
| E005 | Error | ymin > ymax |
| W001 | Warning | No annotations in example |
| W002 | Warning | Zero or negative area |
| W003 | Warning | Bbox before image origin |
| W004 | Warning | Bbox beyond image bounds |
| W005 | Warning | Empty category label |
| W006 | Warning | Duplicate file name |
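
As a rough illustration of how the per-bbox codes in the table could be assigned, the sketch below checks a single pixel-space XYXY box. The function name `check_bbox_xyxy` is hypothetical and the exact conditions are guesses from the code descriptions, so the real script may apply them differently (for example, whether an inverted box also triggers W002).

```python
import math

# Hypothetical per-bbox checker; conditions are inferred from the table above.

def check_bbox_xyxy(bbox, img_w, img_h):
    """Return the issue codes that apply to one [xmin, ymin, xmax, ymax] bbox."""
    if bbox is None or len(bbox) != 4 or any(v is None for v in bbox):
        return ["E002"]  # missing values: nothing else can be checked
    if not all(math.isfinite(v) for v in bbox):
        return ["E003"]  # NaN/Inf: comparisons below would be meaningless
    xmin, ymin, xmax, ymax = bbox
    codes = []
    if xmin > xmax:
        codes.append("E004")
    if ymin > ymax:
        codes.append("E005")
    if xmax - xmin <= 0 or ymax - ymin <= 0:
        codes.append("W002")  # zero or negative extent
    if xmin < 0 or ymin < 0:
        codes.append("W003")  # starts before the image origin
    if xmax > img_w or ymax > img_h:
        codes.append("W004")  # extends past the image bounds
    return codes


print(check_bbox_xyxy([10, 10, 50, 50], 100, 100))   # -> []
print(check_bbox_xyxy([-5, 10, 120, 50], 100, 100))  # -> ['W003', 'W004']
```

The image-level codes (E001, W001, W005, W006) need the whole example or the whole dataset, so they are left out of this per-bbox sketch.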
## Stats (`stats-hf-dataset.py`)
Compute rich statistics for a dataset:
```bash
# Basic stats
uv run stats-hf-dataset.py merve/coco-dataset
# Top 20 label histogram, JSON output
uv run stats-hf-dataset.py merve/dataset --top 20 --report json
# Stats for TFOD-format dataset
uv run stats-hf-dataset.py merve/dataset --bbox-format tfod
# Stream large datasets
uv run stats-hf-dataset.py merve/huge-dataset --streaming --max-samples 10000
# Push stats report to Hub
uv run stats-hf-dataset.py merve/dataset --output-dataset merve/stats-report
```
Reports include: summary counts, label distribution, annotation density, bbox area/aspect ratio distributions, per-category area stats, category co-occurrence pairs, and image resolution distribution.
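
Two of the report items, the label distribution and the category co-occurrence pairs, can be sketched with the standard library alone. This is an assumed, simplified version in which co-occurrence means two distinct categories appearing in the same image; the helper name `label_stats` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical helper: label histogram plus per-image category co-occurrence.

def label_stats(per_image_categories):
    """Count annotations per label and images per unordered category pair."""
    label_counts = Counter()
    pair_counts = Counter()
    for cats in per_image_categories:
        label_counts.update(cats)
        # Each unordered pair counts once per image, however many boxes it has.
        for pair in combinations(sorted(set(cats)), 2):
            pair_counts[pair] += 1
    return label_counts, pair_counts


images = [["cat", "dog"], ["cat", "cat"], ["dog", "bird", "cat"]]
labels, pairs = label_stats(images)
print(labels.most_common(2))   # -> [('cat', 4), ('dog', 2)]
print(pairs[("cat", "dog")])   # -> 2
```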
## Diff (`diff-hf-datasets.py`)
Compare two datasets semantically using IoU-based annotation matching:
```bash
# Basic diff
uv run diff-hf-datasets.py merve/dataset-v1 merve/dataset-v2
# Stricter matching
uv run diff-hf-datasets.py merve/old merve/new --iou-threshold 0.7
# Per-annotation change details
uv run diff-hf-datasets.py merve/old merve/new --detail
# JSON report
uv run diff-hf-datasets.py merve/old merve/new --report json
```
Reports include: shared/unique images, shared/unique categories, matched/added/removed/modified annotations.
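
For readers unfamiliar with IoU-based matching, the sketch below pairs the boxes of one image across two dataset versions. It uses a simple greedy one-to-one strategy, which is an assumption: the script's actual matching algorithm is not documented here, and the default threshold of 0.5 is a placeholder rather than the script's default. The names `iou` and `match_annotations` are hypothetical.

```python
# Hypothetical greedy matcher over pixel-space [xmin, ymin, xmax, ymax] boxes.

def iou(a, b):
    """Intersection over union of two XYXY boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_annotations(old, new, iou_threshold=0.5):
    """Return (matched index pairs, removed old indices, added new indices)."""
    # Score every old/new pair, then accept the best-overlapping pairs first.
    candidates = sorted(
        ((iou(o, n), i, j) for i, o in enumerate(old) for j, n in enumerate(new)),
        reverse=True,
    )
    used_old, used_new, matched = set(), set(), []
    for score, i, j in candidates:
        if score < iou_threshold:
            break  # everything after this overlaps even less
        if i in used_old or j in used_new:
            continue  # each box may be matched at most once
        matched.append((i, j))
        used_old.add(i)
        used_new.add(j)
    removed = [i for i in range(len(old)) if i not in used_old]
    added = [j for j in range(len(new)) if j not in used_new]
    return matched, removed, added


old_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
new_boxes = [[1, 0, 11, 10], [50, 50, 60, 60]]
print(match_annotations(old_boxes, new_boxes))  # -> ([(0, 0)], [1], [1])
```

Raising the threshold, as `--iou-threshold 0.7` does, means slightly shifted boxes stop counting as the same annotation and show up as one removal plus one addition instead.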
## Sample (`sample-hf-dataset.py`)
Create random or stratified subsets:
```bash
# Random 500 samples
uv run sample-hf-dataset.py merve/dataset merve/subset -n 500
# 10% fraction
uv run sample-hf-dataset.py merve/dataset merve/subset --fraction 0.1
# Stratified sampling (preserves class distribution)
uv run sample-hf-dataset.py merve/dataset merve/subset \
-n 200 --strategy stratified
# Filter by categories
uv run sample-hf-dataset.py merve/dataset merve/subset \
-n 100 --categories "cat,dog,bird"
# Reproducible sampling
uv run sample-hf-dataset.py merve/dataset merve/subset \
-n 500 --seed 42
```
| Option | Description |
|--------|-------------|
| `-n` | Number of samples to select |
| `--fraction` | Fraction of dataset (0.0–1.0) |
| `--strategy` | `random` (default) or `stratified` |
| `--categories` | Comma-separated list of categories to filter by |
| `--category-mode` | `images` (default) or `annotations` |
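
The difference between `random` and `stratified` can be illustrated with a small sketch. It makes two simplifying assumptions that may not match the script: each image is assigned to a stratum by its first category, and each stratum receives a share of `n` proportional to its size (rounded, so the total can differ slightly from `n`). The function name `stratified_sample` and its `seed` default are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical proportional stratified sampler over image indices.

def stratified_sample(categories_per_image, n, seed=42):
    """Pick roughly n image indices while keeping class proportions."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    strata = defaultdict(list)
    for idx, cats in enumerate(categories_per_image):
        key = cats[0] if cats else None  # assumption: stratify on first label
        strata[key].append(idx)
    total = len(categories_per_image)
    picked = []
    for key in sorted(strata, key=str):  # fixed order keeps runs deterministic
        members = strata[key]
        # Proportional share, at least one per stratum, never more than exist.
        k = min(len(members), max(1, round(n * len(members) / total)))
        picked.extend(rng.sample(members, k))
    return sorted(picked)


data = [["cat"]] * 80 + [["dog"]] * 20
subset = stratified_sample(data, 10)
print(sum(i < 80 for i in subset), sum(i >= 80 for i in subset))  # -> 8 2
```

A purely random draw of 10 from this data could easily return 10 cats; the stratified version keeps the 80/20 split in the subset.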
## Run Locally
```bash
# Clone and run
git clone https://huggingface.co/datasets/uv-scripts/object-detection
cd object-detection
uv run convert-hf-dataset.py input-dataset output-dataset --from coco_xywh --to yolo
# Or run directly from URL
uv run https://huggingface.co/datasets/uv-scripts/object-detection/raw/main/convert-hf-dataset.py \
input-dataset output-dataset --from coco_xywh --to yolo
```
Works with any Hugging Face dataset containing object detection annotations — COCO, YOLO, VOC, TFOD, or Label Studio format.
Provider: uv-scripts



