DataSeeds.AI-Sample-Dataset-DSD
Source: ModelScope community. Updated: 2025-12-04. Indexed: 2025-06-21.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/DataSeeds.AI-Sample-Dataset-DSD
# DataSeeds.AI Sample Dataset (DSD)

## Dataset Summary
The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated, computer-vision-ready dataset comprising 7,772 peer-ranked, fully annotated photographic images, more than 350,000 words of descriptive text, and comprehensive metadata. While the DSD is released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.
Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.
* **Technical Report** - [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673)
* **GitHub Repo** - Access the complete weights and code that were used to evaluate the DSD: [https://github.com/DataSeeds-ai/DSD-finetune-blip-llava](https://github.com/DataSeeds-ai/DSD-finetune-blip-llava)
This dataset is ready for commercial/non-commercial use.
## Dataset Structure
* **Size**: 7,772 images (7,010 train, 762 validation)
* **Format**: Apache Parquet files for metadata, with images in JPG format
* **Total Size**: ~4.1GB
* **Languages**: English (annotations)
* **Annotation Quality**: All annotations were verified through a multi-tier human-in-the-loop process
### Data Fields
| Column Name | Description | Data Type |
|-------------|-------------|-----------|
| `image_id` | Unique identifier for the image | string |
| `image` | Image file, PIL type | image |
| `image_title` | Human-written title summarizing the content or subject | string |
| `image_description` | Human-written narrative describing what is visibly present | string |
| `scene_description` | Technical and compositional details about image capture | string |
| `all_labels` | All object categories identified in the image | list of strings |
| `segmented_objects` | Objects/elements that have segmentation masks | list of strings |
| `segmentation_masks` | Segmentation polygons as coordinate points [x,y,...] | list of lists of floats |
| `exif_make` | Camera manufacturer | string |
| `exif_model` | Camera model | string |
| `exif_f_number` | Aperture value (lower = wider aperture) | string |
| `exif_exposure_time` | Sensor exposure time (e.g., 1/500 sec) | string |
| `exif_exposure_mode` | Camera exposure setting (Auto/Manual/etc.) | string |
| `exif_exposure_program` | Exposure program mode | string |
| `exif_metering_mode` | Light metering mode | string |
| `exif_lens` | Lens information and specifications | string |
| `exif_focal_length` | Lens focal length (millimeters) | string |
| `exif_iso` | Camera sensor sensitivity to light | string |
| `exif_date_original` | Original timestamp when image was taken | string |
| `exif_software` | Post-processing software used | string |
| `exif_orientation` | Image layout (horizontal/vertical) | string |
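The `segmentation_masks` column stores each polygon as a flat `[x, y, ...]` list rather than as an image, so most uses need the coordinates regrouped first. The following is a minimal sketch of that step, assuming the values are pixel coordinates in alternating x, y order. The helper names and the sample polygon are illustrative and not part of the dataset.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_points(flat: List[float]) -> List[Point]:
    """Turn a flat [x0, y0, x1, y1, ...] list into (x, y) pairs."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return list(zip(flat[0::2], flat[1::2]))


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(points: List[Point]) -> float:
    """Area via the shoelace formula (independent of vertex order)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# Synthetic example: a 4 x 3 rectangle, not a real DSD record.
example_mask = [10.0, 20.0, 14.0, 20.0, 14.0, 23.0, 10.0, 23.0]
points = polygon_points(example_mask)
print(points)
print(bounding_box(points))
print(polygon_area(points))
```

The resulting list of `(x, y)` tuples can also be passed to `PIL.ImageDraw.Draw(...).polygon(...)` to rasterize a binary mask if a pixel-level representation is needed.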
## How to Use
### Basic Loading
```python
from datasets import load_dataset
# Load the training split of the dataset
dataset = load_dataset("Dataseeds/DataSeeds.AI-Sample-Dataset-DSD", split="train")
# Access the first sample
sample = dataset[0]
# Extract the different features from the sample
image = sample["image"] # The PIL Image object
title = sample["image_title"]
description = sample["image_description"]
segments = sample["segmented_objects"]
masks = sample["segmentation_masks"] # Segmentation polygons as lists of [x, y, ...] coordinates
print(f"Title: {title}")
print(f"Description: {description}")
print(f"Segmented objects: {segments}")
```
### PyTorch DataLoader
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load dataset
dataset = load_dataset("Dataseeds/DataSeeds.AI-Sample-Dataset-DSD", split="train")

# Images and polygon lists can differ in size between samples, so the default
# collate function cannot stack them; keep each field as a per-sample list
def collate_fn(batch):
    return {
        key: [example[key] for example in batch]
        for key in ("image", "image_title", "segmentation_masks")
    }

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)
```
### TensorFlow
```python
import numpy as np
import tensorflow as tf
from datasets import load_dataset

TARGET_IMG_SIZE = (224, 224)
BATCH_SIZE = 16

dataset = load_dataset("Dataseeds/DataSeeds.AI-Sample-Dataset-DSD", split="train")

def hf_dataset_generator():
    for example in dataset:
        # Convert the PIL image to an RGB uint8 array so it matches the signature below
        yield np.asarray(example['image'].convert('RGB')), example['image_title']

def preprocess(image, title):
    # Resize the image to a fixed size (tf.image.resize returns float32)
    image = tf.image.resize(image, TARGET_IMG_SIZE)
    image = tf.cast(image, tf.uint8)
    return image, title

# The output_signature defines the data types and shapes
tf_dataset = tf.data.Dataset.from_generator(
    hf_dataset_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.string),
    )
)

# Apply the preprocessing, shuffle, and batch
tf_dataset = (
    tf_dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=100)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

print("Dataset is ready.")
for images, titles in tf_dataset.take(1):
    print("Image batch shape:", images.shape)
    print("A title from the batch:", titles.numpy()[0].decode('utf-8'))
```
## Dataset Characterization
**Data Collection Method**: Manual curation from GuruShots photography platform
**Labeling Method**: Human annotators with multi-tier verification process
## Benchmark Results
To validate the impact of data quality, we fine-tuned two state-of-the-art vision-language models, **LLaVA-NEXT** and **BLIP2**, on the DSD scene description task. Fine-tuning improved all four metrics for LLaVA-NEXT; for BLIP2 it raised BLEU-4 and ROUGE-L sharply but lowered BERTScore F1 and CLIPScore:
### LLaVA-NEXT Results
| Model | BLEU-4 | ROUGE-L | BERTScore F1 | CLIPScore |
|-------|--------|---------|--------------|-----------|
| Base | 0.0199 | 0.2089 | 0.2751 | 0.3247 |
| Fine-tuned | 0.0246 | 0.2140 | 0.2789 | 0.3260 |
| **Relative Improvement** | **+24.09%** | **+2.44%** | **+1.40%** | **+0.41%** |
### BLIP2 Results
| Model | BLEU-4 | ROUGE-L | BERTScore F1 | CLIPScore |
|-------|--------|---------|--------------|-----------|
| Base | 0.001 | 0.126 | 0.0545 | 0.2854 |
| Fine-tuned | 0.047 | 0.242 | -0.0537 | 0.2583 |
| **Relative Improvement** | **+4600%** | **+92.06%** | -198.53% | -9.49% |
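These results indicate the dataset's value in improving scene understanding and textual grounding of visual features, especially in fine-grained photographic tasks, although the BLIP2 declines in BERTScore F1 and CLIPScore show that the gains are not uniform across metrics.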
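The "Relative Improvement" rows in the tables above follow the usual percentage-change formula, (fine-tuned - base) / |base| x 100. The sketch below recomputes them for BLIP2 from the rounded scores printed in the table; the function name is illustrative. Because the table values are rounded, a recomputed percentage can differ slightly from the reported one, presumably because the reported figures were derived from unrounded scores.

```python
def relative_improvement(base: float, fine_tuned: float) -> float:
    """Percentage change of the fine-tuned score relative to the base score."""
    return (fine_tuned - base) / abs(base) * 100.0


# (base, fine-tuned) pairs copied from the BLIP2 table above.
blip2_scores = {
    "BLEU-4": (0.001, 0.047),
    "ROUGE-L": (0.126, 0.242),
    "BERTScore F1": (0.0545, -0.0537),
    "CLIPScore": (0.2854, 0.2583),
}

for metric, (base, tuned) in blip2_scores.items():
    print(f"{metric}: {relative_improvement(base, tuned):+.2f}%")
```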
## Use Cases
The DSD is well suited to fine-tuning multimodal models for:
* **Image captioning** - Rich human-written descriptions
* **Scene description** - Technical photography analysis
* **Semantic segmentation** - Pixel-level object understanding
* **Aesthetic evaluation** - Style classification based on peer rankings
* **EXIF-aware analysis** - Technical metadata integration
* **Multimodal training** - Vision-language model development
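For EXIF-aware analysis, note that every `exif_*` column is typed as a string, so numeric use requires parsing. The sketch below shows one way to do that, assuming exposure times appear in forms such as `1/500` or `0.5` and that aperture and focal length may carry prefixes or units such as `f/2.8` or `50 mm`. These formats are assumptions, not documented properties of the dataset, and the helper names are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional


def parse_exposure_time(value: Optional[str]) -> Optional[float]:
    """Parse strings such as '1/500', '1/500 sec', or '0.5' into seconds."""
    if not value:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:/\s*(\d+(?:\.\d+)?))?", value)
    if match is None:
        return None
    numerator = Fraction(match.group(1))
    denominator = Fraction(match.group(2)) if match.group(2) else Fraction(1)
    if denominator == 0:
        return None
    return float(numerator / denominator)


def parse_number(value: Optional[str]) -> Optional[float]:
    """Extract the first decimal number from strings like 'f/2.8' or '50 mm'."""
    if not value:
        return None
    match = re.search(r"\d+(?:\.\d+)?", value)
    return float(match.group(0)) if match else None


# Hypothetical record; real values should be inspected before relying on this.
sample = {"exif_exposure_time": "1/500", "exif_f_number": "f/2.8", "exif_focal_length": "50 mm"}
print(parse_exposure_time(sample["exif_exposure_time"]))
print(parse_number(sample["exif_f_number"]))
print(parse_number(sample["exif_focal_length"]))
```

Returning `None` for missing or unparseable values keeps the helpers safe to map over the whole dataset, since not every image is guaranteed to carry complete EXIF data.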
## Commercial Dataset Access & On-Demand Licensing
While the DSD is being released under an open source license, it represents only a small fraction of the broader commercial capabilities of the GuruShots ecosystem.
DataSeeds.AI operates a live, ongoing photography catalog that has amassed over 100 million images, sourced from both amateur and professional photographers participating in thousands of themed challenges across diverse geographic and stylistic contexts. Unlike most public datasets, this corpus is:
* Fully licensed for downstream use in AI training
* Backed by structured consent frameworks and traceable rights, with active opt-in from creators
* Rich in EXIF metadata, including camera model, lens type, and occasionally location data
* Curated through a built-in human preference signal based on competitive ranking, yielding rare insight into subjective aesthetic quality
### On-Demand Dataset Creation
DataSeeds.AI can also source new image datasets to spec via a just-in-time, first-party data acquisition engine. Clients (e.g., AI labs, model developers, media companies) can request:
* Specific content themes (e.g., "urban decay at dusk," "elderly people with dogs in snowy environments")
* Defined technical attributes (camera type, exposure time, geographic constraints)
* Ethical/region-specific filtering (e.g., GDPR-compliant imagery, no identifiable faces, kosher food imagery)
* Matching segmentation masks, EXIF metadata, and tiered annotations
Within days, the DataSeeds.AI platform can launch curated challenges to its global network of contributors and deliver targeted datasets with commercial-grade licensing terms.
### Sales Inquiries
To inquire about licensing or customized dataset sourcing, contact:
**[sales@dataseeds.ai](mailto:sales@dataseeds.ai)**
## License & Citation
**License**: Apache 2.0
**For commercial licenses, annotation, or access to the full 100M+ image catalog with on-demand annotations**: [sales@dataseeds.ai](mailto:sales@dataseeds.ai)
### Citation
If you find the data useful, please cite:
```bibtex
@article{abdoli2025peerranked,
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
journal={arXiv preprint arXiv:2506.05673},
year={2025},
}
```
Provider: maas
Created: 2025-06-15



