jiafr1802/SpotSFT-200k
Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jiafr1802/SpotSFT-200k
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- visual-question-answering
- text-generation
- image-classification
language:
- en
tags:
- geo-localization
- sft
- multimodal
- visual-qa
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# SpotSFT-200k: Visual QA Dataset for Geo-localization Alignment
<div align="center">
**[Project Page](https://jiafr1802.github.io/SpotAgent-Paper/)**
</div>
## Dataset Description
**SpotSFT-200k** is a large-scale multimodal instruction-tuning dataset comprising approximately **200,000** image-text pairs. It is designed for the **Supervised Fine-Tuning (SFT)** stage of the **SpotAgent** framework (Stage 1).
Unlike the subsequent `SpotAgenticCoT` dataset, which focuses on complex tool use and reasoning, **SpotSFT-200k** aims to:
1. **Inject Basic World Knowledge:** Align the Large Vision-Language Model (LVLM) with broad geographical concepts using a massive amount of real-world imagery.
2. **Format Alignment:** Teach the model to adhere to the specific Geo-localization output format (Country, City, Latitude, Longitude) required for downstream tasks.
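
The output format named above can be checked mechanically. The following is a minimal sketch of a parser for `<answer>` strings, assuming every answer follows the exact field order and punctuation of the sample shown in this card. The names `GeoAnswer` and `parse_answer` are hypothetical and not part of the dataset or the SpotAgent code.

```python
import re
from typing import NamedTuple, Optional


class GeoAnswer(NamedTuple):
    """Hypothetical container for one parsed answer."""
    country: str
    city: str
    latitude: float
    longitude: float


# Pattern inferred from the single example in this card (assumption):
# real samples may use different spacing or omit fields.
_ANSWER_RE = re.compile(
    r"<answer>\s*Country:\s*(?P<country>.+?),\s*City:\s*(?P<city>.+?),\s*"
    r"Latitude:\s*(?P<lat>-?\d+(?:\.\d+)?),\s*"
    r"Longitude:\s*(?P<lon>-?\d+(?:\.\d+)?)\s*</answer>",
    re.DOTALL,
)


def parse_answer(text: str) -> Optional[GeoAnswer]:
    """Return the parsed answer, or None if the text does not match."""
    match = _ANSWER_RE.search(text)
    if match is None:
        return None
    lat, lon = float(match["lat"]), float(match["lon"])
    # Reject coordinates outside the valid geographic range.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return GeoAnswer(match["country"].strip(), match["city"].strip(), lat, lon)
```

A parser like this could be used to measure how often a fine-tuned model emits well-formed answers before any distance-based evaluation is attempted.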
### Key Features
* **Large-Scale Coverage:** Contains ~200k samples, a random 5% subset of the **MP16-Pro** dataset, ensuring a diverse distribution of global locations (natural landscapes, urban settings, landmarks).
* **High-Quality Metadata:** Derived from MP16-Pro, which filters out samples with ambiguous or incomplete metadata while retaining hierarchical textual descriptions (Continent, Country, Region, City).
* **Visual QA Format:** Formatted as standard multimodal conversation data, transforming raw image-coordinate pairs into an instruction-following task.
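
To illustrate the transformation from a raw image-coordinate pair into an instruction-following sample, here is a minimal sketch. The prompt text is copied from the example in this card. The function names, the argument list, and the four-decimal coordinate formatting are assumptions for illustration, not the authors' actual preprocessing code.

```python
from typing import Any, Dict

# Prompt copied from the example sample in this card.
PROMPT = (
    "You are a helpful assistant. Your task is to determine the geographic "
    "location of an image through systematic visual analysis.\n<image>\n"
    "Provide the final answer inside <answer> ... </answer>."
)


def format_answer(country: str, city: str, lat: float, lon: float) -> str:
    """Render metadata in the card's answer format (4 decimals is an assumption)."""
    return (
        f"<answer> Country: {country}, City: {city}, "
        f"Latitude: {lat:.4f}, Longitude: {lon:.4f} </answer>"
    )


def build_record(sample_id: str, image: Any, country: str, city: str,
                 lat: float, lon: float) -> Dict[str, Any]:
    """Wrap one image and its metadata as a two-turn conversation."""
    return {
        "id": sample_id,
        "image": image,
        "conversations": [
            {"from": "user", "value": PROMPT},
            {"from": "assistant", "value": format_answer(country, city, lat, lon)},
        ],
    }
```

Calling `build_record("sample_12345", "<image_object>", "France", "Paris", 48.8566, 2.3522)` yields a record with the same layout as the example sample in this card.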
## Dataset Structure
The dataset follows a standard conversation format compatible with models like Qwen-VL or LLaVA.
### Data Fields
- `id`: Unique identifier for the sample.
- `image`: The input query image.
- `conversations`: A list of messages between "user" and "assistant".
  - **User:** Contains the prompt asking for geo-localization.
  - **Assistant:** The ground-truth location formatted as a structured answer.
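
The fields listed above can be sanity-checked before training. The sketch below assumes each message is a dict with `from` and `value` keys, as in the example sample, and that turns alternate starting with `user`. The alternation rule and the `validate_record` helper are assumptions, not a documented guarantee of the dataset.

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = [
        f"missing field: {key}"
        for key in ("id", "image", "conversations")
        if key not in record
    ]
    if "conversations" not in record:
        return problems

    conversations = record["conversations"]
    if not isinstance(conversations, list) or not conversations:
        problems.append("conversations must be a non-empty list")
        return problems

    for index, message in enumerate(conversations):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not a dict")
            continue
        # Assumed convention: user speaks first, then roles alternate.
        expected = "user" if index % 2 == 0 else "assistant"
        if message.get("from") != expected:
            problems.append(
                f"message {index}: expected {expected!r}, got {message.get('from')!r}"
            )
        if not isinstance(message.get("value"), str):
            problems.append(f"message {index}: value must be a string")
    return problems
```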
### Example Sample
*Note: This stage focuses on direct answers or simple reasoning to establish the output format.*
```json
{
  "id": "sample_12345",
  "image": "<image_object>",
  "conversations": [
    {
      "from": "user",
      "value": "You are a helpful assistant. Your task is to determine the geographic location of an image through systematic visual analysis.\n<image>\nProvide the final answer inside <answer> ... </answer>."
    },
    {
      "from": "assistant",
      "value": "<answer> Country: France, City: Paris, Latitude: 48.8566, Longitude: 2.3522 </answer>"
    }
  ]
}
```
Provider:
jiafr1802



