
jiafr1802/SpotSFT-200k

Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jiafr1802/SpotSFT-200k
Description:
---
license: cc-by-nc-4.0
task_categories:
- visual-question-answering
- text-generation
- image-classification
language:
- en
tags:
- geo-localization
- sft
- multimodal
- visual-qa
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# SpotSFT-200k: Visual QA Dataset for Geo-localization Alignment

<div align="center">

**[Project Page](https://jiafr1802.github.io/SpotAgent-Paper/)**

</div>

## Dataset Description

**SpotSFT-200k** is a large-scale multimodal instruction-tuning dataset of approximately **200,000** image-text pairs. It is designed for the **Supervised Fine-Tuning (SFT)** stage (Stage 1) of the **SpotAgent** framework.

Unlike the subsequent `SpotAgenticCoT` dataset, which focuses on complex tool use and reasoning, **SpotSFT-200k** aims to:

1. **Inject Basic World Knowledge:** Align the Large Vision-Language Model (LVLM) with broad geographical concepts using a large volume of real-world imagery.
2. **Format Alignment:** Teach the model to follow the specific geo-localization output format (Country, City, Latitude, Longitude) required for downstream tasks.

### Key Features

* **Large-Scale Coverage:** Contains ~200k samples drawn at random from the **MP16-Pro** dataset (a 5% sample), covering a diverse distribution of global locations (natural landscapes, urban settings, landmarks).
* **High-Quality Metadata:** Derived from MP16-Pro, which filters out samples with ambiguous or incomplete metadata while retaining hierarchical textual descriptions (Continent, Country, Region, City).
* **Visual QA Format:** Formatted as standard multimodal conversation data, turning raw image-coordinate pairs into an instruction-following task.

## Dataset Structure

The dataset follows a standard conversation format compatible with models such as Qwen-VL or LLaVA.

### Data Fields

- `id`: Unique identifier for the sample.
- `image`: The input query image.
- `conversations`: A list of messages between "user" and "assistant".
  - **User:** Contains the prompt asking for geo-localization.
  - **Assistant:** The ground-truth location formatted as a structured answer.

### Example Sample

*Note: This stage focuses on direct answers or simple reasoning to establish the output format.*

```json
{
  "id": "sample_12345",
  "image": "<image_object>",
  "conversations": [
    {
      "from": "user",
      "value": "You are a helpful assistant. Your task is to determine the geographic location of an image through systematic visual analysis.\n<image>\nProvide the final answer inside <answer> ... </answer>."
    },
    {
      "from": "assistant",
      "value": "<answer> Country: France, City: Paris, Latitude: 48.8566, Longitude: 2.3522 </answer>"
    }
  ]
}
```
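The example sample pairs a fixed user prompt with a single-line `<answer>` reply. The sketch below shows how a raw image-coordinate record could be turned into that conversation shape, and how a model reply could be parsed back into structured fields for evaluation. It assumes every answer follows exactly the layout in the sample: comma-separated `Country`, `City`, `Latitude`, `Longitude`, with coordinates written to four decimal places. The helper names `build_sample` and `parse_answer` and the regular expression are hypothetical illustrations, not part of the dataset's released tooling.

```python
import json
import re
from typing import Any, Dict

# Prompt text copied verbatim from the example sample in this card.
PROMPT = (
    "You are a helpful assistant. Your task is to determine the geographic "
    "location of an image through systematic visual analysis.\n<image>\n"
    "Provide the final answer inside <answer> ... </answer>."
)

# Hypothetical pattern for the answer layout shown in the example sample.
# Assumption: place names contain no commas.
ANSWER_RE = re.compile(
    r"<answer>\s*Country:\s*(?P<country>[^,]+?),\s*City:\s*(?P<city>[^,]+?),\s*"
    r"Latitude:\s*(?P<lat>-?\d+(?:\.\d+)?),\s*"
    r"Longitude:\s*(?P<lon>-?\d+(?:\.\d+)?)\s*</answer>"
)


def build_sample(sample_id: str, image_ref: str, country: str, city: str,
                 lat: float, lon: float) -> Dict[str, Any]:
    """Turn one image-coordinate record into a user/assistant conversation."""
    # Assumption: coordinates are rendered with four decimals, as in the sample.
    answer = (f"<answer> Country: {country}, City: {city}, "
              f"Latitude: {lat:.4f}, Longitude: {lon:.4f} </answer>")
    return {
        "id": sample_id,
        "image": image_ref,
        "conversations": [
            {"from": "user", "value": PROMPT},
            {"from": "assistant", "value": answer},
        ],
    }


def parse_answer(text: str) -> Dict[str, Any]:
    """Extract country, city, latitude and longitude; raise if the format differs."""
    m = ANSWER_RE.search(text)
    if m is None:
        raise ValueError(f"unrecognized answer format: {text!r}")
    return {
        "country": m.group("country").strip(),
        "city": m.group("city").strip(),
        "latitude": float(m.group("lat")),
        "longitude": float(m.group("lon")),
    }


if __name__ == "__main__":
    sample = build_sample("sample_12345", "<image_object>",
                          "France", "Paris", 48.8566, 2.3522)
    print(json.dumps(sample, indent=2))
    print(parse_answer(sample["conversations"][1]["value"]))
```

Place names that themselves contain commas (for example "Washington, D.C.") would break this pattern. A stricter parser would need to know how the dataset formats such names, and the card does not specify this.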
Provider:
jiafr1802