PHY041/sc4021-travel-opinion-search
Hugging Face, updated 2026-04-07, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/PHY041/sc4021-travel-opinion-search
Description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
- feature-extraction
language:
- en
- zh
tags:
- opinion-mining
- sentiment-analysis
- travel
- instagram
- pinterest
- information-retrieval
size_categories:
- 100K<n<1M
---
# SC4021 Travel Opinion Search Engine Dataset
Crawled social media data for the NTU SC4021 Information Retrieval course project — an opinion-aware travel search engine for China travel content.
## Dataset Structure
| Split | Rows | Description |
|-------|------|-------------|
| `ig_posts` | ~105K | Instagram posts with captions, locations, image metadata, quality scores, and VLM classifications |
| `ig_comments` | ~117K | Instagram comments linked to posts |
| `ig_users` | ~121K | Instagram user profiles (username, bio, followers) |
| `pinterest_pins` | 100K+ | Pinterest pins with image URLs, descriptions, and quality scores |
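
The table says comments are linked to posts but does not name the linking column. As a minimal sketch, assuming each comment row carries a hypothetical `post_id` field holding the post's identifier, comments could be grouped per post as shown below. The sample rows are invented for illustration, and the real column name should be checked against the `ig_comments` schema before relying on this.

```python
from collections import defaultdict


def group_comments_by_post(comments, key="post_id"):
    """Map each post identifier to the list of comment rows that reference it.

    `key` is an ASSUMED column name; the dataset card does not specify it.
    """
    grouped = defaultdict(list)
    for comment in comments:
        post_key = comment.get(key)
        # Comments without a post reference cannot be linked, so skip them.
        if post_key is not None:
            grouped[post_key].append(comment)
    return dict(grouped)


# Invented sample rows; field names other than the grouping key are hypothetical.
sample_comments = [
    {"post_id": "a1", "text": "Beautiful view!"},
    {"post_id": "a1", "text": "Where is this?"},
    {"post_id": "b2", "text": "Looks delicious"},
    {"post_id": None, "text": "orphan comment"},
]

comments_by_post = group_comments_by_post(sample_comments)
print({post: len(rows) for post, rows in comments_by_post.items()})
```

Grouping once into a dictionary avoids rescanning the full comment list for every post, which matters at the scale of roughly 117K comments.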
## Key Fields (ig_posts)
| Field | Type | Description |
|-------|------|-------------|
| `id` | str | Instagram shortcode |
| `caption` / `caption_clean` | str | Original and cleaned caption text |
| `language` | str | Detected language (en/zh) |
| `province` / `city` | str | Mapped China location |
| `image_category` | str | CLIP zero-shot category (landscape, food, culture, etc.) |
| `quality_score` | float | VLM quality assessment |
| `image_description` | str | VLM-generated image description |
| `likes` / `comments_count` | int | Engagement metrics |
| `location_lat` / `location_lng` | float | Geolocation |
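
As a rough illustration of how these fields can be combined, the sketch below filters post records by `language` and `quality_score` and counts `image_category` values. It works on plain dictionaries so it runs without downloading anything. The sample rows are invented, and the 0.5 threshold is an arbitrary assumption, since the card does not state the range of `quality_score`.

```python
from collections import Counter


def filter_posts(rows, language, min_quality):
    """Keep rows in the given language whose quality_score is at least min_quality."""
    kept = []
    for row in rows:
        score = row.get("quality_score")
        # Skip rows with a missing score rather than guessing a default.
        if score is None:
            continue
        if row.get("language") == language and score >= min_quality:
            kept.append(row)
    return kept


def category_counts(rows):
    """Count rows per image_category, ignoring rows without one."""
    return Counter(r["image_category"] for r in rows if r.get("image_category"))


# Invented sample rows that mimic the ig_posts fields listed above.
sample_posts = [
    {"id": "a1", "language": "en", "quality_score": 0.9, "image_category": "landscape"},
    {"id": "b2", "language": "zh", "quality_score": 0.8, "image_category": "food"},
    {"id": "c3", "language": "en", "quality_score": 0.2, "image_category": "food"},
    {"id": "d4", "language": "en", "quality_score": None, "image_category": "culture"},
]

# 0.5 is an assumed threshold, not a value documented by the dataset.
english_good = filter_posts(sample_posts, language="en", min_quality=0.5)
print([p["id"] for p in english_good])
print(category_counts(sample_posts))
```

On the loaded dataset, an equivalent predicate could be passed to the `filter` method of the `ig_posts` split instead of looping in Python.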
## Sources
- **Instagram**: Travel hashtags (#chinatravel, #travelchina, etc.) and brand accounts
- **Pinterest**: Travel photography, fashion, and destination boards
## Usage
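```python
from datasets import load_dataset

# Download (or reuse the cached copy of) the dataset from the Hugging Face Hub
ds = load_dataset("PHY041/sc4021-travel-opinion-search")

# Access individual splits by the names listed under "Dataset Structure"
posts = ds["ig_posts"]
comments = ds["ig_comments"]
```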
## Citation
NTU SC4021 Information Retrieval Project, AY2025/26 Semester 2.
## License
CC BY-NC 4.0. Intended for academic and research use only.
Provider:
PHY041



