honvoh-school/youtube-tourism-sentiment
Source: Hugging Face · updated 2026-04-19 · indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/honvoh-school/youtube-tourism-sentiment
Dataset card:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
- fr
- tr
- es
- de
- pt
tags:
- youtube
- sentiment
- tourism
- multilingual
- africa
- latam
pretty_name: YouTube Tourism Sentiment (multilingual)
size_categories:
- 10K<n<100K
---
# YouTube Tourism Sentiment (multilingual)
YouTube comments from tourism-related videos across Africa, Latin America, and Turkey, cleaned and pre-labeled with `cardiffnlp/twitter-xlm-roberta-base-sentiment`. Built for a school LLM project (Farel Honvoh / Elif Oksuzali).
## Contents
- **comments.parquet** — one row per comment, joined with video and channel metadata
- **videos.parquet** — video-level metadata with `algorithm_score = view_count / subscriber_count`
## v2 snapshot
- **20,650 comments** across **26 videos** from **18 channels**
- Scope: tourism content in Africa (Benin, Egypt, Nigeria, etc.), Latin America (Colombia, Brazil, Venezuela, Peru), and Turkey
- Labels generated by a multilingual pre-trained model (no human labeling)
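As a rough illustration of how a label and its confidence can be derived from a three-class classifier, the sketch below applies a softmax to raw logits and keeps the highest-probability class. The label order in `LABELS` and the helper names are assumptions made for this example, not taken from the pipeline; check the model's `id2label` config before relying on the order.

```python
import math

# Assumed class order for illustration; verify against the model's id2label.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, labels=LABELS):
    """Return (label, confidence), where confidence is the max softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

Under these assumptions, the second element of the returned tuple corresponds to what the card calls the max softmax confidence.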
## Schema — comments.parquet
| Column | Type | Description |
|---|---|---|
| comment_id | str | Truncated SHA1 of (video_id + author + text[:100]) |
| video_id | str | YouTube video ID |
| channel_id | str | Channel ID |
| channel_name | str | Channel name |
| video_title | str | Video title |
| algorithm_score | float | views / subscribers (proxy for video over-performance) |
| author | str | Comment author username |
| text_raw | str | Original text (emojis preserved) |
| text_clean | str | Cleaned text (URLs and mentions stripped, abbreviations expanded) |
| lang | str | ISO language code (langdetect) |
| likes | int | Like count |
| reply_count | int | Reply count |
| published_at | str | Publication date |
| is_reply | int | 1 if reply to another comment, 0 otherwise |
| sentiment_label | str | positive / neutral / negative |
| sentiment_score | float | Max softmax confidence |
| themes | str | Detected themes (CSV): hospitality, landscape, culture, food, safety, price, infrastructure |
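The `comment_id` recipe in the table can be sketched as follows. This is a minimal illustration only: the plain concatenation without a separator, the UTF-8 encoding, and the 16-character truncation (`length`) are assumptions, since the card does not state them.

```python
import hashlib


def make_comment_id(video_id, author, text, length=16):
    """Truncated SHA1 of video_id + author + the first 100 characters of text.

    `length` is a hypothetical truncation; the card does not give the real one.
    """
    key = f"{video_id}{author}{text[:100]}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:length]
```

Because only the first 100 characters of the text enter the hash, two comments by the same author on the same video that share that prefix would receive the same ID under this scheme.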
## Schema — videos.parquet
| Column | Type | Description |
|---|---|---|
| video_id | str | YouTube video ID |
| channel_id | str | Channel ID |
| channel_name | str | Channel name |
| subscriber_count | int | Channel subscriber count at scrape time |
| title | str | Video title |
| description | str | First 300 chars of video description |
| published_at | str | Publication date |
| view_count | int | Views at scrape time |
| like_count | int | Likes at scrape time |
| comment_count | int | Comment count at scrape time |
| algorithm_score | float | view_count / subscriber_count |
| url | str | YouTube URL |
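The `algorithm_score` ratio is simple enough to recompute as a sanity check. A minimal sketch, assuming a channel with zero (or hidden) subscribers should yield no score rather than a division error; how the original pipeline handles that case is not stated.

```python
def algorithm_score(view_count, subscriber_count):
    """view_count / subscriber_count, or None when the ratio is undefined."""
    if subscriber_count <= 0:
        # Assumption: treat missing or zero subscribers as "no score".
        return None
    return view_count / subscriber_count
```

For example, a video with 50,000 views on a channel with 10,000 subscribers scores 5.0, meaning it drew five times as many views as the channel has subscribers.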
## v2 distribution
- **Sentiment**: 11,141 positive / 5,639 negative / 3,870 neutral
- **Dominant languages**: EN, FR, PT, ES, TR — see `comments.parquet` for the full breakdown
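The sentiment counts above sum to the 20,650-comment total; converting them to percentages makes the class imbalance easier to see. The snippet uses only the figures quoted in this card.

```python
# Counts as reported in the v2 distribution.
counts = {"positive": 11141, "negative": 5639, "neutral": 3870}

total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
```

This gives roughly 54% positive, 27% negative, and 19% neutral, which is worth keeping in mind when splitting the data or choosing evaluation metrics.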
## Regions covered
| Region | Videos | Approx. comments |
|---|---|---|
| Benin (v1) | 11 | ~6,000 |
| Africa (Egypt, Nigeria, Morocco, etc.) | 7 | ~5,500 |
| Latin America (Colombia, Brazil, Venezuela, Peru) | 6 | ~5,500 |
| Turkey | 2 | ~1,800 |
## Usage
```python
import pandas as pd
comments = pd.read_parquet("comments.parquet")
videos = pd.read_parquet("videos.parquet")
# Example: positive comments about food
food_pos = comments[
    (comments.sentiment_label == "positive")
    & (comments.themes.str.contains("food", na=False))
]
```
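Since `themes` is stored as a comma-separated string, exact-token matching can be a safer alternative to substring search if the theme vocabulary ever grows. The sketch below assumes a plain comma separator without quoting and an empty or missing value for comments with no detected theme; both helper names are hypothetical.

```python
from collections import Counter


def split_themes(value):
    """Split a CSV theme string into a list of clean theme names."""
    if not value:
        return []
    return [t.strip() for t in value.split(",") if t.strip()]


def theme_counts(theme_values):
    """Count how often each theme occurs across an iterable of CSV strings."""
    counter = Counter()
    for value in theme_values:
        counter.update(split_themes(value))
    return counter
```

With pandas, something like `theme_counts(comments.themes.fillna(""))` would give the per-theme frequencies for the whole table.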
From the Hugging Face Hub:
```python
from huggingface_hub import hf_hub_download
import pandas as pd
path = hf_hub_download(
    repo_id="honvoh-school/youtube-tourism-sentiment",
    filename="comments.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
```
## Preprocessing applied
- Unicode NFKC normalization
- URL, @mention, and zero-width character stripping
- Abbreviation expansion (btw → by the way, omg → oh my god, etc.)
- **Emojis kept** (sentiment signal)
- Language detection via `langdetect`
- Deduplication on same-author + same-text (copy-paste spam)
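The steps above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the project's pipeline: the regular expressions, the two-entry abbreviation table, and the exact set of zero-width characters are all illustrative. The zero-width joiner (U+200D) is deliberately left in place here so that composite emojis survive, which is also an assumption.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@[\w.\-]+")
# Zero-width space, non-joiner, word joiner, BOM. U+200D is kept for emoji sequences.
ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u2060\ufeff]")
# Illustrative subset; the real abbreviation table is not published in the card.
ABBREVIATIONS = {"btw": "by the way", "omg": "oh my god"}
ABBR_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b", re.IGNORECASE
)


def clean_text(text):
    """NFKC-normalize, strip URLs/mentions/zero-width chars, expand abbreviations."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = ZERO_WIDTH_RE.sub("", text)
    text = ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)
    return " ".join(text.split())  # collapse leftover whitespace


def dedupe(rows):
    """Keep the first row for each (author, text) pair."""
    seen, kept = set(), []
    for row in rows:
        key = (row["author"], row["text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Emojis pass through `clean_text` untouched, in line with the card's choice to keep them as a sentiment signal.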
## Source
Scraped via the official **YouTube Data API v3** (`commentThreads.list` + `videos.list` + `channels.list`), which is free within the default quota of 10,000 units/day. v2 switched from Apify to this API to stay within budget.
Pipeline code: https://github.com/lifestyleentrepreneur/youtube-tourism-sentiment (private)
## Version history
- **v1** (2026-04-17): 6,355 comments on 11 Benin tourism videos
- **v2** (2026-04-19): +14,295 comments from Africa / LatAm / Turkey videos → 20,650 total
## License
Comments are original user-generated content on YouTube and remain the property of their authors. This dataset is a derived work provided for **non-commercial academic research only** (CC BY-NC 4.0). Do not redistribute or use for commercial purposes.
Provider:
honvoh-school



