
honvoh-school/youtube-tourism-sentiment

Hugging Face · updated 2026-04-19 · indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/honvoh-school/youtube-tourism-sentiment
Description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
- fr
- tr
- es
- de
- pt
tags:
- youtube
- sentiment
- tourism
- multilingual
- africa
- latam
pretty_name: YouTube Tourism Sentiment (multilingual)
size_categories:
- 10K<n<100K
---

# YouTube Tourism Sentiment (multilingual)

YouTube comments from tourism-related videos across Africa, Latin America, and Turkey, cleaned and pre-labeled with `cardiffnlp/twitter-xlm-roberta-base-sentiment`. Built for a school LLM project (Farel Honvoh / Elif Oksuzali).

## Contents

- **comments.parquet**: one row per comment, joined with video and channel metadata
- **videos.parquet**: video-level metadata with `algorithm_score = view_count / subscriber_count`

## v2 snapshot

- **20,650 comments** across **26 videos** from **18 channels**
- Scope: tourism content in Africa (Benin, Egypt, Nigeria, etc.), Latin America (Colombia, Brazil, Venezuela, Peru), and Turkey
- Labels generated by a multilingual pre-trained model (no human labeling)

## Schema: comments.parquet

| Column | Type | Description |
|---|---|---|
| comment_id | str | Truncated SHA1 of (video_id + author + text[:100]) |
| video_id | str | YouTube video ID |
| channel_id | str | Channel ID |
| channel_name | str | Channel name |
| video_title | str | Video title |
| algorithm_score | float | views / subscribers (proxy for video over-performance) |
| author | str | Comment author username |
| text_raw | str | Original text (emojis preserved) |
| text_clean | str | Cleaned text (URLs and mentions stripped, abbreviations expanded) |
| lang | str | ISO language code (langdetect) |
| likes | int | Like count |
| reply_count | int | Reply count |
| published_at | str | Publication date |
| is_reply | int | 1 if reply to another comment, 0 otherwise |
| sentiment_label | str | positive / neutral / negative |
| sentiment_score | float | Max softmax confidence |
| themes | str | Detected themes (CSV): hospitality, landscape, culture, food, safety, price, infrastructure |

## Schema: videos.parquet

| Column | Type | Description |
|---|---|---|
| video_id | str | YouTube video ID |
| channel_id | str | Channel ID |
| channel_name | str | Channel name |
| subscriber_count | int | Channel subscriber count at scrape time |
| title | str | Video title |
| description | str | First 300 chars of video description |
| published_at | str | Publication date |
| view_count | int | Views at scrape time |
| like_count | int | Likes at scrape time |
| comment_count | int | Comment count at scrape time |
| algorithm_score | float | view_count / subscriber_count |
| url | str | YouTube URL |

## v2 distribution

- **Sentiment**: 11,141 positive / 5,639 negative / 3,870 neutral
- **Dominant languages**: EN, FR, PT, ES, TR (see `comments.parquet` for the full breakdown)

## Regions covered

| Region | Videos | Approx. comments |
|---|---|---|
| Benin (v1) | 11 | ~6,000 |
| Africa (Egypt, Nigeria, Morocco, etc.) | 7 | ~5,500 |
| Latin America (Colombia, Brazil, Venezuela, Peru) | 6 | ~5,500 |
| Turkey | 2 | ~1,800 |

## Usage

```python
import pandas as pd

comments = pd.read_parquet("comments.parquet")
videos = pd.read_parquet("videos.parquet")

# Example: positive comments about food
food_pos = comments[
    (comments.sentiment_label == "positive")
    & (comments.themes.str.contains("food", na=False))
]
```

From the Hugging Face Hub:

```python
from huggingface_hub import hf_hub_download
import pandas as pd

path = hf_hub_download(
    repo_id="honvoh-school/youtube-tourism-sentiment",
    filename="comments.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
```

## Preprocessing applied

- Unicode NFKC normalization
- URL, @mention, and zero-width character stripping
- Abbreviation expansion (btw → by the way, omg → oh my god, etc.)
- **Emojis kept** (sentiment signal)
- Language detection via `langdetect`
- Deduplication on same-author + same-text (copy-paste spam)

## Source

Scraped via the official **YouTube Data API v3** (`commentThreads.list` + `videos.list` + `channels.list`). Free within the default 10,000 units/day quota. Switched from Apify in v2 to stay within budget.

Pipeline code: https://github.com/lifestyleentrepreneur/youtube-tourism-sentiment (private)

## Version history

- **v1** (2026-04-17): 6,355 comments on 11 Benin tourism videos
- **v2** (2026-04-19): +14,295 comments from Africa / LatAm / Turkey videos → 20,650 total

## License

Comments are original user-generated content on YouTube and remain the property of their authors. This dataset is a derived work provided for **non-commercial academic research only** (CC BY-NC 4.0). Do not redistribute or use for commercial purposes.
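Because the pipeline repository is private, the text-cleaning steps listed under "Preprocessing applied" can only be approximated. The sketch below is one plausible reading of those steps, not the authors' code: the function name `clean_text`, the regular expressions, the order of operations, and the two-entry abbreviation table are all assumptions (only `btw` and `omg` are named in the card). It may be useful for cleaning new comments in a roughly comparable way before comparing them against `text_clean`.

```python
import re
import unicodedata

# Hypothetical abbreviation table: the card names only these two entries.
ABBREVIATIONS = {"btw": "by the way", "omg": "oh my god"}

# Assumed patterns; the real pipeline may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# U+200D (zero-width joiner) is deliberately left alone here because it
# glues multi-part emoji together and the card says emojis are kept.
ZERO_WIDTH_RE = re.compile(r"[\u200b\u200c\u2060\ufeff]")
ABBREV_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def clean_text(text: str) -> str:
    """Approximate the card's cleaning steps on a single comment."""
    text = unicodedata.normalize("NFKC", text)   # Unicode NFKC normalization
    text = URL_RE.sub("", text)                  # strip URLs
    text = MENTION_RE.sub("", text)              # strip @mentions
    text = ZERO_WIDTH_RE.sub("", text)           # strip zero-width characters
    # Expand known abbreviations, matching whole words case-insensitively.
    text = ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)
    # Collapse the gaps left behind by the removals above.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("OMG check https://example.com @bob so fine\u200b 😍"))
```

Emojis pass through untouched because none of the patterns match them. Language detection and same-author, same-text deduplication are left out of the sketch, since they operate on the whole table rather than on a single string.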
Provider:
honvoh-school