l3afai/youtube-thumbnails

Name: l3afai/youtube-thumbnails
Creator: l3afai
Published: 2026-03-26 13:43:28
License: No description available

Source: Hugging Face. Updated: 2026-03-26. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/l3afai/youtube-thumbnails



Description:

---
language:
- en
tags:
- youtube
- thumbnails
- image-text
- multimodal
- text-to-image
- image-to-text
- captioning
- weak-supervision
- large-scale
- computer-vision
- nlp
- vision-language
- clip-training
- diffusion
- generative-models
- image-generation
- thumbnail-generation
- social-media
- content-creation
- visual-design
- high-contrast
- faces
- expressions
- memes
- clickbait
- marketing
- advertising
- attention-modeling
- representation-learning
- embedding
- retrieval
- search
- ranking
- dataset-creation
- public-data
- self-supervised
- weak-labels
- noisy-labels
- english
- filtered
- deduplicated
- large-dataset
- research
- experimental
- open-data
- vision
- multimodal-learning
- image-dataset
- text-dataset
pretty_name: Youtube Thumbnails
task_categories:
- text-to-image
- image-to-text
- feature-extraction
size_categories:
- 100K<n<1M
license: other
---

# YouTube Thumbnails Dataset

## Dataset Details

### Dataset Description

This dataset contains approximately **164,000 YouTube thumbnails** paired with their corresponding video titles.

The dataset was constructed by collecting public YouTube channel feeds, extracting video metadata, filtering and deduplicating entries, and downloading thumbnail images at scale.

The goal of this dataset is to support research and experimentation in:

- Image generation (e.g. diffusion models)
- Multimodal learning (e.g. CLIP-style models)
- Thumbnail generation and optimization
- Image-text representation learning

---

- **Curated by:** l3af (Discord: l3afai)
- **Language(s):** English (filtered using language detection)
- **License:** Derived from publicly available YouTube data. Users are responsible for complying with YouTube's Terms of Service.

---

## Dataset Sources

- **Source:** Public YouTube RSS feeds (`videos.xml`)
- **Images:** YouTube thumbnail CDN (`i.ytimg.com`)
- **Metadata:** Video titles and IDs

---

## Uses

### Direct Use

This dataset is suitable for:

- Training image generation models (especially thumbnail-style generation)
- Training multimodal embedding models (e.g. CLIP)
- Studying social-media visual patterns
- Thumbnail generation or ranking systems

---

### Out-of-Scope Use

This dataset is **not recommended for**:

- High-quality caption-to-image generation (titles are not descriptive captions)
- Tasks requiring precise semantic grounding
- Sensitive or safety-critical applications

---

## Dataset Structure

Each example contains:

- `video_id` (string): YouTube video identifier
- `title` (string): Video title
- `image` (image): Thumbnail image

---

## Dataset Creation

### Curation Rationale

This dataset was created to provide a large-scale collection of real-world image-text pairs with strong visual patterns, particularly useful for studying:

- Attention-grabbing design
- High-contrast visual composition
- Social media aesthetics

---

### Source Data

#### Data Collection and Processing

The dataset was created through the following pipeline:

1. Collected ~22,000 YouTube channel IDs
2. Downloaded RSS feeds (`videos.xml`)
3. Extracted video metadata
4. Filtered:
   - Removed Shorts content
   - Removed non-English titles (via language detection)
5. Deduplicated titles (exact + fuzzy)
6. Downloaded thumbnail images (max resolution when available)
7. Built dataset in multiple formats (Parquet + HF dataset)

---

#### Who are the source data producers?

The source data was originally created by YouTube content creators across a wide range of domains, including entertainment, education, gaming, and news.

---

### Annotations

No manual annotations were added.
The dataset consists solely of:

- Original thumbnails
- Original video titles

---

### Personal and Sensitive Information

- Some thumbnails may contain human faces or identifiable individuals
- Titles and images may reflect biases from content creators
- No additional personal data was intentionally collected

---

## Data Traceability

Each entry includes a `video_id` which uniquely identifies the original YouTube video.

Users can reconstruct the original source via:

https://www.youtube.com/watch?v={video_id}

This enables:

- Attribution to original creators
- Verification of data origin
- Selective filtering or removal

---

## Bias, Risks, and Limitations

- Strong bias toward YouTube-style content (faces, text overlays, high contrast)
- Titles are often:
  - Clickbait
  - Vague
  - Non-descriptive
- Images frequently contain embedded text (which models struggle to generate correctly)
- Distribution may not reflect real-world image diversity

---

### Recommendations

- Use for **style-focused tasks**, not semantic grounding
- Consider augmenting with caption datasets for better text alignment
- Filter further if targeting specific domains

---

## Citation

If you use this dataset, please cite:

```
l3afai. (2026). YouTube Thumbnails Dataset.
```

## Dataset Card Authors

- l3af (Discord: l3afai)

## License and Attribution

This dataset contains images and metadata derived from publicly available YouTube content.

- All rights to the original thumbnails belong to their respective creators.
- This dataset does not claim ownership of any images.
- Each sample includes a `video_id` which can be used to trace the original source: https://www.youtube.com/watch?v={video_id}

This dataset is provided for research and educational purposes only.

If you are a content owner and would like your data removed, please contact the dataset maintainer.
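
The `video_id` tracing described in this card can be sketched in a few lines of Python. The watch URL template is taken from the card itself. The thumbnail path pattern and the file names `maxresdefault` and `hqdefault` are commonly seen on `i.ytimg.com` but are assumptions here, since the card does not state which files were downloaded. The helper names `watch_url` and `thumbnail_candidates` are hypothetical.

```python
from __future__ import annotations

# URL template taken from this card
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"
# Assumption: common thumbnail path pattern on i.ytimg.com; the card does not
# state the exact file names that were downloaded
THUMBNAIL_URL = "https://i.ytimg.com/vi/{video_id}/{name}.jpg"


def watch_url(video_id: str) -> str:
    """Return the watch-page URL for a dataset row's video_id."""
    return WATCH_URL.format(video_id=video_id)


def thumbnail_candidates(video_id: str) -> list[str]:
    """Return candidate thumbnail URLs, highest resolution first (assumed names)."""
    names = ("maxresdefault", "hqdefault")
    return [THUMBNAIL_URL.format(video_id=video_id, name=n) for n in names]


example_id = "VIDEO_ID"  # placeholder, not a real video
print(watch_url(example_id))
for url in thumbnail_candidates(example_id):
    print(url)
```

Listing candidates in descending resolution mirrors the "max resolution when available" note in the pipeline: a downloader could try each URL in order and keep the first one that resolves.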

## Takedown Policy

If you are a rights holder and wish to have content removed from this dataset, please contact the maintainer with the relevant `video_id`(s). The content will be removed.

## Dataset Card Contact

For questions or issues, contact:

- Discord: l3afai
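
For users who keep a local copy, removal by `video_id` can be mirrored with a simple filter. This is a minimal sketch, assuming rows are plain dictionaries with the `video_id` and `title` fields listed under Dataset Structure. The function name `remove_videos` and the sample rows are made up for illustration and do not come from the dataset.

```python
from __future__ import annotations

from typing import Iterable


def remove_videos(rows: list[dict], blocked_ids: Iterable[str]) -> list[dict]:
    """Return only the rows whose video_id is not in blocked_ids."""
    blocked = set(blocked_ids)  # set lookup keeps the filter linear in len(rows)
    return [row for row in rows if row["video_id"] not in blocked]


# Made-up rows for illustration only
rows = [
    {"video_id": "aaa111", "title": "Example title one"},
    {"video_id": "bbb222", "title": "Example title two"},
]
kept = remove_videos(rows, ["bbb222"])
print([row["video_id"] for row in kept])
```

When the data is loaded with the Hugging Face `datasets` library, the same predicate can likely be passed to `Dataset.filter`, for example `ds.filter(lambda row: row["video_id"] not in blocked)`.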

Provider:

l3afai
