
Shofo/shofo-tiktok-general-small

Hugging Face · Updated 2026-02-19 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/Shofo/shofo-tiktok-general-small
Resource description:
---
task_categories:
- video-classification
- text-generation
- audio-classification
language:
- en
- es
tags:
- short-form
- video
- transcripts
- multimodal
size_categories:
- 10K<n<100K
license: other
---

# Shofo TikTok General (Small)

## Overview

**Shofo TikTok General (Small)** is a dataset containing **50,000 TikTok videos** with comprehensive metadata, transcripts, comments, and engagement metrics. This is a curated subset of Shofo's larger TikTok index, which contains hundreds of millions of indexed videos.

- **Size**: ~50K videos (~500GB)
- **Modality**: Video + Audio + Text (transcripts, comments, captions)
- **Source**: TikTok

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `file_name` | string | Relative path to video file (e.g., `videos/123.mp4`) |
| `video_id` | string | Unique TikTok video identifier |
| `web_url` | string | TikTok web URL for the video |
| `creator` | string | Creator username |
| `transcript` | string | Audio transcription (ASR-generated, may be null) |
| `description` | string | Video caption/description |
| `hashtags` | JSON array | List of hashtags used |
| `sticker_text` | JSON array | Text overlays/stickers visible in video |
| `comments` | JSON array | Top comments with metadata (see below) |
| `engagement_metrics` | JSON object | View counts, likes, shares, etc. (see below) |
| `date_posted` | timestamp | When the video was originally posted |
| `language` | JSON object | Language detection info (see below) |
| `fps` | int | Frames per second |
| `resolution` | string | Video resolution (e.g., `1080x1920`) |
| `duration_ms` | int | Video duration in milliseconds |
| `is_ai_generated` | bool | Whether the video was labeled as AI-generated |
| `is_ad` | bool | Whether the video is an advertisement |

### Engagement Metrics Structure

```json
{
  "play_count": 8948070,
  "like_count": 789584,
  "comment_count": 1451,
  "share_count": 38604,
  "collect_count": 126905,
  "repost_count": 0,
  "download_count": 235172,
  "whatsapp_share_count": 15737
}
```

### Comments Structure

Each comment in the `comments` array contains:

```json
{
  "cid": "7352452026457342726",
  "text": "Comment text here",
  "create_time": 1711876158,
  "like_count": 885,
  "reply_count": 9,
  "username": "commenter_username",
  "user_region": "MX",
  "language": "es"
}
```

### Language Structure

```json
{
  "desc_language": "es",
  "sticker_language": "en",
  "region": "US",
  "author_region": "US",
  "original_audio_language": null
}
```

## Collection Methodology

Videos were collected through Shofo's TikTok indexing pipeline:

1. **Discovery**: Creators and hashtags are discovered through an explore/exploit strategy, snowballing from seed accounts
2. **Indexing**: Video metadata is fetched via TikTok's API
3. **Transcription**: Audio is transcribed using automatic speech recognition (ASR)
4. **Deduplication**: Videos are deduplicated using Redis-based ID tracking

This subset represents a curated sample from the larger index, selected for data quality and diversity.
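
The deduplication step in the methodology above amounts to tracking which video IDs have already been seen. The dataset card does not describe Shofo's actual implementation, so the following is only a minimal sketch under stated assumptions: an in-memory Python set stands in for the Redis set, and the `SeenIds` class, the `deduplicate` function, and the sample records are all hypothetical names introduced for illustration.

```python
from typing import Iterable, Iterator


class SeenIds:
    """In-memory stand-in for a Redis set of already-indexed video IDs (hypothetical)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add(self, video_id: str) -> bool:
        """Return True if the ID is new, False if it was already tracked."""
        if video_id in self._seen:
            return False
        self._seen.add(video_id)
        return True


def deduplicate(records: Iterable[dict], seen: SeenIds) -> Iterator[dict]:
    """Yield only the records whose `video_id` has not been seen before."""
    for record in records:
        if seen.add(record["video_id"]):
            yield record


# Hypothetical records; only the `video_id` and `creator` field names come from the schema.
batch = [
    {"video_id": "111", "creator": "a"},
    {"video_id": "222", "creator": "b"},
    {"video_id": "111", "creator": "a"},
]
seen = SeenIds()
unique = list(deduplicate(batch, seen))
print([r["video_id"] for r in unique])  # ['111', '222']
```

Because the tracker outlives any single batch, a later batch that repeats an ID from an earlier one is filtered as well, which is the property a shared ID store would give a long-running indexing pipeline.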
## Usage

### Using Hugging Face Datasets Library

```python
from datasets import load_dataset

ds = load_dataset("Shofo/shofo-tiktok-general-small", split="train")

# Access a sample
sample = ds[0]
print(sample["transcript"])
print(sample["description"])
print(sample["engagement_metrics"])
```

### Using Pandas

```python
import pandas as pd

df = pd.read_parquet("hf://datasets/Shofo/shofo-tiktok-general-small/metadata.parquet")

# Filter by engagement (videos with more than one million plays)
popular = df[df['engagement_metrics'].apply(lambda x: x['play_count'] > 1000000)]
```

### Accessing Videos

Videos are stored in the `videos/` directory and linked via the `file_name` column:

```python
from datasets import load_dataset

ds = load_dataset("Shofo/shofo-tiktok-general-small", split="train")

# Get video path
video_path = ds[0]["file_name"]  # e.g., "videos/7350916080610643231.mp4"
```

## Notes

- **Compression**: TikTok automatically applies H.264 compression to its videos, achieving ~50x compression with slight loss.
- **Engagement metrics**: Values are from the time of indexing
- **Comments**: Top 50 comments at the time of indexing
- **Nulls**: Some fields may be null (e.g., `transcript` if no speech, `sticker_text` if no overlays)

## Larger Versions

This is the "small" version of the Shofo TikTok dataset. Larger versions are available:

- **Shofo TikTok General (Medium)**: 10M+ videos
- **Shofo TikTok General (Large)**: 100M+ videos

## Citation

```bibtex
@dataset{shofo_tiktok_general_small_2025,
  title={Shofo TikTok General (Small)},
  author={Shofo},
  year={2025},
  url={https://huggingface.co/datasets/Shofo/shofo-tiktok-general-small}
}
```

## License & Disclaimer

This dataset is provided for research and experimental use. Shofo does not claim ownership of the underlying video content. Users are responsible for ensuring compliance with applicable copyright laws and platform terms when using this dataset.
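
## Appendix: Null-Safe Field Access

The Notes section says some fields may be null, and the schema describes several columns as JSON objects. The card does not state whether a loader returns those columns as decoded dictionaries or as JSON strings, so the sketch below assumes either may occur and treats anything else as missing. The helper names `as_dict` and `play_count`, and the sample rows, are hypothetical and not part of the dataset.

```python
import json
from typing import Any, Optional


def as_dict(value: Any) -> Optional[dict]:
    """Return a dict for a JSON-object column value, or None if it is missing.

    Accepts an already-decoded dict or a JSON string; any other value
    (None, NaN, malformed text) is treated as missing.
    """
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


def play_count(row: dict) -> int:
    """Read `play_count` from `engagement_metrics`, treating missing data as 0."""
    metrics = as_dict(row.get("engagement_metrics")) or {}
    return int(metrics.get("play_count") or 0)


# Hypothetical rows covering the three cases: dict, JSON string, and null.
rows = [
    {"video_id": "1", "engagement_metrics": {"play_count": 8948070}},
    {"video_id": "2", "engagement_metrics": '{"play_count": 1200}'},
    {"video_id": "3", "engagement_metrics": None},
]
popular = [r["video_id"] for r in rows if play_count(r) > 1_000_000]
print(popular)  # ['1']
```

Counting a missing metric as 0 keeps filters such as "more than one million plays" from raising on incomplete rows; if missing values need to be distinguished from true zeros, returning `None` from `play_count` would be the alternative.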
Provider:
Shofo