Shofo/shofo-tiktok-general-small
Hugging Face. Updated 2026-02-19; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/Shofo/shofo-tiktok-general-small
Resource description:
---
task_categories:
- video-classification
- text-generation
- audio-classification
language:
- en
- es
tags:
- short-form
- video
- transcripts
- multimodal
size_categories:
- 10K<n<100K
license: other
---
# Shofo TikTok General (Small)
## Overview
**Shofo TikTok General (Small)** is a dataset containing **50,000 TikTok videos** with comprehensive metadata, transcripts, comments, and engagement metrics. This is a curated subset of Shofo's larger TikTok index, which contains hundreds of millions of indexed videos.
- **Size**: \~50K videos (\~500GB)
- **Modality**: Video + Audio + Text (transcripts, comments, captions)
- **Source**: TikTok
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `file_name` | string | Relative path to video file (e.g., `videos/123.mp4`) |
| `video_id` | string | Unique TikTok video identifier |
| `web_url` | string | TikTok web URL for the video |
| `creator` | string | Creator username |
| `transcript` | string | Audio transcription (ASR-generated, may be null) |
| `description` | string | Video caption/description |
| `hashtags` | JSON array | List of hashtags used |
| `sticker_text` | JSON array | Text overlays/stickers visible in video |
| `comments` | JSON array | Top comments with metadata (see below) |
| `engagement_metrics` | JSON object | View counts, likes, shares, etc. (see below) |
| `date_posted` | timestamp | When the video was originally posted |
| `language` | JSON object | Language detection info (see below) |
| `fps` | int | Frames per second |
| `resolution` | string | Video resolution (e.g., `1080x1920`) |
| `duration_ms` | int | Video duration in milliseconds |
| `is_ai_generated` | bool | Whether the video was labeled as AI-generated |
| `is_ad` | bool | Whether the video is an advertisement |
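
The `resolution`, `duration_ms`, and `fps` columns are often easier to work with as numeric width/height, seconds, and an approximate frame count. The sketch below assumes `resolution` always follows the width-first `WIDTHxHEIGHT` pattern shown in the table; `VideoSpec` and `parse_video_spec` are illustrative names, not part of the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSpec:
    width: int
    height: int
    duration_s: float
    approx_frames: int


def parse_video_spec(resolution: str, duration_ms: int, fps: int) -> VideoSpec:
    """Convert raw schema values into numeric fields.

    Assumes `resolution` looks like "1080x1920" (width first), as in the
    schema example; anything else raises ValueError.
    """
    width_str, sep, height_str = resolution.lower().partition("x")
    if not sep:
        raise ValueError(f"unexpected resolution format: {resolution!r}")
    duration_s = duration_ms / 1000
    return VideoSpec(
        width=int(width_str),
        height=int(height_str),
        duration_s=duration_s,
        # Frame count is an estimate: duration times nominal frame rate
        approx_frames=round(duration_s * fps),
    )


# Made-up duration and fps values, for illustration only
spec = parse_video_spec("1080x1920", duration_ms=15_500, fps=30)
print(spec)
```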
### Engagement Metrics Structure
```json
{
"play_count": 8948070,
"like_count": 789584,
"comment_count": 1451,
"share_count": 38604,
"collect_count": 126905,
"repost_count": 0,
"download_count": 235172,
"whatsapp_share_count": 15737
}
```
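
Raw counts are hard to compare across videos with very different reach, so a common step is to normalize them by `play_count`. The helper below is a minimal sketch of that idea; `engagement_rates` and the `*_rate` key names are illustrative choices, not fields of the dataset.

```python
def engagement_rates(metrics: dict) -> dict:
    """Return per-play ratios for a few interaction counts.

    Returns an empty dict when `play_count` is missing or zero, so callers
    never divide by zero.
    """
    plays = metrics.get("play_count") or 0
    if plays <= 0:
        return {}
    keys = ("like_count", "comment_count", "share_count", "collect_count")
    return {k.replace("_count", "_rate"): (metrics.get(k) or 0) / plays for k in keys}


# Values copied from the example structure above
sample_metrics = {
    "play_count": 8948070,
    "like_count": 789584,
    "comment_count": 1451,
    "share_count": 38604,
    "collect_count": 126905,
}
rates = engagement_rates(sample_metrics)
print(f"like rate: {rates['like_rate']:.2%}")
```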
### Comments Structure
Each comment in the `comments` array contains:
```json
{
"cid": "7352452026457342726",
"text": "Comment text here",
"create_time": 1711876158,
"like_count": 885,
"reply_count": 9,
"username": "commenter_username",
"user_region": "MX",
"language": "es"
}
```
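
The `create_time` value looks like a Unix timestamp in seconds. Under that assumption (the card does not state the unit or time zone), the sketch below converts it to a UTC datetime and tallies comment languages. `comment_time` and `comment_languages` are illustrative helper names.

```python
from collections import Counter
from datetime import datetime, timezone


def comment_time(comment: dict) -> datetime:
    """Interpret `create_time` as Unix seconds in UTC (an assumption)."""
    return datetime.fromtimestamp(comment["create_time"], tz=timezone.utc)


def comment_languages(comments: list) -> Counter:
    """Count comments per detected language, skipping missing values."""
    return Counter(c["language"] for c in comments if c.get("language"))


# The example comment from the structure above
example_comments = [
    {"cid": "7352452026457342726", "text": "Comment text here",
     "create_time": 1711876158, "like_count": 885, "reply_count": 9,
     "username": "commenter_username", "user_region": "MX", "language": "es"},
]
posted = comment_time(example_comments[0])
print(posted.isoformat())
print(comment_languages(example_comments))
```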
### Language Structure
```json
{
"desc_language": "es",
"sticker_language": "en",
"region": "US",
"author_region": "US",
"original_audio_language": null
}
```
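
Because the description, sticker, and audio language fields can disagree or be null, it can help to collect only the populated ones before deciding how to label a video. This is a minimal sketch; `language_signals` is an illustrative name, and treating these three fields as the relevant signals is an assumption.

```python
def language_signals(language: dict) -> dict:
    """Keep only the language fields that are actually populated."""
    fields = ("desc_language", "sticker_language", "original_audio_language")
    return {f: language[f] for f in fields if language.get(f)}


# The example object from the structure above
example_language = {
    "desc_language": "es",
    "sticker_language": "en",
    "region": "US",
    "author_region": "US",
    "original_audio_language": None,
}
signals = language_signals(example_language)
print(signals)
# True when the populated signals disagree with each other
print(len(set(signals.values())) > 1)
```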
## Collection Methodology
Videos were collected through Shofo's TikTok indexing pipeline:
1. **Discovery**: Creators and hashtags are discovered through an explore/exploit strategy, snowballing from seed accounts
2. **Indexing**: Video metadata is fetched via TikTok's API
3. **Transcription**: Audio is transcribed using automatic speech recognition (ASR)
4. **Deduplication**: Videos are deduplicated using Redis-based ID tracking
This subset represents a curated sample from the larger index, selected for data quality and diversity.
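
The deduplication step can be illustrated with a small sketch. The card does not describe how Redis is used, so the code below substitutes an in-memory Python set for the shared store; `SeenIds` and `deduplicate` are hypothetical names, not Shofo's implementation.

```python
from typing import Iterable, Iterator


class SeenIds:
    """In-memory stand-in for a shared ID store such as a Redis set."""

    def __init__(self) -> None:
        self._ids = set()

    def add(self, video_id: str) -> bool:
        """Record the ID; return True only the first time it is seen."""
        if video_id in self._ids:
            return False
        self._ids.add(video_id)
        return True


def deduplicate(videos: Iterable[dict], seen: SeenIds) -> Iterator[dict]:
    """Yield each video whose `video_id` has not been recorded yet."""
    for video in videos:
        if seen.add(video["video_id"]):
            yield video


# Made-up IDs for illustration
batch = [{"video_id": "123"}, {"video_id": "456"}, {"video_id": "123"}]
unique = list(deduplicate(batch, SeenIds()))
print([v["video_id"] for v in unique])
```

Keeping the "seen" state behind a small interface means the same loop would work if the set were swapped for a real shared store.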
## Usage
### Using the Hugging Face Datasets Library
```python
from datasets import load_dataset
ds = load_dataset("Shofo/shofo-tiktok-general-small", split="train")
# Access a sample
sample = ds[0]
print(sample["transcript"])
print(sample["description"])
print(sample["engagement_metrics"])
```
### Using Pandas
```python
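import pandas as pd

# Reading an hf:// path requires the `huggingface_hub` package to be installed
df = pd.read_parquet("hf://datasets/Shofo/shofo-tiktok-general-small/metadata.parquet")

# Filter by engagement: keep videos with more than 1M plays
popular = df[df["engagement_metrics"].apply(lambda x: x["play_count"] > 1_000_000)]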
```
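
Depending on how the Parquet file stores `engagement_metrics`, a value may arrive as a dict, as a JSON-encoded string, or as null. The card does not state which, so the helper below is a defensive sketch that accepts all three and can be passed to `Series.apply`. `play_count_of` is an illustrative name.

```python
import json
from typing import Any


def play_count_of(metrics: Any) -> int:
    """Return `play_count` from a dict or JSON string; 0 if unavailable."""
    if isinstance(metrics, str):
        try:
            metrics = json.loads(metrics)
        except json.JSONDecodeError:
            return 0
    if not isinstance(metrics, dict):
        return 0
    return int(metrics.get("play_count") or 0)


# Made-up inputs covering each case: dict, JSON string, null, bad string
rows = [{"play_count": 8948070}, '{"play_count": 1200}', None, "not json"]
print([play_count_of(r) for r in rows])
# With pandas:
# popular = df[df["engagement_metrics"].apply(play_count_of) > 1_000_000]
```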
### Accessing Videos
Videos are stored in the `videos/` directory and linked via the `file_name` column:
```python
from datasets import load_dataset
ds = load_dataset("Shofo/shofo-tiktok-general-small", split="train")
# Get video path
video_path = ds[0]["file_name"] # e.g., "videos/7350916080610643231.mp4"
```
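
Since `file_name` is relative to the dataset repository, one way to fetch a single video without downloading the full ~500GB is `hf_hub_download` from the `huggingface_hub` package. The sketch assumes the videos are stored as plain `.mp4` files at those paths in the repository; `repo_file_path` and `fetch_video` are illustrative helper names.

```python
from pathlib import PurePosixPath

REPO_ID = "Shofo/shofo-tiktok-general-small"


def repo_file_path(file_name: str) -> str:
    """Validate a `file_name` value and return it as a repo-relative path."""
    path = PurePosixPath(file_name)
    if path.is_absolute() or ".." in path.parts or path.suffix != ".mp4":
        raise ValueError(f"unexpected file_name: {file_name!r}")
    return str(path)


def fetch_video(file_name: str) -> str:
    """Download one video and return its local cache path (needs network)."""
    from huggingface_hub import hf_hub_download  # imported lazily

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=repo_file_path(file_name),
        repo_type="dataset",
    )


print(repo_file_path("videos/7350916080610643231.mp4"))
# local_path = fetch_video("videos/7350916080610643231.mp4")
```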
## Notes
- **Compression**: TikTok automatically encodes its videos with H.264, a slightly lossy codec that achieves roughly 50x compression.
- **Engagement metrics**: Values are a snapshot taken at the time of indexing.
- **Comments**: The top 50 comments per video at the time of indexing.
- **Nulls**: Some fields may be null (e.g., `transcript` if no speech, `sticker_text` if no overlays)
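
Code that consumes the text fields should tolerate these nulls. A minimal sketch, assuming `sticker_text` may arrive either as a list or as a JSON-encoded array string; `text_fields` is an illustrative name and the example row is made up.

```python
import json


def text_fields(row: dict) -> list:
    """Collect the non-empty text from a row, tolerating null fields."""
    parts = []
    for key in ("transcript", "description"):
        value = row.get(key)
        if value:
            parts.append(value)
    stickers = row.get("sticker_text")
    if isinstance(stickers, str):  # JSON-encoded array (assumed possible)
        stickers = json.loads(stickers)
    parts.extend(s for s in (stickers or []) if s)
    return parts


# Hypothetical row with a null transcript
row = {"transcript": None, "description": "Morning routine", "sticker_text": '["Day 1"]'}
print(text_fields(row))
```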
## Larger Versions
This is the "small" version of the Shofo TikTok dataset. Larger versions are available:
- **Shofo TikTok General (Medium)**: 10M+ videos
- **Shofo TikTok General (Large)**: 100M+ videos
## Citation
```bibtex
@dataset{shofo_tiktok_general_small_2025,
title={Shofo TikTok General (Small)},
author={Shofo},
year={2025},
url={https://huggingface.co/datasets/Shofo/shofo-tiktok-general-small}
}
```
## License & Disclaimer
This dataset is provided for research and experimental use.
Shofo does not claim ownership of the underlying video content.
Users are responsible for ensuring compliance with applicable copyright laws and platform terms when using this dataset.
Provider:
Shofo



