bendavidsteel/douyin
Source: Hugging Face | Updated 2026-01-30 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/bendavidsteel/douyin
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
- feature-extraction
language:
- zh
size_categories:
- 1M<n<10M
---
# Douyin Posts Dataset
A dataset of **1,133,545 unique posts** from Douyin (the Chinese counterpart of TikTok), collected via snowball sampling of related videos.
## Dataset Description
This dataset contains metadata from Douyin short videos, collected by crawling related video recommendations. Starting from seed videos matching specific keywords, the crawler iteratively fetched related videos to build a diverse sample of the platform's content.
### Collection Method
- **Snowball sampling**: Starting from keyword-matched seed videos, related videos were iteratively crawled
- **Deduplication**: All posts are deduplicated by `aweme_id` (unique video identifier)
- **Partitioning**: Data is partitioned by the first 4 digits of `aweme_id` for efficient access
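The collection procedure above can be sketched as a breadth-first crawl with deduplication. This is an illustration only, not the author's crawler: `fetch_related` is a hypothetical callable standing in for whatever returns the related videos of a given `aweme_id`, and `max_posts` is an assumed stopping rule that the dataset card does not mention.

```python
from collections import deque
from typing import Callable, Iterable


def snowball_sample(
    seed_ids: Iterable[str],
    fetch_related: Callable[[str], Iterable[str]],  # hypothetical: returns related aweme_ids
    max_posts: int = 1000,  # assumed stopping rule, not from the dataset card
) -> list[str]:
    """Breadth-first snowball sample, deduplicated by aweme_id."""
    seen: set[str] = set()
    queue: deque[str] = deque()
    collected: list[str] = []

    # Enqueue each seed once.
    for aweme_id in seed_ids:
        if aweme_id not in seen:
            seen.add(aweme_id)
            queue.append(aweme_id)

    while queue and len(collected) < max_posts:
        current = queue.popleft()
        collected.append(current)
        for related_id in fetch_related(current):
            if related_id not in seen:  # deduplicate before enqueueing
                seen.add(related_id)
                queue.append(related_id)
    return collected


# Toy related-video graph used only to exercise the function.
toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
sample = snowball_sample(["a"], lambda vid: toy_graph.get(vid, []))
```

Tracking `seen` at enqueue time, rather than at visit time, keeps an ID from entering the queue twice when several videos recommend it.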
## Dataset Structure
The dataset contains 187 columns. Key fields include:
### Core Fields
| Field | Type | Description |
|-------|------|-------------|
| `aweme_id` | string | Unique video identifier |
| `desc` | string | Video description/caption |
| `caption` | string | Short caption |
| `create_time` | int64 | Unix timestamp of video creation |
| `duration` | int64 | Video duration in milliseconds |
| `region` | string | Geographic region code |
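As a small sketch of how the two numeric core fields might be interpreted: the helper below converts `create_time` to a UTC timestamp and `duration` to seconds. It assumes `create_time` is expressed in seconds (the table says only "Unix timestamp"), and `describe_post` is a hypothetical helper name; the input values are made up.

```python
from datetime import datetime, timezone


def describe_post(create_time: int, duration_ms: int) -> tuple[str, float]:
    """Return (ISO 8601 UTC creation time, duration in seconds)."""
    # Assumption: create_time counts seconds since the Unix epoch.
    created = datetime.fromtimestamp(create_time, tz=timezone.utc)
    return created.isoformat(), duration_ms / 1000


# Illustrative values, not taken from the dataset.
created_iso, seconds = describe_post(1_700_000_000, 15_500)
```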
### Engagement Statistics (`statistics` struct)
| Field | Description |
|-------|-------------|
| `digg_count` | Number of likes |
| `comment_count` | Number of comments |
| `share_count` | Number of shares |
| `play_count` | Number of views |
| `collect_count` | Number of saves/bookmarks |
### Author Information (`author` struct)
| Field | Description |
|-------|-------------|
| `uid` | Author user ID |
| `nickname` | Display name |
| `sec_uid` | Secondary user ID |
| `signature` | Author bio |
| `follower_status` | Follower relationship status |
### Content Metadata
| Field | Type | Description |
|-------|------|-------------|
| `cha_list` | list | Hashtag/challenge information |
| `text_extra` | list | Extracted hashtags, mentions, and links |
| `music` | struct | Audio/music metadata |
| `video` | struct | Video file metadata (URLs, dimensions, formats) |
| `poi_info` | struct | Location/point of interest data |
## Usage
```python
import polars as pl
from huggingface_hub import hf_hub_download, snapshot_download
from pathlib import Path

# Download a single partition
file_path = hf_hub_download(
    repo_id="bendavidsteel/douyin",
    filename="data/partition_7500.parquet.zstd",
    repo_type="dataset",
)
df = pl.read_parquet(file_path)

# Or download all partitions and read them with a glob pattern
local_dir = snapshot_download(
    repo_id="bendavidsteel/douyin",
    repo_type="dataset",
    allow_patterns="data/*.parquet.zstd",
)
df = pl.read_parquet(str(Path(local_dir) / "data" / "*.parquet.zstd"))
```
## File Structure
```
data/
├── partition_7035.parquet.zstd
├── partition_7040.parquet.zstd
├── ...
└── partition_7551.parquet.zstd
```
111 partition files, partitioned by the first 4 digits of `aweme_id`.
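Given this naming scheme, the file that should hold a particular video can be derived from its `aweme_id`. The following is a minimal sketch: `partition_filename` and `group_by_partition` are hypothetical helper names, the IDs are made up, and since there are only 111 files, a derived filename is not guaranteed to exist in the repository.

```python
from collections import defaultdict


def partition_filename(aweme_id: str) -> str:
    """Map an aweme_id to a partition path, following the naming shown above."""
    prefix = aweme_id[:4]
    if len(prefix) < 4 or not prefix.isdigit():
        raise ValueError(f"unexpected aweme_id: {aweme_id!r}")
    return f"data/partition_{prefix}.parquet.zstd"


def group_by_partition(aweme_ids: list[str]) -> dict[str, list[str]]:
    """Group IDs by partition so each file only needs to be downloaded once."""
    groups: dict[str, list[str]] = defaultdict(list)
    for aweme_id in aweme_ids:
        groups[partition_filename(aweme_id)].append(aweme_id)
    return dict(groups)


# Illustrative IDs, not taken from the dataset.
groups = group_by_partition(
    ["7500123456789012345", "7500999999999999999", "7040111111111111111"]
)
```

Each key of `groups` can then be passed as the `filename` argument of `hf_hub_download`.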
## Limitations
- Metadata only - does not include actual video/image content
- Point-in-time snapshot - engagement statistics reflect collection time
- Related video sampling may introduce biases toward popular/recommended content
## License
CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0)
Provider:
bendavidsteel



