bendavidsteel/douyin

Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/bendavidsteel/douyin
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
- feature-extraction
language:
- zh
size_categories:
- 1M<n<10M
---

# Douyin Posts Dataset

A dataset of **1,133,545 unique posts** from Douyin (Chinese TikTok), collected via snowball sampling of related videos.

## Dataset Description

This dataset contains metadata from Douyin short videos, collected by crawling related video recommendations. Starting from seed videos matching specific keywords, the crawler iteratively fetched related videos to build a diverse sample of the platform's content.

### Collection Method

- **Snowball sampling**: Starting from keyword-matched seed videos, related videos were iteratively crawled
- **Deduplication**: All posts are deduplicated by `aweme_id` (unique video identifier)
- **Partitioning**: Data is partitioned by the first 4 digits of `aweme_id` for efficient access

## Dataset Structure

The dataset contains 187 columns. Key fields include:

### Core Fields

| Field | Type | Description |
|-------|------|-------------|
| `aweme_id` | string | Unique video identifier |
| `desc` | string | Video description/caption |
| `caption` | string | Short caption |
| `create_time` | int64 | Unix timestamp of video creation |
| `duration` | int64 | Video duration in milliseconds |
| `region` | string | Geographic region code |

### Engagement Statistics (`statistics` struct)

| Field | Description |
|-------|-------------|
| `digg_count` | Number of likes |
| `comment_count` | Number of comments |
| `share_count` | Number of shares |
| `play_count` | Number of views |
| `collect_count` | Number of saves/bookmarks |

### Author Information (`author` struct)

| Field | Description |
|-------|-------------|
| `uid` | Author user ID |
| `nickname` | Display name |
| `sec_uid` | Secondary user ID |
| `signature` | Author bio |
| `follower_status` | Follower relationship status |

### Content Metadata

| Field | Type | Description |
|-------|------|-------------|
| `cha_list` | list | Hashtag/challenge information |
| `text_extra` | list | Extracted hashtags, mentions, and links |
| `music` | struct | Audio/music metadata |
| `video` | struct | Video file metadata (URLs, dimensions, formats) |
| `poi_info` | struct | Location/point of interest data |

## Usage

```python
from pathlib import Path

import polars as pl
from huggingface_hub import hf_hub_download, snapshot_download

# Download a single partition
file_path = hf_hub_download(
    repo_id="bendavidsteel/douyin",
    filename="data/partition_7500.parquet.zstd",
    repo_type="dataset"
)
df = pl.read_parquet(file_path)

# Or load all partitions
local_dir = snapshot_download(
    repo_id="bendavidsteel/douyin",
    repo_type="dataset",
    allow_patterns="data/*.parquet.zstd"
)
df = pl.read_parquet(Path(local_dir) / "data" / "*.parquet.zstd")
```

## File Structure

```
data/
├── partition_7035.parquet.zstd
├── partition_7040.parquet.zstd
├── ...
└── partition_7551.parquet.zstd
```

111 partition files, partitioned by the first 4 digits of `aweme_id`.

## Limitations

- Metadata only - does not include actual video/image content
- Point-in-time snapshot - engagement statistics reflect collection time
- Related video sampling may introduce biases toward popular/recommended content

## License

CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0)
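## Example: Locating a Partition and Normalizing Units

The partitioning scheme and field units described above can be sketched with a few small helpers. This is an illustrative sketch rather than part of the dataset's tooling: the function names are hypothetical, the path pattern is inferred from the file names listed under File Structure, and `create_time` is assumed to be a Unix timestamp in seconds (the card states only "Unix timestamp"). Not every four-digit prefix necessarily has a corresponding file, since the card lists 111 partitions.

```python
from datetime import datetime, timezone


def partition_filename(aweme_id: str) -> str:
    """Return the assumed repo path of the partition that would hold this ID.

    Assumption: files follow data/partition_<first 4 digits>.parquet.zstd,
    as suggested by the file listing in the dataset card.
    """
    if len(aweme_id) < 4 or not aweme_id[:4].isdigit():
        raise ValueError(f"aweme_id must start with four digits: {aweme_id!r}")
    return f"data/partition_{aweme_id[:4]}.parquet.zstd"


def created_at_utc(create_time: int) -> datetime:
    """Convert create_time (assumed to be in seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(create_time, tz=timezone.utc)


def duration_seconds(duration_ms: int) -> float:
    """Convert the duration field from milliseconds to seconds."""
    return duration_ms / 1000


# Hypothetical ID, used only to show the prefix-to-file mapping.
example_path = partition_filename("7500123456789012345")
```

The path returned by `partition_filename` could be passed as the `filename` argument of `hf_hub_download` in the Usage snippet, which avoids downloading every partition when only a known `aweme_id` is needed.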
Provided by:
bendavidsteel