VK-LSVD
Source: ModelScope community (updated 2025-12-05, indexed 2025-12-06)
Download link: https://modelscope.cn/datasets/deepvk/VK-LSVD
# VK-LSVD: Large Short-Video Dataset
**VK-LSVD** is the largest open industrial short-video recommendation dataset with real-world interactions:
- **40B** unique user–item interactions with rich feedback (`timespent`, `like`, `dislike`, `share`, `bookmark`, `click_on_author`, `open_comments`) and
context (`place`, `platform`, `agent`);
- **10M** users (with `age`, `gender`, `geo`);
- **20M** short videos (with `duration`, `author_id`, content `embedding`);
- **Global Temporal Ordering** across **six consecutive months** of user interactions.
**Why short video?** Users often watch dozens of clips per session, producing dense, time-ordered signals well suited for modeling.
Unlike music, podcasts, or long-form video, which are often consumed in the background, short videos are foreground by design. They are also not re-shown: each user–item pair appears at most once in this dataset.
Even without explicit feedback, signals such as skips, completions, and replays yield strong implicit labels.
Single-item feeds also simplify attribution and reduce confounding compared with multi-item layouts.
---
> **Note:** The test set will be released after the upcoming challenge.
---
[📊 Basic Statistics](#basic-statistics) • [🧱 Data Description](#data-description) • [⚡ Quick Start](#quick-start) • [🧩 Configurable Subsets](#configurable-subsets)
---
## Basic Statistics
- Users: **10,000,000**
- Items: **19,627,601**
- Unique interactions: **40,774,024,903**
- Interaction density: **0.0208%**
- Total watch time: **858,160,100,084 s**
- Likes: **1,171,423,458**
- Dislikes: **11,860,138**
- Shares: **262,734,328**
- Bookmarks: **40,124,463**
- Clicks on author: **84,632,666**
- Comment opens: **481,251,593**
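
The density figure can be reproduced from the counts above. This is a minimal sketch, assuming density means unique interactions divided by the number of possible user–item pairs; the card does not spell out the definition, but this one matches the reported value. The mean watch time and like rate are derived here for illustration and are not reported in the card.

```python
# Counts copied from the statistics above.
n_users = 10_000_000
n_items = 19_627_601
n_interactions = 40_774_024_903
total_watch_time_s = 858_160_100_084
n_likes = 1_171_423_458

# Assumed definition: share of all possible user-item pairs that were observed.
density = n_interactions / (n_users * n_items)

# Derived averages (illustrative, not taken from the card).
mean_timespent_s = total_watch_time_s / n_interactions
like_rate = n_likes / n_interactions

print(f"density: {density:.4%}")
print(f"mean watch time per interaction: {mean_timespent_s:.2f} s")
print(f"like rate: {like_rate:.2%}")
```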
---
## Data Description
**Privacy-preserving taxonomy** — all categorical metadata (`user_id`, `geo`, `item_id`, `author_id`, `place`, `platform`, `agent`) is anonymized into stable integer IDs (consistent across splits; no reverse mapping provided).
### Interactions
[interactions](https://huggingface.co/datasets/deepvk/VK-LSVD/tree/main/interactions)
Each row is one observation (a short video shown to a user) with feedback and context. There are no repeated exposures of the same user–item pair.
**Global Temporal Split (GTS):** `train` / `validation` / `test` preserve time order — train on the past, validate/test on the future.
**Chronology:** Files are organized by weeks (e.g., `week_XX.parquet`); rows within each file are in increasing timestamp order.
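
Because files are split by week, a time window is just a contiguous range of file names. The helper below is a hypothetical sketch: `week_files` and its arguments are not part of the dataset tooling. It assumes the layout used in the Quick Start, where training covers `week_00` to `week_24` and validation is `week_25`.

```python
from typing import Optional


def week_files(split: str, weeks: range, subsample: Optional[str] = None) -> list:
    """Build repository-relative paths for a contiguous block of weeks.

    `split` is 'train' or 'validation'; `subsample` is a subset name such as
    'up0.001_ip0.001', or None for the full `interactions/` directory.
    """
    root = f'subsamples/{subsample}' if subsample else 'interactions'
    return [f'{root}/{split}/week_{i:02}.parquet' for i in weeks]


# Last four training weeks and the validation week (assumed week numbering).
recent_train = week_files('train', range(21, 25), subsample='up0.001_ip0.001')
validation = week_files('validation', range(25, 26), subsample='up0.001_ip0.001')
```

Training only on the most recent weeks keeps the temporal split intact, since every selected training file still precedes the validation week.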
| Field | Type | Description |
|-----|----|-----------|
|`user_id`|uint32|User identifier|
|`item_id`|uint32|Video identifier|
|`place`|uint8|Place: feed/search/group/… (24 ids)|
|`platform`|uint8|Platform: Android/Web/TV/… (11 ids) |
|`agent`|uint8|Agent/client: browser/app (29 ids)|
|`timespent`|uint8|Watch time (0–255 seconds)|
|`like`|boolean|User liked the video|
|`dislike`|boolean|User disliked the video|
|`share`|boolean|User shared the video|
|`bookmark`|boolean|User bookmarked the video|
|`click_on_author`|boolean|User opened author page|
|`open_comments`|boolean|User opened the comments section |
### Users metadata
[users_metadata.parquet](metadata/users_metadata.parquet)
| Field | Type | Description |
|-----|----|-----------|
|`user_id`|uint32|User identifier|
|`age`|uint8|Age (18–70 years)|
|`gender`|uint8|Gender|
|`geo`|uint8|Most frequent user location (80 ids)|
|`train_interactions_rank`|uint32|Popularity rank for sampling (lower = more interactions)|
### Items metadata
[items_metadata.parquet](metadata/items_metadata.parquet)
| Field | Type | Description |
|-----|----|-----------|
|`item_id`|uint32|Video identifier|
|`author_id`|uint32|Author identifier|
|`duration`|uint8|Video duration (seconds)|
|`train_interactions_rank`|uint32|Popularity rank for sampling (lower = more interactions)|
### Embeddings: variable width
**Embeddings are trained strictly on content** (video/description/audio, etc.) — no collaborative signal mixed in.
**Components are ordered**: the _dot product_ of the first n components approximates the _cosine_ similarity of the original production embeddings.
This lets researchers pick any dimensionality (**1…64**) to trade quality for speed and memory.
[item_embeddings.npz](metadata/item_embeddings.npz)
| Field | Type | Description |
|-----|----|-----------|
|`item_id`|uint32|Video identifier|
|`embedding`|float16[64]|Item content embedding with ordered components|
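
Because the components are ordered, choosing a dimensionality is a plain slice followed by a dot product. The sketch below uses random vectors as a stand-in for the real `embedding` array, so it only illustrates the mechanics; random data does not have the ordered-component property. `truncated_similarity` is a hypothetical helper name, and the cast to `float32` is a precaution rather than a documented requirement.

```python
import numpy as np


def truncated_similarity(embeddings: np.ndarray, n: int) -> np.ndarray:
    """Pairwise dot products using only the first `n` embedding components."""
    if not 1 <= n <= embeddings.shape[1]:
        raise ValueError('n must be between 1 and the embedding width')
    # float16 is compact in storage; compute in float32 to limit rounding error.
    head = embeddings[:, :n].astype(np.float32)
    return head @ head.T


# Random stand-in for the `embedding` array from item_embeddings.npz.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 64)).astype(np.float16)

sim_32 = truncated_similarity(fake_embeddings, 32)  # cheaper, coarser
sim_64 = truncated_similarity(fake_embeddings, 64)  # full width
print(sim_32.shape, sim_64.shape)
```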
---
## Quick Start
### Load a small subsample
```python
from huggingface_hub import hf_hub_download
import polars as pl
import numpy as np

subsample_name = 'up0.001_ip0.001'
content_embedding_size = 32  # keep the first 32 of the 64 embedding components

# Weeks 00-24 form the training period; week 25 is the validation week.
train_interactions_files = [f'subsamples/{subsample_name}/train/week_{i:02}.parquet'
                            for i in range(25)]
val_interactions_file = [f'subsamples/{subsample_name}/validation/week_25.parquet']
metadata_files = ['metadata/users_metadata.parquet',
                  'metadata/items_metadata.parquet',
                  'metadata/item_embeddings.npz']

# Download every file into ./VK-LSVD, mirroring the repository layout.
for file in (train_interactions_files +
             val_interactions_file +
             metadata_files):
    hf_hub_download(
        repo_id='deepvk/VK-LSVD', repo_type='dataset',
        filename=file, local_dir='VK-LSVD'
    )

# Concatenate the weekly files lazily, then materialize with the streaming engine.
train_interactions = pl.concat([pl.scan_parquet(f'VK-LSVD/{file}')
                                for file in train_interactions_files])
train_interactions = train_interactions.collect(engine='streaming')
val_interactions = pl.read_parquet(f'VK-LSVD/{val_interactions_file[0]}')

train_users = train_interactions.select('user_id').unique()
train_items = train_interactions.select('item_id').unique()

# Keep embeddings only for items seen in training, truncated to the chosen width.
item_ids = np.load('VK-LSVD/metadata/item_embeddings.npz')['item_id']
item_embeddings = np.load('VK-LSVD/metadata/item_embeddings.npz')['embedding']
mask = np.isin(item_ids, train_items.to_numpy())
item_ids = item_ids[mask]
item_embeddings = item_embeddings[mask]
item_embeddings = item_embeddings[:, :content_embedding_size]

# Restrict metadata to training users/items and attach the embeddings.
users_metadata = pl.read_parquet('VK-LSVD/metadata/users_metadata.parquet')
items_metadata = pl.read_parquet('VK-LSVD/metadata/items_metadata.parquet')
users_metadata = users_metadata.join(train_users, on='user_id')
items_metadata = items_metadata.join(train_items, on='item_id')
items_metadata = items_metadata.join(pl.DataFrame({'item_id': item_ids,
                                                   'embedding': item_embeddings}),
                                     on='item_id')
```
---
## Configurable Subsets
We provide several ready-made slices and simple utilities to compose your own subset that matches your task, data budget, and hardware.
You can control density via popularity quantiles (`train_interactions_rank`), draw random users,
or pick specific time windows — while preserving the Global Temporal Split.
Representative subsamples are provided for quick experiments:
| Subset | Users | Items | Interactions | Density |
|-----|----:|-----------:|-----------:|-----------:|
|`whole`|10,000,000|19,627,601|40,774,024,903|0.0208%|
|`ur0.1`|1,000,000|18,701,510|4,066,457,259|0.0217%|
|`ur0.01`|100,000|12,467,302|407,854,360|0.0327%|
|`ur0.01_ir0.01`|90,178|125,018|4,044,900|0.0359%|
|`up0.01_ir0.01`|100,000|171,106|38,404,921|0.2245%|
|`ur0.01_ip0.01`|99,893|196,277|191,625,941|0.9774%|
|`up0.01_ip0.01`|100,000|196,277|1,417,906,344|7.2240%|
|`up0.001_ip0.001`|10,000|19,628|47,976,280|24.4428%|
|`up-0.9_ip-0.9`|8,939,432|17,654,817|2,861,937,212|0.0018%|
- `urX` — X fraction of **r**andom **u**sers (e.g., `ur0.01` = 1% of users).
- `ipX` — X fraction of most **p**opular **i**tems (by `train_interactions_rank`).
- `upX` and `irX` follow the same pattern: most **p**opular **u**sers and **r**andom **i**tems.
- Negative X denotes the least-popular fraction (e.g., `-0.9` → bottom 90%).
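
The naming scheme above is regular enough to decode programmatically. The parser below is a hypothetical sketch, not part of the dataset tooling; it assumes every subset name is either `whole` or a list of `<u|i><r|p><fraction>` components joined by underscores, as in the table.

```python
import re
from typing import NamedTuple


class SubsetRule(NamedTuple):
    entity: str      # 'user' or 'item'
    method: str      # 'random' or 'popular'
    fraction: float  # negative means the least-popular share


_RULE = re.compile(r'^(?P<entity>[ui])(?P<method>[rp])(?P<fraction>-?\d+(?:\.\d+)?)$')


def parse_subset_name(name: str) -> list:
    """Decode names such as 'ur0.01_ip0.01'; 'whole' means no filtering."""
    if name == 'whole':
        return []
    rules = []
    for part in name.split('_'):
        match = _RULE.match(part)
        if match is None:
            raise ValueError(f'unrecognized subset component: {part!r}')
        rules.append(SubsetRule(
            entity={'u': 'user', 'i': 'item'}[match['entity']],
            method={'r': 'random', 'p': 'popular'}[match['method']],
            fraction=float(match['fraction']),
        ))
    return rules


print(parse_subset_name('up-0.9_ip-0.9'))
```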
For example, to get [ur0.01_ip0.01](https://huggingface.co/datasets/deepvk/VK-LSVD/tree/main/subsamples/ur0.01_ip0.01) (1% of **r**andom **u**sers, 1% of most **p**opular **i**tems) use the snippet below.
```python
import polars as pl

def get_sample(entries: pl.LazyFrame, split_column: str, fraction: float) -> pl.LazyFrame:
    if fraction >= 0:
        # Keep rows at or below the `fraction` quantile (lowest values).
        entries = entries.filter(pl.col(split_column) <=
                                 pl.col(split_column).quantile(fraction,
                                                               interpolation='midpoint'))
    else:
        # Negative fraction: keep rows at or above the (1 + fraction) quantile.
        entries = entries.filter(pl.col(split_column) >=
                                 pl.col(split_column).quantile(1 + fraction,
                                                               interpolation='midpoint'))
    return entries

# "Random" users: a quantile cut on the anonymized `user_id`.
users = pl.scan_parquet('VK-LSVD/metadata/users_metadata.parquet')
users_sample = get_sample(users, 'user_id', 0.01).select(['user_id'])

# Most popular items: a quantile cut on the popularity rank.
items = pl.scan_parquet('VK-LSVD/metadata/items_metadata.parquet')
items_sample = get_sample(items, 'train_interactions_rank', 0.01).select(['item_id'])

# Inner joins keep only interactions between sampled users and sampled items.
interactions = pl.scan_parquet('VK-LSVD/interactions/validation/week_25.parquet')
interactions = interactions.join(users_sample, on='user_id', maintain_order='left')
interactions = interactions.join(items_sample, on='item_id', maintain_order='left')
interactions_sample = interactions.collect(engine='streaming')
```
To get [up-0.9_ip-0.9](https://huggingface.co/datasets/deepvk/VK-LSVD/tree/main/subsamples/up-0.9_ip-0.9) (90% of least **p**opular **u**sers, 90% of least **p**opular **i**tems) replace users and items sampling lines with
```python
users_sample = get_sample(users, 'train_interactions_rank', -0.9).select(['user_id'])
items_sample = get_sample(items, 'train_interactions_rank', -0.9).select(['item_id'])
```
Provider: maas
Created: 2025-08-28