
yambda

ModelScope community · updated 2026-05-02 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/yambda
Description:
# Yambda-5B: A Large-Scale Multi-modal Dataset for Ranking And Retrieval

**Industrial-scale music recommendation dataset with organic/recommendation interactions and audio embeddings**

[📌 Overview](#overview) • [🔑 Key Features](#key-features) • [📊 Statistics](#statistics) • [📝 Format](#data-format) • [🏆 Benchmark](#benchmark) • [⬇️ Download](#download) • [❓ FAQ](#faq)

## Overview

The Yambda-5B dataset is a large-scale open database comprising **4.79 billion user-item interactions** collected from **1 million users** and spanning **9.39 million tracks**. The dataset includes both implicit feedback, such as listening events, and explicit feedback, in the form of likes and dislikes. Additionally, it provides distinctive markers for organic versus recommendation-driven interactions, along with precomputed audio embeddings to facilitate content-aware recommendation systems.

Preprint: https://arxiv.org/abs/2505.22238

## Key Features

- 🎵 4.79B user-music interactions (listens, likes, dislikes, unlikes, undislikes)
- 🕒 Timestamps with global temporal ordering
- 🔊 Audio embeddings for 7.72M tracks
- 💡 Organic and recommendation-driven interactions
- 📈 Multiple dataset scales (50M, 500M, 5B interactions)
- 🧪 Standardized evaluation protocol with baseline benchmarks

## About Dataset

### Statistics

| Dataset     |     Users |     Items |       Listens |      Likes |   Dislikes |
|-------------|----------:|----------:|--------------:|-----------:|-----------:|
| Yambda-50M  |    10,000 |   934,057 |    46,467,212 |    881,456 |    107,776 |
| Yambda-500M |   100,000 | 3,004,578 |   466,512,103 |  9,033,960 |  1,128,113 |
| Yambda-5B   | 1,000,000 | 9,390,623 | 4,649,567,411 | 89,334,605 | 11,579,143 |

### User History Length Distribution

![user history length](assets/img/user_history_len.png "User History Length")

![user history length log-scale](assets/img/user_history_log_len.png "User History Length Log-scale")

### Item Interaction Count

![item interaction count log-scale](assets/img/item_interactions.png "Item Interaction Count Log-scale")

## Data Format

### File Descriptions

| File                  | Description                                 | Schema |
|-----------------------|---------------------------------------------|--------|
| `listens.parquet`     | User listening events with playback details | `uid`, `item_id`, `timestamp`, `is_organic`, `played_ratio_pct`, `track_length_seconds` |
| `likes.parquet`       | User like actions                           | `uid`, `item_id`, `timestamp`, `is_organic` |
| `dislikes.parquet`    | User dislike actions                        | `uid`, `item_id`, `timestamp`, `is_organic` |
| `undislikes.parquet`  | User undislike actions (reverting dislikes) | `uid`, `item_id`, `timestamp`, `is_organic` |
| `unlikes.parquet`     | User unlike actions (reverting likes)       | `uid`, `item_id`, `timestamp`, `is_organic` |
| `multi_event.parquet` | Unified events                              | `uid`, `item_id`, `timestamp`, `is_organic`, `event_type`, `played_ratio_pct`, `track_length_seconds` |
| `embeddings.parquet`  | Track audio embeddings                      | `item_id`, `embed`, `normalized_embed` |

### Common Event Structure (Homogeneous)

Most event files (`listens`, `likes`, `dislikes`, `undislikes`, `unlikes`) share this base structure:

| Field        | Type   | Description |
|--------------|--------|-------------|
| `uid`        | uint32 | Unique user identifier |
| `item_id`    | uint32 | Unique track identifier |
| `timestamp`  | uint32 | Delta times, binned into 5-second units |
| `is_organic` | uint8  | Boolean flag (0/1) indicating if the interaction was algorithmic (0) or organic (1) |

**Sorting**: All files are sorted by (`uid`, `timestamp`) in ascending order.
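To make the common event schema concrete, here is a minimal sketch in plain Python. It assumes rows have already been read into dictionaries carrying the four documented fields. The helper names `bins_to_seconds` and `split_by_source`, and the sample rows, are hypothetical illustrations and not part of the dataset tooling. The conversion treats a timestamp as a count of 5-second bins, so the result is only an approximate offset in seconds relative to the dataset's own time origin.

```python
from typing import Dict, Iterable, List, Tuple

BIN_SECONDS = 5  # timestamps are binned into 5-second units per the schema


def bins_to_seconds(timestamp: int) -> int:
    """Convert a binned timestamp into an approximate offset in seconds."""
    return timestamp * BIN_SECONDS


def split_by_source(events: Iterable[Dict[str, int]]) -> Tuple[List[dict], List[dict]]:
    """Split events into (organic, recommendation-driven) using is_organic."""
    organic: List[dict] = []
    recommended: List[dict] = []
    for event in events:
        # is_organic == 1 means organic, 0 means recommendation-driven.
        if event["is_organic"] == 1:
            organic.append(event)
        else:
            recommended.append(event)
    return organic, recommended


# Made-up sample rows, sorted by (uid, timestamp) like the real files.
sample_events = [
    {"uid": 1, "item_id": 10, "timestamp": 0, "is_organic": 1},
    {"uid": 1, "item_id": 11, "timestamp": 12, "is_organic": 0},
    {"uid": 2, "item_id": 10, "timestamp": 3, "is_organic": 1},
]

organic_events, recommended_events = split_by_source(sample_events)
elapsed_seconds = [bins_to_seconds(e["timestamp"]) for e in sample_events]
```

Keeping the organic flag as an explicit split rather than a filter makes it easy to compare the two interaction sources side by side.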
### Unified Event Structure (Heterogeneous)

For applications needing all event types in a unified format:

| Field                  | Type             | Description |
|------------------------|------------------|-------------|
| `uid`                  | uint32           | Unique user identifier |
| `item_id`              | uint32           | Unique track identifier |
| `timestamp`            | uint32           | Timestamp binned into 5-second units |
| `is_organic`           | uint8            | Boolean flag for organic interactions |
| `event_type`           | enum             | One of: `listen`, `like`, `dislike`, `unlike`, `undislike` |
| `played_ratio_pct`     | Optional[uint16] | Percentage of track played (typically 1-100; values above 100 are possible, see FAQ), null for non-listen events |
| `track_length_seconds` | Optional[uint32] | Total track duration in seconds, null for non-listen events |

**Notes**:

- `played_ratio_pct` and `track_length_seconds` are non-null **only** when `event_type = "listen"`
- All fields except the two above are guaranteed non-null

### Sequential (Aggregated) Format

Each dataset is also available in a user-aggregated sequential format with the following structure:

| Field                  | Type                   | Description |
|------------------------|------------------------|-------------|
| `uid`                  | uint32                 | Unique user identifier |
| `item_ids`             | List[uint32]           | Chronological list of interacted track IDs |
| `timestamps`           | List[uint32]           | Corresponding interaction timestamps |
| `is_organic`           | List[uint8]            | Corresponding organic flags for each interaction |
| `played_ratio_pct`     | List[Optional[uint16]] | (Only in `listens` and `multi_event`) Play percentages |
| `track_length_seconds` | List[Optional[uint32]] | (Only in `listens` and `multi_event`) Track durations |

**Notes**:

- All lists maintain chronological order
- For each user, `len(item_ids) == len(timestamps) == len(is_organic)`
- In multi-event format, null values are preserved in respective lists

## Benchmark

Code for the baseline models can be found in the `benchmarks/` directory; see the [Reproducibility Guide](benchmarks/README.md).

## Download

Simplest way:

```python
from datasets import load_dataset

ds = load_dataset("yandex/yambda", data_dir="flat/50m", data_files="likes.parquet")
```

We also provide a simple wrapper for convenient usage:

```python
from typing import Literal

from datasets import Dataset, DatasetDict, load_dataset


class YambdaDataset:
    INTERACTIONS = frozenset([
        "likes", "listens", "multi_event", "dislikes", "unlikes", "undislikes"
    ])

    def __init__(
        self,
        dataset_type: Literal["flat", "sequential"] = "flat",
        dataset_size: Literal["50m", "500m", "5b"] = "50m"
    ):
        assert dataset_type in {"flat", "sequential"}
        assert dataset_size in {"50m", "500m", "5b"}
        self.dataset_type = dataset_type
        self.dataset_size = dataset_size

    def interaction(self, event_type: Literal[
        "likes", "listens", "multi_event", "dislikes", "unlikes", "undislikes"
    ]) -> Dataset:
        assert event_type in YambdaDataset.INTERACTIONS
        return self._download(f"{self.dataset_type}/{self.dataset_size}", event_type)

    def audio_embeddings(self) -> Dataset:
        return self._download("", "embeddings")

    def album_item_mapping(self) -> Dataset:
        return self._download("", "album_item_mapping")

    def artist_item_mapping(self) -> Dataset:
        return self._download("", "artist_item_mapping")

    @staticmethod
    def _download(data_dir: str, file: str) -> Dataset:
        data = load_dataset("yandex/yambda", data_dir=data_dir, data_files=f"{file}.parquet")
        # Returns DatasetDict; extracting the only split
        assert isinstance(data, DatasetDict)
        return data["train"]


dataset = YambdaDataset("flat", "50m")
likes = dataset.interaction("likes")  # returns a HuggingFace Dataset
```

## FAQ

### Are test items present in the training data?

Not all of them: some test items appear in the training set, others do not.

### Are test users present in the training data?

Yes, there are no cold users in the test set.

### How are audio embeddings generated?

Using a convolutional neural network inspired by Contrastive Learning of Musical Representations (J. Spijkervet et al., 2021).

### What's the `is_organic` flag?

It indicates whether an interaction occurred through organic discovery (True) or a recommendation-driven pathway (False).

### Which events are considered recommendation-driven?

Recommendation events include actions from:

- Personalized music feed
- Personalized playlists

### What counts as a "listened" track or \\(Listen_+\\)?

A track is considered "listened" if over 50% of its duration is played.

### What does it mean when played_ratio_pct is greater than 100?

A `played_ratio_pct` greater than 100 indicates that the user rewound and replayed sections of the track, so the total time listened exceeds the original track length. These values are expected behavior and not log noise. See [discussion](https://huggingface.co/datasets/yandex/yambda/discussions/10)
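
The "listened" rule and the over-100 case from the FAQ can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' evaluation code: the function names `is_listen_plus` and `listened_seconds` are hypothetical, the strict `>` comparison is one reading of "over 50%", and null values (non-listen events) are treated as not listened.

```python
from typing import List, Optional

LISTEN_PLUS_THRESHOLD_PCT = 50  # "over 50% of its duration", per the FAQ


def is_listen_plus(played_ratio_pct: Optional[int]) -> bool:
    """Return True when a listen event counts as "listened".

    None (as stored for non-listen events) is treated as not listened.
    Values above 100 are valid and simply count as listened.
    """
    if played_ratio_pct is None:
        return False
    return played_ratio_pct > LISTEN_PLUS_THRESHOLD_PCT


def listened_seconds(
    played_ratio_pct: Optional[int], track_length_seconds: Optional[int]
) -> Optional[float]:
    """Estimate total seconds played; None when either field is null."""
    if played_ratio_pct is None or track_length_seconds is None:
        return None
    return track_length_seconds * played_ratio_pct / 100


# Made-up ratios: a skip, a listen, a non-listen event, a replay, the boundary.
sample_ratios: List[Optional[int]] = [30, 51, None, 120, 50]
listen_plus_flags = [is_listen_plus(r) for r in sample_ratios]
replay_seconds = listened_seconds(120, 200)
```

Under these assumptions, a ratio of 120 on a 200-second track corresponds to 240 seconds of playback, which is why such rows should be kept rather than filtered out as noise.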

Provider:
maas
Created:
2025-05-29
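Summary: Yambda-5B is a large-scale music recommendation dataset containing 4.79 billion user-music interaction records from 1 million users across 9.39 million tracks. It provides implicit and explicit feedback, distinguishes organic from recommendation-driven interactions, and includes precomputed audio embeddings for research on content-aware recommendation systems.
(Summary collected and generated by the hosting site, 遇见数据集.)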