SOSD (Searching on Sorted Data)

Name: SOSD (Searching on Sorted Data)
Creator: OpenDataLab
Published: 2026-05-24 07:30:03
License: Not specified

OpenDataLab. Updated: 2026-05-24. Indexed: 2024-05-09.

Download link:

https://opendatalab.org.cn/OpenDataLab/SOSD

Resource overview:

SOSD is a collection of datasets for benchmarking the lookup performance of learned indexes. It currently includes eight datasets, each consisting of 200 million 64-bit unsigned integer keys with very few, if any, duplicates:

- amzn: book sales popularity data.
- face: an upsampled version of a Facebook user ID dataset.
- logn: values drawn from a log-normal (0, 2) distribution.
- norm: values drawn from a normal distribution.
- osmc: uniformly sampled OpenStreetMap locations, represented as Google S2 CellIds.
- uden: dense integers.
- uspr: uniformly distributed sparse integers.
- wiki: edit timestamps of Wikipedia articles.

All datasets except osmc and wiki also have 32-bit versions with similar cumulative distribution functions (CDFs). The 32-bit version of logn uses different parameters, (0, 1), to reduce the number of duplicates.
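
To make the description concrete, here is a minimal Python sketch of what "searching on sorted data" means for three of the synthetic key distributions. It is an illustration under stated assumptions, not the SOSD generator or benchmark harness: the function names `make_keys` and `lower_bound`, the scaling factor applied to the log-normal samples, and the small key counts are all hypothetical choices made for this example.

```python
import bisect
import random

UINT64_MAX = 2**64 - 1
LOGN_SCALE = 1_000_000_000  # assumed scaling so samples spread over integers


def make_keys(kind, n, seed=42):
    """Return n sorted, distinct keys that fit in an unsigned 64-bit integer."""
    rng = random.Random(seed)
    if kind == "uden":
        # Dense integers: consecutive values with no gaps.
        return list(range(n))
    keys = set()
    while len(keys) < n:
        if kind == "uspr":
            # Sparse integers drawn uniformly from the full 64-bit range.
            k = rng.randrange(UINT64_MAX + 1)
        elif kind == "logn":
            # Log-normal with mu=0, sigma=2, scaled and truncated to an integer.
            k = int(rng.lognormvariate(0, 2) * LOGN_SCALE)
        else:
            raise ValueError(f"unknown kind: {kind}")
        if k <= UINT64_MAX:
            keys.add(k)  # the set discards duplicates
    return sorted(keys)


def lower_bound(keys, query):
    """Index of the first key >= query, or len(keys) if there is none."""
    return bisect.bisect_left(keys, query)


if __name__ == "__main__":
    for kind in ("uden", "uspr", "logn"):
        keys = make_keys(kind, 1000)
        probe = keys[len(keys) // 2]
        print(kind, lower_bound(keys, probe))
```

Binary search, as in `lower_bound` here, is the kind of baseline a learned index would be compared against: the learned index tries to predict the position of a key from the shape of the CDF, which is why the datasets differ mainly in their distributions.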

Provided by:

OpenDataLab

Created:

2022-05-05

Summary

Dataset introduction