
erfanloghmani/myket-android-application-recommendation-dataset

Source: Hugging Face. Updated: 2024-06-03. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/erfanloghmani/myket-android-application-recommendation-dataset
Description:
---
license: cc-by-nc-4.0
task_categories:
- graph-ml
size_categories:
- 100K<n<1M
configs:
- config_name: main_data
  data_files: "myket.csv"
- config_name: package_name_features
  data_files: "app_info.csv"
---

# Myket Android Application Install Dataset

This dataset contains information on application install interactions of users in the [Myket](https://myket.ir/) Android application market. It was created to evaluate interaction prediction models, which require user and item identifiers along with the timestamps of the interactions.

## Data Creation

The dataset was initially generated by the Myket data team, and later cleaned and subsampled by Erfan Loghmani, a master's student at Sharif University of Technology at the time. The data team focused on a two-week period and randomly sampled 1/3 of the users with interactions during that period. They then selected install and update interactions for three months before and after the two-week period, resulting in interactions spanning about 6 months and two weeks.

We further subsampled and cleaned the data to focus on application download interactions. We identified the top 8000 most installed applications and selected interactions related to them. We retained users with more than 32 interactions, resulting in 280,391 users. From this group, we randomly selected 10,000 users, and the data was filtered to include only interactions for these users. The detailed procedure can be found [here](https://github.com/erfanloghmani/myket-android-application-market-dataset/blob/main/create_data.ipynb).

## Data Structure

The dataset has two main files.

- `myket.csv`: This file contains the interaction information and follows the same format as the datasets used in the "[JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks](https://github.com/claws-lab/jodie)" (ACM SIGKDD 2019) project. However, this data does not contain state labels or interaction features, so the associated columns are all zero.
- `app_info_sample.csv`: This file contains features of the applications present in the sample. For each application, it includes the approximate number of installs, the average rating, the count of ratings, and the category.

**Note**: The number of installs, average rating, and count of ratings were gathered during the period of interaction data collection, so they may cause information leakage if used for interaction prediction tasks.

## Dataset Details

- Total Instances: 694,121 install interaction instances
- Instances Format: Triplets of user_id, app_name, timestamp
- 10,000 users and 7,988 Android applications

For a detailed summary of the data's statistics, including information on users, applications, and interactions, please refer to the Python notebook available at [summary-stats.ipynb](https://github.com/erfanloghmani/myket-android-application-market-dataset/blob/main/summary-stats.ipynb). The notebook provides an overview of the dataset's characteristics and can be helpful for understanding the data's structure before using it for research or analysis.

### Top 20 Most Installed Applications

| Package Name                       | Count of Interactions |
| ---------------------------------- | --------------------- |
| com.instagram.android              | 15292                 |
| ir.resaneh1.iptv                   | 12143                 |
| com.tencent.ig                     | 7919                  |
| com.ForgeGames.SpecialForcesGroup2 | 7797                  |
| ir.nomogame.ClutchGame             | 6193                  |
| com.dts.freefireth                 | 6041                  |
| com.whatsapp                       | 5876                  |
| com.supercell.clashofclans         | 5817                  |
| com.mojang.minecraftpe             | 5649                  |
| com.lenovo.anyshare.gps            | 5076                  |
| ir.medu.shad                       | 4673                  |
| com.firsttouchgames.dls3           | 4641                  |
| com.activision.callofduty.shooter  | 4357                  |
| com.tencent.iglite                 | 4126                  |
| com.aparat                         | 3598                  |
| com.kiloo.subwaysurf               | 3135                  |
| com.supercell.clashroyale          | 2793                  |
| co.palang.QuizOfKings              | 2589                  |
| com.nazdika.app                    | 2436                  |
| com.digikala                       | 2413                  |

## Comparison with SNAP Datasets

The Myket dataset introduced in this repository exhibits distinct characteristics compared to the real-world datasets used by the JODIE project. The table below provides a comparative overview of the key dataset characteristics:

| Dataset   | #Users     | #Items    | #Interactions | Average Interactions per User | Average Unique Items per User |
| --------- | ---------- | --------- | ------------- | ----------------------------- | ----------------------------- |
| **Myket** | **10,000** | **7,988** | 694,121       | 69.4                          | 54.6                          |
| LastFM    | 980        | 1,000     | 1,293,103     | 1,319.5                       | 158.2                         |
| Reddit    | **10,000** | 984       | 672,447       | 67.2                          | 7.9                           |
| Wikipedia | 8,227      | 1,000     | 157,474       | 19.1                          | 2.2                           |
| MOOC      | 7,047      | 97        | 411,749       | 58.4                          | 25.3                          |

The Myket dataset stands out by having an ample number of both users and items, highlighting its relevance for real-world, large-scale applications. Unlike the LastFM, Reddit, and Wikipedia datasets, where users repeatedly interact with the same items, the Myket dataset contains comparatively few repeated interactions. This characteristic reflects the diverse nature of user behavior in the Android application market.

## Citation

If you use this dataset in your research, please cite the following [preprint](https://arxiv.org/abs/2308.06862):

```
@misc{loghmani2023effect,
      title={Effect of Choosing Loss Function when Using T-batching for Representation Learning on Dynamic Networks},
      author={Erfan Loghmani and MohammadAmin Fazli},
      year={2023},
      eprint={2308.06862},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
Provider:
erfanloghmani
Summary of Original Information

Myket Android Application Install Dataset

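Dataset Overview

This dataset contains application install interactions of users in the Myket Android application market. It is intended for evaluating interaction prediction models, which require user and item identifiers along with the timestamps of the interactions.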

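Data Creation

The dataset was initially generated by the Myket data team and later cleaned and subsampled by Erfan Loghmani, then a master's student at Sharif University of Technology. The data team focused on a two-week period, randomly sampled 1/3 of the users with interactions during that period, and selected install and update interactions for the three months before and after it, giving a time span of about six months and two weeks.

The data was then further subsampled and cleaned to focus on application download interactions. The 8000 most installed applications were identified and the interactions related to them were selected. Users with more than 32 interactions were retained, which left 280,391 users. From these, 10,000 users were selected at random, and the data was filtered to include only their interactions. The detailed procedure is in the create_data.ipynb notebook in the dataset's GitHub repository.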
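The subsampling steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from the authors' notebook: the function name `subsample`, the tuple layout `(user_id, app_name, timestamp)`, and the choice to count a user's interactions after the top-app filter are all hypothetical, and the toy data is made up.

```python
import random
from collections import Counter


def subsample(interactions, top_k_apps, min_interactions, n_users, seed=0):
    """Filter (user_id, app_name, timestamp) tuples in three steps.

    Hypothetical helper; the dataset description uses top_k_apps=8000,
    min_interactions=32 and n_users=10000.
    """
    # Step 1: keep only interactions with the top-k most installed apps.
    app_counts = Counter(app for _, app, _ in interactions)
    top_apps = {app for app, _ in app_counts.most_common(top_k_apps)}
    kept = [row for row in interactions if row[1] in top_apps]

    # Step 2: keep users with MORE than `min_interactions` remaining rows
    # (assumption: the count is taken after the app filter).
    user_counts = Counter(user for user, _, _ in kept)
    eligible = sorted(u for u, c in user_counts.items() if c > min_interactions)

    # Step 3: randomly select at most `n_users` of the eligible users.
    rng = random.Random(seed)
    chosen = set(rng.sample(eligible, min(n_users, len(eligible))))

    # Step 4: keep only the interactions of the selected users.
    return [row for row in kept if row[0] in chosen]


# Made-up toy data, small enough to check by hand.
toy = [
    ("u1", "app.a", 1), ("u1", "app.b", 2), ("u1", "app.a", 3),
    ("u2", "app.a", 4), ("u2", "app.c", 5),
    ("u3", "app.b", 6),
]
result = subsample(toy, top_k_apps=2, min_interactions=1, n_users=1)
```

With these toy settings only `u1` has more than one interaction left after the app filter, so `result` holds that user's three rows.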

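Data Structure

The dataset contains two main files:

  • myket.csv: the interaction data, in the same format as the datasets used by the JODIE project; the state label and interaction feature columns are all zero.
  • app_info_sample.csv: features of the applications in the sample, such as approximate number of installs, average rating, count of ratings, and category.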
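As a rough illustration of what reading a JODIE-style interaction file can look like, the sketch below parses a small in-memory sample. The header names and column order (user, item, timestamp, state label, features) are an assumption based on the public JODIE datasets, and the sample rows are invented; check the header of the real `myket.csv` before relying on this layout.

```python
import csv
import io

# Invented sample in the ASSUMED JODIE-style layout:
# user, item, timestamp, state label, then feature values.
SAMPLE = """user_id,item_id,timestamp,state_label,comma_separated_list_of_features
0,0,0.0,0,0.0
0,1,36.0,0,0.0
1,0,77.0,0,0.0
"""


def read_interactions(handle):
    """Return (user, item, timestamp) triplets, ignoring the all-zero columns."""
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    rows = []
    for record in reader:
        user, item, timestamp = record[0], record[1], float(record[2])
        rows.append((user, item, timestamp))
    return rows


interactions = read_interactions(io.StringIO(SAMPLE))
```

Because the state labels and features are described as all zero in this dataset, the sketch drops them and keeps only the triplet.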

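Dataset Details

  • Total instances: 694,121 install interaction instances
  • Instance format: triplets of user ID, application name, timestamp
  • Users: 10,000; applications: 7,988

Detailed statistics can be found in summary-stats.ipynb.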

Top 20 Most Installed Applications

| Package Name                       | Count of Interactions |
| ---------------------------------- | --------------------- |
| com.instagram.android              | 15292                 |
| ir.resaneh1.iptv                   | 12143                 |
| com.tencent.ig                     | 7919                  |
| com.ForgeGames.SpecialForcesGroup2 | 7797                  |
| ir.nomogame.ClutchGame             | 6193                  |
| com.dts.freefireth                 | 6041                  |
| com.whatsapp                       | 5876                  |
| com.supercell.clashofclans         | 5817                  |
| com.mojang.minecraftpe             | 5649                  |
| com.lenovo.anyshare.gps            | 5076                  |
| ir.medu.shad                       | 4673                  |
| com.firsttouchgames.dls3           | 4641                  |
| com.activision.callofduty.shooter  | 4357                  |
| com.tencent.iglite                 | 4126                  |
| com.aparat                         | 3598                  |
| com.kiloo.subwaysurf               | 3135                  |
| com.supercell.clashroyale          | 2793                  |
| co.palang.QuizOfKings              | 2589                  |
| com.nazdika.app                    | 2436                  |
| com.digikala                       | 2413                  |
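A ranking like the one above is a plain frequency count over the application column. The following is a minimal sketch assuming interactions are available as `(user_id, app_name, timestamp)` tuples; the helper name `top_apps` is hypothetical and the three toy rows are made up, so their counts have nothing to do with the real figures.

```python
from collections import Counter


def top_apps(interactions, n=20):
    """Return the n most frequent app names with their interaction counts."""
    counts = Counter(app for _, app, _ in interactions)
    return counts.most_common(n)


# Made-up toy interactions (not real data).
toy = [
    ("u1", "com.instagram.android", 1),
    ("u2", "com.instagram.android", 2),
    ("u1", "com.whatsapp", 3),
]
ranking = top_apps(toy, n=2)
```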

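Comparison with Other Datasets

| Dataset   | #Users | #Items | #Interactions | Average Interactions per User | Average Unique Items per User |
| --------- | ------ | ------ | ------------- | ----------------------------- | ----------------------------- |
| Myket     | 10,000 | 7,988  | 694,121       | 69.4                          | 54.6                          |
| LastFM    | 980    | 1,000  | 1,293,103     | 1,319.5                       | 158.2                         |
| Reddit    | 10,000 | 984    | 672,447       | 67.2                          | 7.9                           |
| Wikipedia | 8,227  | 1,000  | 157,474       | 19.1                          | 2.2                           |
| MOOC      | 7,047  | 97     | 411,749       | 58.4                          | 25.3                          |

The Myket dataset has comparatively large numbers of both users and items, which makes it relevant to large-scale, real-world applications. Compared with the LastFM, Reddit, and Wikipedia datasets, it contains fewer repeated interactions, reflecting the diversity of user behavior in the Android application market.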
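The two per-user columns in the comparison can be reproduced from raw triplets: the first is the total number of interactions divided by the number of users (for Myket, 694,121 / 10,000 ≈ 69.4), and the second averages the number of distinct items each user touched. The sketch below assumes `(user_id, item, timestamp)` tuples; the helper name `per_user_averages` and the toy rows are hypothetical.

```python
from collections import defaultdict


def per_user_averages(interactions):
    """Return (avg interactions per user, avg unique items per user)."""
    counts = defaultdict(int)
    unique_items = defaultdict(set)
    for user, item, _ in interactions:
        counts[user] += 1
        unique_items[user].add(item)
    n_users = len(counts)
    avg_interactions = sum(counts.values()) / n_users
    avg_unique = sum(len(items) for items in unique_items.values()) / n_users
    return avg_interactions, avg_unique


# Made-up toy data: u1 repeats item "a", u2 has a single interaction.
toy = [("u1", "a", 1), ("u1", "a", 2), ("u1", "b", 3), ("u2", "a", 4)]
avg_interactions, avg_unique = per_user_averages(toy)

# Check of the Myket row using the totals reported in the table.
myket_avg = round(694121 / 10000, 1)
```

A large gap between the two averages, as in the Reddit and Wikipedia rows, indicates that users mostly return to the same items; the Myket row shows a much smaller gap.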

Citation

If you use this dataset in your research, please cite the following preprint:

@misc{loghmani2023effect,
      title={Effect of Choosing Loss Function when Using T-batching for Representation Learning on Dynamic Networks},
      author={Erfan Loghmani and MohammadAmin Fazli},
      year={2023},
      eprint={2308.06862},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

The content above was collected and summarized by 遇见数据集.