
skadio/optimized_item_selection

Source: Hugging Face. Updated 2024-01-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/skadio/optimized_item_selection
Resource description:
# Optimized Item Selection Datasets

We provide the datasets that are used to test the multi-level optimization framework ([CPAIOR'21](https://link.springer.com/chapter/10.1007/978-3-030-78230-6_27), [DSO@IJCAI'22](https://arxiv.org/abs/2112.03105)) for solving the Item Selection Problem (ISP) to boost exploration in Recommender Systems.

The multi-objective optimization framework is implemented in [Selective](https://github.com/fidelity/selective) as part of `TextBased Selection`. By solving the ISP with Text-based Selection in Selective, we select a smaller subset of items with maximum diversity in the latent embedding space of items and maximum coverage of labels.

The datasets are extracted and processed from their original public sources for research purposes as detailed below.

## Overview of Datasets

The datasets include:

* [**GoodReads datasets**](book_recommenders_data/) for book recommenders. Two datasets are randomly selected from the source data [GoodReads Book Reviews](https://dl.acm.org/doi/10.1145/3240323.3240369), a small version with 1000 items and a large version with 10,000 items. For book recommendations, there are 11 different genres (e.g., fiction, non-fiction, children), 231 different publishers (e.g., Vintage, Penguin Books, Mariner Books), and genre-publisher pairs. This leads to 574 and 1,322 unique book labels for the small and large datasets, respectively.

* [**MovieLens datasets**](movie_recommenders_data/) for movie recommenders. Two datasets are randomly selected from the source data [MovieLens Movie Ratings](https://dl.acm.org/doi/10.1145/2827872), a small version with 1000 items and a large version with 10,000 items. For movie recommendations, there are 19 different genres (e.g., action, comedy, drama, romance), 587 different producers, 34 different languages (e.g., English, French, Mandarin), and genre-language pairs. This leads to 473 and 1,011 unique movie labels for the small and large datasets, respectively.

Each dataset in GoodReads and MovieLens contains:

* `*_data.csv` that contains the text content (i.e., title + description) of the items, and
* `*_label.csv` that contains the labels (e.g., genre or language) and a binary 0/1 value denoting whether an item exhibits a label.

Each column in the csv file is for an item, indexed by book/movie ID. The order of columns in the data and label files is the same.

## Quick Start

To run the example, install the required packages with `pip install selective datasets`.

```python
# Import Selective (for text-based selection) and TextWiser (for embedding space)
import pandas as pd
from datasets import load_dataset
from textwiser import TextWiser, Embedding, Transformation
from feature.selector import Selective, SelectionMethod

# Load Text Contents
data = load_dataset('skadio/optimized_item_selection',
                    data_files='book_recommenders_data/goodreads_1k_data.csv',
                    split='train')
data = data.to_pandas()

# Load Labels
labels = load_dataset('skadio/optimized_item_selection',
                      data_files='book_recommenders_data/goodreads_1k_label.csv',
                      split='train')
labels = labels.to_pandas()
labels.set_index('label', inplace=True)

# TextWiser featurization method to create text embeddings
textwiser = TextWiser(Embedding.TfIdf(),
                      Transformation.NMF(n_components=20, random_state=1234))

# Text-based selection with the default configuration
# The default configuration is optimization_method="exact" and cost_metric="diverse"
# By default, multi-level optimization maximizes coverage and diversity as described in (CPAIOR'21, DSO@IJCAI'22)
# within an upper bound on subset size given as num_features
selector = Selective(SelectionMethod.TextBased(num_features=30,
                                               featurization_method=textwiser))

# Result
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))
```

## Advanced Usages

Text-based Selection provides access to multiple selection methods. At a high level, the configurations can be divided into exact, randomized, greedy, or cluster-based optimization.

### Exact

- (Default) Solve for Problem *P_max_cover@t* in **CPAIOR'21** - Selecting a subset of items that maximizes coverage of labels and maximizes the diversity in latent embedding space within an upper bound on subset size.

```python
selector = Selective(SelectionMethod.TextBased(num_features=30,
                                               featurization_method=textwiser,
                                               optimization_method='exact',
                                               cost_metric='diverse'))
```

- Solve for Problem *P_unicost* in **CPAIOR'21** - Selecting a subset of items that covers all labels.

```python
selector = Selective(SelectionMethod.TextBased(num_features=None,
                                               optimization_method='exact',
                                               cost_metric='unicost'))
```

- Solve for Problem *P_diverse* in **CPAIOR'21** - Selecting a subset of items with maximized diversity in the latent embedding space while still maintaining the coverage over all labels.

```python
selector = Selective(SelectionMethod.TextBased(num_features=None,
                                               featurization_method=textwiser,
                                               optimization_method='exact',
                                               cost_metric='diverse'))
```

- Selecting a subset of items that only maximizes coverage within an upper bound on subset size.

```python
selector = Selective(SelectionMethod.TextBased(num_features=30,
                                               optimization_method='exact',
                                               cost_metric='unicost'))
```

### Randomized

- Selecting a subset by performing random selection. If `num_features` is not set, subset size is defined by solving *P_unicost*.

```python
selector = Selective(SelectionMethod.TextBased(num_features=None,
                                               optimization_method='random'))
```

- Selecting a subset by performing random selection. Subset size is defined by `num_features`.

```python
selector = Selective(SelectionMethod.TextBased(num_features=30,
                                               optimization_method='random'))
```

### Greedy

- Selecting a subset by adding one item at a time using a greedy heuristic with a given `cost_metric`, i.e. `diverse` by default or `unicost`. If `num_features` is not set, subset size is defined by solving *P_unicost*.

```python
selector = Selective(SelectionMethod.TextBased(num_features=None,
                                               optimization_method='greedy',
                                               cost_metric='unicost'))
```

- Selecting a subset by adding one item at a time using a greedy heuristic with a given `cost_metric`, i.e. `diverse` by default or `unicost`.

```python
selector = Selective(SelectionMethod.TextBased(num_features=30,
                                               optimization_method='greedy',
                                               cost_metric='unicost'))
```

### Clustering

- Selecting a subset by clustering items into a number of clusters and selecting the items close to the centroids. If `num_features` is not set, subset size is defined by solving *P_unicost*. The `cost_metric` argument is not used in this method.

```python
selector = Selective(SelectionMethod.TextBased(num_features=None,
                                               optimization_method='kmeans'))
```

- Selecting a subset by clustering items into a number of clusters and selecting the items close to the centroids. The `cost_metric` argument is not used in this method.

```python
selector = Selective(SelectionMethod.TextBased(num_features=30,
                                               optimization_method='kmeans'))
```

## Citation

If you use ISP in your research/applications, please cite as follows:

```bibtex
@inproceedings{cpaior2021,
  title={Optimized Item Selection to Boost Exploration for Recommender Systems},
  author={Serdar Kadıoğlu and Bernard Kleynhans and Xin Wang},
  booktitle={Proceedings of Integration of Constraint Programming, Artificial Intelligence, and Operations Research: 18th International Conference, CPAIOR 2021, Vienna, Austria, July 5–8, 2021},
  url={https://doi.org/10.1007/978-3-030-78230-6_27},
  pages={427–445},
  year={2021}
}
```

```bibtex
@inproceedings{ijcai2022,
  title={Active Learning Meets Optimized Item Selection},
  author={Bernard Kleynhans and Xin Wang and Serdar Kadıoğlu},
  booktitle={The IJCAI-22 Workshop: Data Science meets Optimisation},
  publisher={arXiv},
  url={https://arxiv.org/abs/2112.03105},
  year={2022}
}
```
Provided by:
skadio
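Summary of Original Information

Optimized Item Selection Datasets

Dataset Overview

The datasets include:

  • GoodReads datasets: for book recommenders. Two datasets are randomly selected from the source data GoodReads Book Reviews, a small version with 1000 items and a large version with 10,000 items. For book recommendations, there are 11 different genres (e.g., fiction, non-fiction, children), 231 different publishers (e.g., Vintage, Penguin Books, Mariner Books), and genre-publisher pairs. This leads to 574 and 1,322 unique book labels for the small and large datasets, respectively.

  • MovieLens datasets: for movie recommenders. Two datasets are randomly selected from the source data MovieLens Movie Ratings, a small version with 1000 items and a large version with 10,000 items. For movie recommendations, there are 19 different genres (e.g., action, comedy, drama, romance), 587 different producers, 34 different languages (e.g., English, French, Mandarin), and genre-language pairs. This leads to 473 and 1,011 unique movie labels for the small and large datasets, respectively.

Each dataset contains:

  • *_data.csv: contains the text content (i.e., title + description) of the items,
  • *_label.csv: contains the labels (e.g., genre or language) and a binary 0/1 value denoting whether an item exhibits a label.

Each column in the csv file corresponds to an item, indexed by book/movie ID. The order of columns in the data and label files is the same.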
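The file layout described above can be checked with a few lines of pandas. The sketch below builds two tiny stand-in frames instead of loading the real files; the item IDs (`101`, `102`, `103`), the texts, and the label names are made up for illustration, and the `label` index column is assumed to match the one used in the Quick Start example.

```python
import pandas as pd

# Stand-in for *_data.csv: one column per item, holding title + description.
# Item IDs and texts are hypothetical.
data = pd.DataFrame({
    "101": ["Book A. A short description of book A."],
    "102": ["Book B. A short description of book B."],
    "103": ["Book C. A short description of book C."],
})

# Stand-in for *_label.csv: one row per label, one 0/1 column per item.
labels = pd.DataFrame({
    "label": ["fiction", "children"],
    "101": [1, 0],
    "102": [0, 1],
    "103": [1, 1],
}).set_index("label")

# The two files should list items in the same column order.
aligned = list(data.columns) == list(labels.columns)

# How many items carry each label, and how many labels each item carries.
items_per_label = labels.sum(axis=1)
labels_per_item = labels.sum(axis=0)

print("Columns aligned:", aligned)
print(items_per_label.to_dict())
print(labels_per_item.to_dict())
```

Running the same alignment check on the real frames before calling a selector is a cheap way to catch a mismatched pair of files.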

Quick Start

To run the example, install the required packages: pip install selective datasets

    # Import Selective (for text-based selection) and TextWiser (for embedding space)
    import pandas as pd
    from datasets import load_dataset
    from textwiser import TextWiser, Embedding, Transformation
    from feature.selector import Selective, SelectionMethod

    # Load text contents
    data = load_dataset('skadio/optimized_item_selection', data_files='book_recommenders_data/goodreads_1k_data.csv', split='train')
    data = data.to_pandas()

    # Load labels
    labels = load_dataset('skadio/optimized_item_selection', data_files='book_recommenders_data/goodreads_1k_label.csv', split='train')
    labels = labels.to_pandas()
    labels.set_index('label', inplace=True)

    # TextWiser featurization method to create text embeddings
    textwiser = TextWiser(Embedding.TfIdf(), Transformation.NMF(n_components=20, random_state=1234))

    # Text-based selection with the default configuration
    # The default configuration is optimization_method="exact" and cost_metric="diverse"
    # By default, multi-level optimization maximizes coverage and diversity, as described in
    # (CPAIOR'21, DSO@IJCAI'22), within an upper bound on subset size given as num_features
    selector = Selective(SelectionMethod.TextBased(num_features=30, featurization_method=textwiser))

    # Result
    subset = selector.fit_transform(data, labels)
    print("Reduction:", list(subset.columns))
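Once a subset has been selected, a natural follow-up is to measure how many labels it covers. The helper below is a minimal sketch, not part of Selective: `label_coverage` is a hypothetical name, and it assumes the label frame has one row per label and one 0/1 column per item, as described for `*_label.csv`. The toy item IDs are made up.

```python
import pandas as pd

def label_coverage(labels: pd.DataFrame, selected: list) -> float:
    """Fraction of labels exhibited by at least one selected item."""
    if not selected:
        return 0.0
    # A label is covered if any selected item has a 1 in that label's row.
    covered = labels[selected].max(axis=1) > 0
    return float(covered.mean())

# Toy label matrix: rows are labels, columns are (hypothetical) item IDs.
toy_labels = pd.DataFrame(
    {"a": [1, 0, 1], "b": [0, 1, 0], "c": [1, 1, 0]},
    index=["fiction", "children", "vintage"],
)

print(label_coverage(toy_labels, ["a"]))
print(label_coverage(toy_labels, ["a", "b"]))
```

Under the assumption that the columns of the returned `subset` are item IDs also present in `labels`, a call such as `label_coverage(labels, list(subset.columns))` would report the coverage achieved by the Quick Start selection.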

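Advanced Usage

Text-based Selection provides access to multiple selection methods.

At a high level, the configurations can be divided into exact, randomized, greedy, or cluster-based optimization.

Exact

  • (Default) Solve for Problem P_max_cover@t in CPAIOR'21: select a subset of items that maximizes coverage of labels and maximizes the diversity in the latent embedding space within an upper bound on subset size.

    selector = Selective(SelectionMethod.TextBased(num_features=30, featurization_method=textwiser, optimization_method='exact', cost_metric='diverse'))

  • Solve for Problem P_unicost in CPAIOR'21: select a subset of items that covers all labels.

    selector = Selective(SelectionMethod.TextBased(num_features=None, optimization_method='exact', cost_metric='unicost'))

  • Solve for Problem P_diverse in CPAIOR'21: select a subset of items with maximized diversity in the latent embedding space while still maintaining coverage over all labels.

    selector = Selective(SelectionMethod.TextBased(num_features=None, featurization_method=textwiser, optimization_method='exact', cost_metric='diverse'))

  • Select a subset of items that only maximizes coverage within an upper bound on subset size.

    selector = Selective(SelectionMethod.TextBased(num_features=30, optimization_method='exact', cost_metric='unicost'))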
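To make the "covers all labels" objective concrete, the sketch below finds a smallest covering subset on a toy instance by brute force. This is only an illustration of the covering idea and is not how Selective solves the problem; the function name `smallest_cover`, the item IDs, and the labels are all hypothetical, and exhaustive enumeration is only feasible for a handful of items.

```python
from itertools import combinations

def smallest_cover(item_labels: dict) -> list:
    """Return a smallest set of items whose labels cover every label (brute force)."""
    all_labels = set().union(*item_labels.values())
    items = sorted(item_labels)
    # Try subset sizes in increasing order so the first hit is a minimum cover.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            covered = set().union(*(item_labels[i] for i in combo))
            if covered == all_labels:
                return list(combo)
    return items

# Toy instance: each (hypothetical) item maps to the labels it exhibits.
toy_items = {
    "a": {"fiction", "vintage"},
    "b": {"children"},
    "c": {"fiction", "children"},
    "d": {"vintage"},
}

print(smallest_cover(toy_items))
```

No single item covers all three labels here, so the minimum cover has two items; several two-item covers exist and the function returns the first one in lexicographic order.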

Randomized

  • Select a subset by performing random selection. If num_features is not set, subset size is defined by solving P_unicost.

    selector = Selective(SelectionMethod.TextBased(num_features=None, optimization_method='random'))

  • Select a subset by performing random selection. Subset size is defined by num_features.

    selector = Selective(SelectionMethod.TextBased(num_features=30, optimization_method='random'))
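A random baseline of this kind is simple to reproduce outside the library, which can be handy for sanity checks. The sketch below uses only the standard library; `random_select` and its `seed` parameter are hypothetical names chosen for illustration and do not reflect Selective's internals.

```python
import random

def random_select(items, k: int, seed: int = 0) -> list:
    """Pick k items uniformly at random, reproducibly for a fixed seed."""
    pool = list(items)
    rng = random.Random(seed)
    # Never ask for more items than are available.
    return sorted(rng.sample(pool, min(k, len(pool))))

item_ids = [f"item_{i}" for i in range(10)]  # hypothetical item IDs
picked = random_select(item_ids, k=3, seed=42)
print(picked)
```

Fixing the seed keeps the baseline comparable across runs when it is set against the exact or greedy selections.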

Greedy

  • Select a subset by adding one item at a time using a greedy heuristic with a given cost_metric, i.e. diverse by default, or unicost. If num_features is not set, subset size is defined by solving P_unicost.

    selector = Selective(SelectionMethod.TextBased(num_features=None, optimization_method='greedy', cost_metric='unicost'))

  • Select a subset by adding one item at a time using a greedy heuristic with a given cost_metric, i.e. diverse by default, or unicost.

    selector = Selective(SelectionMethod.TextBased(num_features=30, optimization_method='greedy', cost_metric='unicost'))
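The greedy idea can be illustrated with the classic unit-cost covering heuristic: at each step, add the item that covers the most labels not yet covered. The sketch below is an assumption about what such a heuristic might look like, not Selective's implementation; `greedy_cover`, the item IDs, and the labels are hypothetical, and the diversity-aware cost is not modelled.

```python
def greedy_cover(item_labels: dict, k=None) -> list:
    """Greedily add the item covering the most uncovered labels.

    Stops when every label is covered or, if k is given, after k items.
    """
    uncovered = set().union(*item_labels.values())
    chosen = []
    while uncovered and (k is None or len(chosen) < k):
        candidates = sorted(set(item_labels) - set(chosen))
        if not candidates:
            break
        # Ties are broken by item ID because candidates are sorted.
        best = max(candidates, key=lambda i: len(item_labels[i] & uncovered))
        if not item_labels[best] & uncovered:
            break  # no remaining item adds coverage
        chosen.append(best)
        uncovered -= item_labels[best]
    return chosen

toy_items = {
    "a": {"fiction", "vintage"},
    "b": {"children"},
    "c": {"fiction", "children"},
    "d": {"vintage"},
}

print(greedy_cover(toy_items))
print(greedy_cover(toy_items, k=1))
```

With no size bound the loop runs until all labels are covered; with `k=1` it stops after the single best item, mirroring the two configurations listed above.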

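Clustering

  • Select a subset by clustering items into a number of clusters and selecting the items close to the centroids. If num_features is not set, subset size is defined by solving P_unicost. The cost_metric argument is not used in this method.

    selector = Selective(SelectionMethod.TextBased(num_features=None, optimization_method='kmeans'))

  • Select a subset by clustering items into a number of clusters and selecting the items close to the centroids. The cost_metric argument is not used in this method.

    selector = Selective(SelectionMethod.TextBased(num_features=30, optimization_method='kmeans'))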
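Centroid-based selection can be sketched with a small NumPy-only k-means: cluster the item embeddings, then keep the item nearest to each centroid. This is a simplified stand-in written for illustration, not the routine Selective uses; `kmeans_select`, its parameters, and the toy 2-D embeddings are all assumptions.

```python
import numpy as np

def kmeans_select(X: np.ndarray, k: int, n_iter: int = 20, seed: int = 0) -> list:
    """Run plain Lloyd's k-means and return the index of the item nearest each centroid."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct items.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every item to every centroid, shape (n_items, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = X[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # For each centroid, take the closest item; duplicates collapse.
    return sorted({int(i) for i in dists.argmin(axis=0)})

# Toy embeddings: two well-separated groups of three items each.
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

selected = kmeans_select(embeddings, k=2)
print(selected)
```

On this toy input the two selected indices fall in different groups, which is the behaviour the cluster-based option aims for: one representative per region of the embedding space.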
