Transparency in Keyword Faceted Search: a dataset of Google Shopping html pages
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/1491556
Description:
This dataset contains a collection of around 2,000 HTML pages. These pages contain the search results returned for queries on different products, issued by a set of synthetic users browsing Google Shopping (US version) from different locations in July 2016.
Each file name indicates the location from which the search was made, the user ID, and the searched product: no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
The locations are Philippines (PHI), United States (US), and India (IN). The user IDs are 26 to 30 for users searching from the Philippines, 1 to 5 from the US, and 11 to 15 from India.
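The naming scheme above can be split into its fields with a regular expression. The following is a minimal sketch, not part of the dataset tooling: it assumes that the `#` placeholder is a numeric counter and that the product keyword may contain arbitrary characters. The names `FILENAME_RE`, `PageInfo`, and `parse_filename` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the documented scheme:
#   no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
# Assumption: "#" is a non-negative integer counter.
FILENAME_RE = re.compile(
    r"^no_email_(?P<location>PHI|US|IN)_(?P<user_id>\d+)\."
    r"(?P<product>.+)\.shopping_testing\.(?P<index>\d+)\.html$"
)


class PageInfo(NamedTuple):
    location: str   # PHI, US, or IN
    user_id: int    # 26-30 (PHI), 1-5 (US), 11-15 (IN)
    product: str    # searched keyword as it appears in the file name
    index: int      # trailing counter (assumed numeric)


def parse_filename(name: str) -> Optional[PageInfo]:
    """Return the fields encoded in a file name, or None if it does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return PageInfo(m["location"], int(m["user_id"]), m["product"], int(m["index"]))
```

For a made-up file name such as `no_email_US_3.Television.shopping_testing.1.html`, `parse_filename` would return the location `US`, user ID 3, product `Television`, and counter 1. File names that do not follow the scheme yield `None`, so they can be skipped or logged.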
Products were chosen according to 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television, etc.).
In the following, we describe how the search results have been collected.
Each user has a fresh profile. Creating a new profile corresponds to launching a new, isolated web browser client instance and opening the Google Shopping US web page.
To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links.
A fully-fledged web browser is used to get the correct desktop version of the website under investigation. This is because websites may be designed to behave differently depending on the user agent, as shown by the differences between the mobile and desktop versions of the same website.
Prices are the retail prices displayed by Google Shopping in US dollars, excluding shipping fees.
Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM, which is automated with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each with its own associated cookies.
Each experiment runs for 24 hours on average. The software runs on our local server, but the browser traffic is redirected to the designated remote servers (e.g., to India) by tunneling through SOCKS proxies. This way, all commands are distributed simultaneously over all proxies. The experiments use the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04. For each query, we consider the first page of results, which contains 40 products. Among them, the experiments focus mostly on the top 10 and top 3 results.
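As an illustration of the proxy setup described above, the sketch below builds the Firefox preferences that route browser traffic through a SOCKS proxy. It is only a sketch under stated assumptions and does not reproduce the actual OpenWPM configuration used in the experiments. The preference keys are standard Firefox settings. The helper name `socks_proxy_prefs`, the choice of SOCKS version 5, and the remote DNS setting are assumptions.

```python
from typing import Dict, Union

PrefValue = Union[str, int, bool]


def socks_proxy_prefs(host: str, port: int) -> Dict[str, PrefValue]:
    """Firefox preferences that send browser traffic through a SOCKS proxy.

    Assumptions (not stated in the dataset description): SOCKS version 5,
    and DNS lookups resolved on the proxy side so that name resolution
    also appears to come from the remote location.
    """
    return {
        "network.proxy.type": 1,               # 1 = manual proxy configuration
        "network.proxy.socks": host,           # proxy host, e.g. a local tunnel endpoint
        "network.proxy.socks_port": port,      # proxy port
        "network.proxy.socks_version": 5,      # assumed SOCKS version
        "network.proxy.socks_remote_dns": True,  # assumed: resolve DNS via the proxy
    }
```

Each synthetic user would get its own preference set, for example one pointing at the local end of a tunnel toward the Indian server. The host and port for each location are deployment-specific and are not given in the description.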
Due to connection errors, one of the Philippine profiles has no associated results. Also, for the Philippines, a few keywords did not lead to any results: videocassette recorders, totes, umbrellas. Similarly, for the US, no results were returned for totes and umbrellas.
The search results have been analyzed to check whether there was evidence of price steering based on users' location.
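One simple way to look for location-based differences is to compare the mean price of the top-ranked results across locations for the same keyword. The sketch below is illustrative only and is not the metric used in the cited paper. It assumes that prices have already been extracted from the HTML pages in ranking order, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def top_k_mean(prices: List[float], k: int) -> float:
    """Mean price of the first k results (or of all results if fewer than k)."""
    top = prices[:k]
    if not top:
        raise ValueError("no prices available for this query")
    return mean(top)


def compare_locations(prices_by_location: Dict[str, List[float]], k: int) -> Dict[str, float]:
    """Mean top-k price per location for a single keyword.

    prices_by_location maps a location code (PHI, US, IN) to the prices
    of its result page, listed in ranking order.
    """
    return {loc: top_k_mean(prices, k) for loc, prices in prices_by_location.items()}
```

Calling `compare_locations` with `k=3` and `k=10` mirrors the focus on the top 3 and top 10 results. Differences between locations would then need a proper statistical test before being read as evidence of steering.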
One term of use applies:
In any research product whose findings are based on this dataset, please cite:
@inproceedings{DBLP:conf/ircdl/CozzaHPN19,
author = {Vittoria Cozza and
Van Tien Hoang and
Marinella Petrocchi and
Rocco {De Nicola}},
title = {Transparency in Keyword Faceted Search: An Investigation on Google
Shopping},
booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research
Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January
31 - February 1, 2019, Proceedings},
pages = {29--43},
year = {2019},
crossref = {DBLP:conf/ircdl/2019},
url = {https://doi.org/10.1007/978-3-030-11226-4\_3},
doi = {10.1007/978-3-030-11226-4\_3},
timestamp = {Fri, 18 Jan 2019 23:22:50 +0100},
biburl = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Created:
2020-01-24



