
Transparency in Keyword Faceted Search: a dataset of Google Shopping html pages

Mendeley Data. Updated 2024-03-27; indexed 2024-06-27.
Download link:
https://zenodo.org/record/1491557
Description:
This dataset contains a collection of around 2,000 HTML pages. The pages hold the search results returned for queries on different products, issued in July 2016 by a set of synthetic users browsing Google Shopping (US version) from different locations. Each file name indicates the location from which the search was made, the user ID, and the searched product:

no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html

The locations are Philippines (PHI), United States (US), and India (IN). The user IDs are 26 to 30 for users searching from the Philippines, 1 to 5 for the US, and 11 to 15 for India. Products were chosen from a list of 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television).

The search results were collected as follows. Each user has a fresh profile. Creating a new profile means launching a new, isolated web browser client instance and opening the Google Shopping US web page. To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links. A fully fledged web browser is used to get the correct desktop version of the website under investigation, because websites may behave differently depending on the user agent, as shown by the differences between the mobile and desktop versions of the same website. The prices are the retail prices displayed by Google Shopping in US dollars (thus excluding shipping fees).

Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM, which is automated with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each with its own associated cookies. The experiments run, on average, for 24 hours. In each of them, the software runs on our local server, but the browser's traffic is redirected to the designated remote servers (e.g., in India) by tunneling through SOCKS proxies. This way, all commands are distributed simultaneously over all proxies. The experiments use the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04.

For each query, we consider the first page of results, which lists 40 products. Among them, the experiments focus mostly on the top 10 and top 3 results. Due to connection errors, one of the Philippine profiles has no associated results. In addition, for the Philippines a few keywords did not return any results: videocassette recorders, totes, umbrellas. Similarly, for the US no results were returned for totes and umbrellas.

The search results were analyzed to check whether there was evidence of price steering based on users' location.

One term of usage applies: in any research product whose findings are based on this dataset, please cite

@inproceedings{DBLP:conf/ircdl/CozzaHPN19,
  author    = {Vittoria Cozza and Van Tien Hoang and Marinella Petrocchi and Rocco {De Nicola}},
  title     = {Transparency in Keyword Faceted Search: An Investigation on Google Shopping},
  booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January 31 - February 1, 2019, Proceedings},
  pages     = {29--43},
  year      = {2019},
  crossref  = {DBLP:conf/ircdl/2019},
  url       = {https://doi.org/10.1007/978-3-030-11226-4\_3},
  doi       = {10.1007/978-3-030-11226-4\_3},
  timestamp = {Fri, 18 Jan 2019 23:22:50 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
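The file-naming scheme in the description can be turned into a small parser when indexing the collection. The following is a minimal sketch, not part of the dataset's tooling: the names `parse_filename`, `PageInfo`, and `USER_ID_RANGES` are hypothetical, and it assumes that the `#` placeholder is a decimal number and that location codes are upper-case letters. How product keywords are spelled inside real file names (spaces, underscores, case) is not stated in the description, so the example names used with it are illustrative only.

```python
import re
from typing import NamedTuple, Optional

# Pattern from the description:
#   no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
# Assumption: "#" is a decimal number; LOCATION is upper-case letters.
FILENAME_RE = re.compile(
    r"^no_email_(?P<location>[A-Z]+)_(?P<user_id>\d+)"
    r"\.(?P<product>.+)\.shopping_testing\.(?P<index>\d+)\.html$"
)

# User ID ranges per location, as listed in the description.
USER_ID_RANGES = {
    "PHI": range(26, 31),  # Philippines: 26 to 30
    "US": range(1, 6),     # United States: 1 to 5
    "IN": range(11, 16),   # India: 11 to 15
}


class PageInfo(NamedTuple):
    location: str
    user_id: int
    product: str
    index: int


def parse_filename(name: str) -> Optional[PageInfo]:
    """Return the parsed fields, or None if the name does not fit the scheme."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    location = match.group("location")
    user_id = int(match.group("user_id"))
    # Reject unknown locations and IDs outside the documented range.
    if user_id not in USER_ID_RANGES.get(location, ()):
        return None
    return PageInfo(location, user_id, match.group("product"), int(match.group("index")))
```

Returning `None` for names that do not match, rather than raising, makes it easy to filter a directory listing and count how many files fall outside the documented scheme.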
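Routing a Firefox instance through a SOCKS proxy, as described for the remote locations, comes down to a handful of standard Firefox preferences. The sketch below only builds that preference mapping; it is not OpenWPM's configuration API and does not reproduce the authors' setup. The function name `socks_proxy_prefs` is hypothetical, and SOCKS version 5 with proxy-side DNS resolution are assumptions, since the description says only "SOCKS proxies".

```python
from typing import Dict, Union

PrefValue = Union[str, int, bool]


def socks_proxy_prefs(host: str, port: int) -> Dict[str, PrefValue]:
    """Build Firefox preferences that send browser traffic through a SOCKS proxy."""
    if not host:
        raise ValueError("proxy host must be non-empty")
    if not 0 < port < 65536:
        raise ValueError("proxy port must be in 1..65535")
    return {
        "network.proxy.type": 1,                 # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_version": 5,        # assumption: SOCKS5
        "network.proxy.socks_remote_dns": True,  # assumption: resolve DNS via the proxy
    }
```

With Selenium, each entry could be applied to a Firefox options or profile object through its `set_preference(name, value)` method before the browser is started, so that every synthetic user gets its own proxy endpoint.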

Created:
2023-06-28