
A Novel Algorithm for Estimating Web Page Ranking in Search Engine Results Pages

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/8358447
Resource description:
Abstract: Search engine optimization (SEO) can substantially increase traffic to a web page. Because search engines do not disclose their main ranking rules, it is important to develop models that estimate how a web page will rank, so that pages can be optimized to rank higher. Existing research has applied machine learning algorithms to this problem, using datasets generated by scraping search engine results pages (SERP) and crawling web pages. The resulting models could not be updated dynamically when the search engine changed its ranking algorithm, and their input data did not cover a diverse range of web pages and languages. This research proposes a novel rank estimation algorithm that overcomes these limitations, supported by a set of comparative experiments and a complexity analysis. The results show that the proposed algorithm can achieve higher accuracy, precision, and recall.

Dataset: For research purposes, the dataset plays two roles: it stands in for the search engine results pages (SERP), and it is used to test algorithms and calculate performance measurements. The dataset consists of 9930 web pages and is intended for identifying which results page a web page appears on, focusing on the top 3 pages of the SERP, with 31 extracted attributes related to search engine optimization (SEO). The distribution of examples across class labels is roughly balanced, with some deviation caused by scraping issues: 39.9%, 34.6%, and 25.5% for the class labels page1, page2, and page3.
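The reported class fractions can be checked against the stated dataset size with a few lines of Python. This is only a sketch: the helper name `class_fractions` and the toy label list are hypothetical, and the per-class counts are approximations derived from the rounded percentages above, not figures taken from the dataset itself.

```python
from collections import Counter

def class_fractions(labels):
    """Return each label's share of the examples as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels, used only to show the helper; not the real data.
toy_labels = ["page1"] * 4 + ["page2"] * 3 + ["page3"] * 3
toy_fractions = class_fractions(toy_labels)

# Figures quoted in the dataset description.
TOTAL_PAGES = 9930
REPORTED_FRACTIONS = {"page1": 0.399, "page2": 0.346, "page3": 0.255}

# Approximate per-class counts implied by the rounded percentages.
approx_counts = {
    label: round(TOTAL_PAGES * fraction)
    for label, fraction in REPORTED_FRACTIONS.items()
}

for label, count in approx_counts.items():
    print(f"{label}: about {count} pages")
```

Applying `class_fractions` to the real label column should reproduce the reported percentages up to rounding.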
Feature names are: 'Title 1 Length', 'Title 2 Length', 'Meta Description 1 Length', 'Meta Description 2 Length', 'Meta Keywords 1 Length', 'H1-1 Length', 'H1-2 Length', 'H2-1 Length', 'H2-2 Length', 'Size (bytes)', 'Word Count', 'Text Ratio', 'Inlinks', 'Unique Inlinks', 'Unique JS Inlinks', '% of Total', 'Outlinks', 'Unique Outlinks', 'Unique JS Outlinks', 'External Outlinks', 'Unique External Outlinks', 'Unique External JS Outlinks', 'Response Time', 'Status Code', 'Keyword in MetaDescription1', 'Keyword in Title1', 'Keyword in MetaKeywords1', 'Keyword in URL', 'Has LastModified', 'Keyword in Headers', and 'Keyword in Emphasized Text'.

Dataset generation involved scraping the search engine, extracting URLs for the selected keywords, extracting features, cleaning and preprocessing, and generating new attributes related to keywords in web pages. It also involved removing missing values and duplicates and converting data types to obtain a comprehensive dataset. Keywords were selected from various categories with diversity in mind, including high-traffic and low-traffic, long-term and short-term, and generic and branded keywords. The Apify online tool was used to scrape the search engine with the default language and the country set to the US, resulting in 388 selected keywords with 30 results per keyword. SEO features were extracted from 9991 web pages using the Screaming Frog desktop software and the RapidMiner desktop software, to determine each page's SEO-friendliness and compare it to its SERP ranking. Dataset cleaning involved removing redundant attributes, removing paid SERP results, replacing missing values, and converting data types. RapidMiner was used for data cleaning and preprocessing and for generating the new attributes related to keyword usage in web pages.
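The description does not say how the keyword-related attributes were computed in RapidMiner. As an illustration only, the sketch below derives six of the listed binary attributes by case-insensitive phrase matching. The matching rule, the function names, and the layout of the `page` dictionary are all assumptions, not the authors' method.

```python
import re

def _normalize(text):
    """Lowercase and collapse each run of non-word characters into one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def contains_keyword(keyword, text):
    """True if the normalized keyword phrase occurs in the normalized text."""
    kw = _normalize(keyword)
    # Padding with spaces makes the match respect word boundaries.
    return bool(kw) and f" {kw} " in f" {_normalize(text)} "

def keyword_flags(keyword, page):
    """Build keyword attributes for one page.

    `page` is a hypothetical dict with the keys 'title', 'meta_description',
    'meta_keywords', 'url', 'headers' (list), and 'emphasized' (list).
    """
    return {
        "Keyword in Title1": contains_keyword(keyword, page.get("title", "")),
        "Keyword in MetaDescription1": contains_keyword(
            keyword, page.get("meta_description", "")
        ),
        "Keyword in MetaKeywords1": contains_keyword(
            keyword, page.get("meta_keywords", "")
        ),
        "Keyword in URL": contains_keyword(keyword, page.get("url", "")),
        "Keyword in Headers": any(
            contains_keyword(keyword, h) for h in page.get("headers", [])
        ),
        "Keyword in Emphasized Text": any(
            contains_keyword(keyword, e) for e in page.get("emphasized", [])
        ),
    }

# Made-up example page, for illustration only.
example_page = {
    "title": "Best Running Shoes of 2023",
    "meta_description": "Compare trail shoes.",
    "meta_keywords": "",
    "url": "https://example.com/best-running-shoes",
    "headers": ["Top picks"],
    "emphasized": ["running shoes"],
}
flags = keyword_flags("running shoes", example_page)
```

Exact phrase matching is the strictest reasonable choice. A looser rule, such as requiring every keyword token to appear somewhere in the field, would flag more pages and may be closer to or further from what the original pipeline did.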
Created:
2023-09-20