
goker/comp-serp-data-v2

Source: Hugging Face. Updated 2026-02-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/goker/comp-serp-data-v2
Resource description:
---
license: mit
language:
- en
pretty_name: "Comprehensive SERP Data (v2)"
tags:
- serp
- search-engines
- information-retrieval
- web-scraping
- reproducibility
---

# Comprehensive SERP Data (v2)

A reproducible dataset for studying how news-derived search terms propagate through multiple search engines, and how technical, content, and accessibility signals vary across ranked results.

This repository contains:

- the query/keyword seeds,
- SERP indexing outputs,
- the derived feature dataset used for analysis,
- and documentation reports describing acceptance, outliers, and dataset health.

## What’s inside (high level)

### Core inputs (analysis seeds)

- `keywords.csv`: news headline–derived search terms and related metadata (title, url, section, pub_date, word_count, text_saved_path). (Report: ~3.3k rows; date range spans 2025-10 to 2026-01.)

### Index & derived data products

- `index.parquet`: indexed SERP results (record-level index; includes engine, rank, url, collection metadata). Check the [`index_stats.md`](data/reports/index_stats.md) for more details.
- `dataset-20260222_141303.parquet`: final derived dataset used in analysis. The dataset was generated from `index.parquet` with an acceptance configuration and optional outlier handling; see `reports/` for full details.

## Dataset snapshot (this release)

- Final dataset file: `dataset-20260222_141303.parquet`
- Generated: 2026-02-22
- Total records (final): 86,563
- Engines: Google, Brave, Mojeek
- Rank range: 1–20

### Acceptance / filtering overview

The final dataset is produced by applying an acceptance policy (e.g., status=ok, HTTP 200, minimum content, required similarity fields, and required runtime/accessibility metrics).
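
As a rough illustration of what such an acceptance policy could look like, here is a minimal sketch in plain Python. It is not the pipeline's actual implementation: the field names (`status`, `http_status`, `content`, `similarity`, `runtime_metrics`, `accessibility_metrics`) and the `MIN_CONTENT_CHARS` threshold are hypothetical placeholders chosen for this example, and the real configuration is the one documented in the generation report.

```python
from typing import Any, Iterable, List, Mapping

# Hypothetical field names and threshold, used for illustration only.
# The actual acceptance configuration lives in the generation report.
REQUIRED_FIELDS = ("similarity", "runtime_metrics", "accessibility_metrics")
MIN_CONTENT_CHARS = 200


def is_accepted(record: Mapping[str, Any]) -> bool:
    """Return True if a record passes every acceptance check."""
    # Collection must have completed successfully.
    if record.get("status") != "ok":
        return False
    # Only pages served with HTTP 200 are kept.
    if record.get("http_status") != 200:
        return False
    # Require a minimum amount of extracted content.
    if len(record.get("content") or "") < MIN_CONTENT_CHARS:
        return False
    # Similarity, runtime, and accessibility fields must all be present.
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def filter_records(records: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep only the records that satisfy the acceptance policy."""
    return [record for record in records if is_accepted(record)]
```

Writing the policy as a single predicate keeps each rule easy to audit and makes it simple to count how many records each check rejects.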
See:

- [`reports/dataset-20260222_141303.md`](data/reports/dataset-20260222_141303.md) (generation report)

### Outliers & consistency

Outlier detection and extraction consistency checks are summarized in:

- [`reports/outlier_report.md`](data/reports/outlier_report.md)

## Repository structure (recommended)

```
├── data/
│   ├── keywords.csv
│   ├── index.parquet
│   └── dataset-20260222_141303.parquet
├── raw/
│   ├── scraping/
│   ├── serp/
└── reports/
    ├── dataset-20260222_141303.md
    ├── extraction_*.md
    ├── index_stats.md
    └── outlier_report.md
```

## How to use

Our collection, extraction, and analysis pipeline is available in the [serp-profiler-kit](https://github.com/gokerDev/serp-profiler-kit) repository.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
Provider:
goker