shebatec/domain-resurrect
Source: Hugging Face. Updated: 2025-12-20. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/shebatec/domain-resurrect
Resource description:
---
dataset_info:
  features:
  - name: rank
    dtype: int64
  - name: domain
    dtype: large_string
  - name: out_weight
    dtype: int64
  - name: score
    dtype: float64
  - name: spamicity
    dtype: float64
  splits:
  - name: train
    num_bytes: 2794302006
    num_examples: 65973832
  download_size: 1902149524
  dataset_size: 2794302006
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
---
# Dataset Card for domain-resurrect
This dataset ranks domains on the Internet by popularity, based on the links between them. The data is derived from [CommonCrawl](https://commoncrawl.org).
See [mbrt/domain-resurrect-edges](https://huggingface.co/datasets/mbrt/domain-resurrect) for the companion dataset from which this one was computed, and
the [blog post](https://blog.mbrt.dev/posts/domain-resurrect/) for how this was done.
## Dataset Details
### Dataset Description
This dataset is a processed version of the [CommonCrawl September 2025 crawl](https://commoncrawl.org/blog/september-2025-crawl-archive-now-available).
Each row represents a domain name (more precisely, an eTLD+1 domain) on the Internet, along with the following information:
* `rank`: the domain's position in the popularity ranking (lower means more popular).
* `domain`: the domain name.
* `out_weight`: the number of outgoing links found on the domain.
* `score`: the popularity score, in the range [0, 1], from which the ranking is derived.
* `spamicity`: a measure of "spam mass", which roughly states how likely it is that following links from this domain leads to a spam site. Note that these values are not probabilities: a value of 1 doesn't imply certainty of spam, only that some arbitrary threshold of "spamicity" was reached.
Rows are sorted based on `rank` (more popular domains first).
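As a minimal sketch of how the columns above might be used, the following streams the rows and keeps the first few domains below a spamicity cutoff. It assumes the Hugging Face `datasets` package is installed and that streaming yields rows in file order (so the `rank` ordering is preserved). The repo id is taken from this page's title, and the helper names and the `0.5` cutoff are illustrative, not part of the dataset.

```python
from itertools import islice
from typing import Iterable, Iterator


def stream_rows() -> Iterator[dict]:
    """Stream rows from the Hub instead of downloading the whole dataset first."""
    # Imported lazily so the helper below also works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: repo id as shown in this page's title.
    ds = load_dataset("shebatec/domain-resurrect", split="train", streaming=True)
    return iter(ds)


def top_clean_domains(
    rows: Iterable[dict], n: int, max_spamicity: float = 0.5
) -> list[str]:
    """Return the first `n` domains whose spamicity is at most `max_spamicity`.

    Relies on rows arriving sorted by `rank`, so the result is in rank order.
    The default cutoff is an arbitrary illustrative value.
    """
    clean = (r["domain"] for r in rows if r["spamicity"] <= max_spamicity)
    return list(islice(clean, n))


# Usage (needs network access and the `datasets` package):
#   top_clean_domains(stream_rows(), n=10)
```

Because the filter is lazy and stops after `n` matches, only the first part of the roughly 66 million rows is read when the top-ranked domains are mostly below the cutoff.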
- **Curated by:** [Michele Bertasi](https://mbrt.dev)
- **License:** MIT
### Dataset Source
- **Repository:** https://github.com/mbrt/domain-resurrect
- **Blog:** https://blog.mbrt.dev/posts/domain-resurrect/
- **Source data:** [CommonCrawl September 2025 crawl](https://commoncrawl.org/blog/september-2025-crawl-archive-now-available)
Provider:
shebatec



