
shebatec/domain-resurrect

Hugging Face | Updated 2025-12-20 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/shebatec/domain-resurrect
Description:
---
dataset_info:
  features:
  - name: rank
    dtype: int64
  - name: domain
    dtype: large_string
  - name: out_weight
    dtype: int64
  - name: score
    dtype: float64
  - name: spamicity
    dtype: float64
  splits:
  - name: train
    num_bytes: 2794302006
    num_examples: 65973832
  download_size: 1902149524
  dataset_size: 2794302006
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
---

# Dataset Card for domain-resurrect-edges

This dataset contains link counts between domains on the Internet. The data is based on [CommonCrawl](https://commoncrawl.org).

See [mbrt/domain-resurrect-edges](https://huggingface.co/datasets/mbrt/domain-resurrect) for the companion dataset from which this was computed, and the [Blog post](https://blog.mbrt.dev/posts/domain-resurrect/) on how this was done.

## Dataset Details

### Dataset Description

This dataset is a processed version of the [CommonCrawl September crawl](https://commoncrawl.org/blog/september-2025-crawl-archive-now-available). Each row represents a domain name (more precisely, an eTLD+1 domain) on the Internet, along with the following information:

* `rank`: its position in the popularity ranking (lower is better).
* `domain`: the domain name.
* `out_weight`: how many outgoing links were found in it.
* `score`: the popularity score in [0, 1], from which the ranking was set.
* `spamicity`: a measure of "spam mass", which roughly states how likely it is that by following some links in this domain we will end up in some spam site. Note that scores are not probabilities in this case, so a score of 1 doesn't imply certainty of spam, just that some arbitrary threshold of "spamicity" was reached.

Rows are sorted based on `rank` (more popular domains first).

- **Curated by:** [Michele Bertasi](https://mbrt.dev)
- **License:** MIT

### Dataset Source

- **Repository:** https://github.com/mbrt/domain-resurrect
- **Blog:** https://blog.mbrt.dev/posts/domain-resurrect/
- **Source data:** [CommonCrawl September crawl](https://commoncrawl.org/blog/september-2025-crawl-archive-now-available)
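
As a rough illustration of how the columns described in this card might be used, the sketch below keeps the most popular domains whose `spamicity` is at or below a cutoff. The helper names (`low_spam_domains`, `top_n`, `stream_train_split`), the 0.5 cutoff, and the sample rows are all assumptions made for this example and are not part of the dataset. Because rows are sorted by `rank`, taking the first matches from a stream should give the most popular ones without downloading the full split.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def low_spam_domains(rows: Iterable[Dict], max_spamicity: float = 0.5) -> Iterator[Dict]:
    """Yield rows whose spamicity is at or below the cutoff.

    The cutoff is an arbitrary assumption; the card notes that spamicity
    is not a probability.
    """
    for row in rows:
        if row["spamicity"] <= max_spamicity:
            yield row


def top_n(rows: Iterable[Dict], n: int, max_spamicity: float = 0.5) -> List[str]:
    """Return the first n low-spam domain names, preserving input order."""
    return [row["domain"] for row in islice(low_spam_domains(rows, max_spamicity), n)]


def stream_train_split():
    """Stream the train split from the Hub (needs the `datasets` package and network)."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset("shebatec/domain-resurrect", split="train", streaming=True)


# Hypothetical rows that only mimic the schema; the values are invented.
SAMPLE_ROWS = [
    {"rank": 1, "domain": "example-a.com", "out_weight": 120, "score": 0.9, "spamicity": 0.0},
    {"rank": 2, "domain": "example-b.net", "out_weight": 45, "score": 0.7, "spamicity": 0.9},
    {"rank": 3, "domain": "example-c.org", "out_weight": 10, "score": 0.4, "spamicity": 0.2},
]

if __name__ == "__main__":
    print(top_n(SAMPLE_ROWS, n=2))
```

To run the same filter on the real data, `top_n(stream_train_split(), n=100)` could be used instead of the sample rows; streaming avoids materialising the roughly 66 million rows locally.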
Provider:
shebatec