Comprehensive set of Sitemap and robots.txt links extracted from Common Crawl
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10511291
Description:
This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the robots.txt WARC archive dump of Common Crawl crawl 2023-50 (the latest dump at the time of release).
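For readers who want a sense of how sitemap links can be recovered from robots.txt content, here is a minimal sketch in Python. It is not the dataset authors' pipeline: the function name `extract_sitemap_links` and the sample input are hypothetical, and the suffix filter simply mirrors the statement in this record that all sitemap links end with .xml or .xml.gz. It assumes the robots.txt body has already been read out of the archive as text.

```python
import re

# Matches a "Sitemap: <url>" line; the field name is matched case-insensitively.
SITEMAP_LINE = re.compile(r"^\s*sitemap\s*:\s*(\S+)", re.IGNORECASE)


def extract_sitemap_links(robots_txt: str) -> list[str]:
    """Return unique sitemap URLs found in one robots.txt body (hypothetical helper)."""
    links = []
    seen = set()
    for line in robots_txt.splitlines():
        # Treat everything after '#' as a comment. This is a simplification:
        # a URL containing a '#' fragment would be truncated here.
        line = line.split("#", 1)[0]
        match = SITEMAP_LINE.match(line)
        if not match:
            continue
        url = match.group(1)
        # Assumption: keep only links ending in .xml or .xml.gz,
        # matching the description of the sitemap list in this record.
        if url.lower().endswith((".xml", ".xml.gz")) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /private\n"
        "Sitemap: https://example.com/sitemap.xml\n"
        "sitemap: https://example.com/news.xml.gz # news\n"
    )
    for link in extract_sitemap_links(sample):
        print(link)
```

Deduplicating per file keeps the first occurrence of each link in its original order; deduplication across the whole dump would need a separate step.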
Sitemaps:
32,252,027 links (all ending with .xml or .xml.gz); 395.2 MB (compressed)
Website categories; 2.2 MB (compressed)
Top-level label (Curlie.org directory)    Number of sitemap links
Arts                                      20110
Business                                  68690
Computers                                 17404
Games                                     3068
Health                                    13999
Home                                      4130
Kids_and_Teens                            2240
News                                      5855
Recreation                                19273
Reference                                 10862
Regional                                  419
Science                                   10729
Shopping                                  29903
Society                                   35019
Sports                                    12597
Robots.txt files:
41,611,704 links; 440.9 MB (compressed)
Website categories; 2.7 MB (compressed)
Top-level label (Curlie.org directory)    Number of robots.txt links
Arts                                      25281
Business                                  79497
Computers                                 21880
Games                                     5037
Health                                    17326
Home                                      5401
Kids_and_Teens                            3753
News                                      3424
Recreation                                26355
Reference                                 15404
Regional                                  678
Science                                   16500
Shopping                                  30266
Society                                   45397
Sports                                    18029
Created:
2024-03-08



