
Comprehensive set of Sitemap and robots.txt links extracted from Common Crawl

NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://zenodo.org/record/10511291
Description:
This is a comprehensive list of links to sitemaps and robots.txt files, which are extracted from the latest WARC Archive dump 2023-50 of robots.txt files.

Sitemaps: 32,252,027 links (all ending with .xml or .xml.gz); 395.2 MB (compressed)
Website categories: 2.2 MB (compressed)

Top level labels of Curlie.org directory    Number of sitemap links
Arts                                        20110
Business                                    68690
Computers                                   17404
Games                                       3068
Health                                      13999
Home                                        4130
Kids_and_Teens                              2240
News                                        5855
Recreation                                  19273
Reference                                   10862
Regional                                    419
Science                                     10729
Shopping                                    29903
Society                                     35019
Sports                                      12597

Robots.txt files: 41,611,704 links; 440.9 MB (compressed)
Website categories: 2.7 MB (compressed)

Top level labels of Curlie.org directory    Number of robots.txt links
Arts                                        25281
Business                                    79497
Computers                                   21880
Games                                       5037
Health                                      17326
Home                                        5401
Kids_and_Teens                              3753
News                                        3424
Recreation                                  26355
Reference                                   15404
Regional                                    678
Science                                     16500
Shopping                                    30266
Society                                     45397
Sports                                      18029
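As an illustration of how sitemap links can be pulled out of robots.txt content, here is a minimal Python sketch. It is not the dataset authors' extraction code: the function names `extract_sitemap_links` and `keep_xml_sitemaps`, the regular expression, and the sample robots.txt text are all assumptions made for this example. It only mirrors the description above, where sitemap links come from robots.txt files and all end with .xml or .xml.gz.

```python
import re

# Hypothetical pattern (an assumption, not the dataset's own rule): a
# "Sitemap:" directive, matched case-insensitively, followed by a URL.
SITEMAP_LINE = re.compile(r"^\s*sitemap\s*:\s*(\S+)", re.IGNORECASE)


def extract_sitemap_links(robots_txt):
    """Return the URLs named in Sitemap directives of a robots.txt body."""
    links = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; note this also cuts URLs containing '#'.
        line = line.split("#", 1)[0]
        match = SITEMAP_LINE.match(line)
        if match:
            links.append(match.group(1))
    return links


def keep_xml_sitemaps(links):
    """Keep only links ending with .xml or .xml.gz, as in the description."""
    return [url for url in links if url.lower().endswith((".xml", ".xml.gz"))]


# Made-up sample input for illustration only.
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news.xml.gz  # compressed
Sitemap: https://example.com/sitemap.txt
"""

all_links = extract_sitemap_links(sample)
xml_links = keep_xml_sitemaps(all_links)
print(len(all_links), len(xml_links))
```

The suffix filter is kept separate from the directive parsing so that either step can be swapped out, for example to accept plain-text sitemaps as well.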
Created:
2024-03-08