DMOZ 2006 Dataset and its Wikification

Source: Mendeley Data. Updated 2024-03-27; indexed 2024-06-27.
Download link:
https://data.mendeley.com/datasets/9mpgz8z257
Description:
This dataset was retrieved with a crawler in 2006 from the Open Directory Project (ODP) (http://dmoz.org, https://en.wikipedia.org/wiki/DMOZ), which closed in 2017 and was reborn as Curlie (https://curlie.org/). The topics were selected from the third level of the ODP hierarchy. To ensure the quality of the dataset, two constraints were imposed on this selection: each selected topic had to contain at least 100 URLs, and the language was restricted to English. For each topic, we collected all of its URLs as well as those in its subtopics. The retrieved HTML was parsed and cleaned to remove empty files, PDF and Flash files, and other unusable content. In total, more than 350K pages were collected from 448 topics. In 2018, the data was wikified, that is, annotated with links to Wikipedia concepts.
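The topic selection described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the function name `select_topics`, the `(url, topic_path)` input format, and the slash-separated paths that omit the ODP root (so that "Arts/Music/Jazz" counts as a third-level topic) are all assumptions made for the example. Language filtering and HTML cleaning are not covered.

```python
from collections import defaultdict

MIN_URLS = 100   # minimum topic size stated in the description
TOPIC_LEVEL = 3  # topics are taken from the third level of the hierarchy


def select_topics(url_topics, min_urls=MIN_URLS, level=TOPIC_LEVEL):
    """Group URLs under their third-level ancestor topic and keep large topics.

    `url_topics` is a hypothetical iterable of (url, topic_path) pairs, where
    topic_path is a slash-separated category path without the root,
    e.g. "Arts/Music/Jazz/Bebop" (assumed format, not taken from the dataset).
    """
    grouped = defaultdict(set)
    for url, path in url_topics:
        parts = path.strip("/").split("/")
        if len(parts) < level:
            continue  # category is shallower than the selected level
        # URLs in subtopics are credited to their level-3 ancestor.
        topic = "/".join(parts[:level])
        grouped[topic].add(url)
    # Keep only topics that meet the minimum size.
    return {t: sorted(urls) for t, urls in grouped.items() if len(urls) >= min_urls}
```

Calling `select_topics` on a list of such pairs returns a dictionary from each retained topic to its sorted URLs; topics with fewer than `min_urls` distinct URLs are dropped.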
Created:
2024-01-23