ClueWeb12
Source: Research Data Australia. Indexed 2024-12-14.
Download link:
https://researchdata.edu.au/clueweb12/616603
Resource description:
The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 870,043,929 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.
Significance statement
The ClueWeb12 dataset is the largest, most complete sample of the broader internet readily available for academic research. Developed by Carnegie Mellon University, it is commonly seen as the benchmark for large-scale information retrieval experiments.
ClueWeb12's size requires that it be distributed on standalone HDDs, delivered via international freight. The main savings and efficiencies from mirroring this data will come from reduced bandwidth requirements, achieved by permanently co-locating an accessible copy of the full dataset close to computational assets. Subject to ClueWeb12 licensing requirements, hosting the dataset will give other researchers access without repeating the order, delivery, and upload process.
With regard to our immediate use of this data, our current project relates to social analytics. Although that term is a common catchphrase in the current literature, our research differs markedly by focusing on the emerging field of large-scale digital forensics. Whereas commercially focused research may identify potential customers for a product, our research is designed to identify indicators of illegal and/or dangerous behavior, for example child exploitation or recruitment to violent extremism. To date, we have presented our proposals and early progress to international and domestic counter-terrorism and counter-extremism researchers and organisations, with a great deal of interest emerging from foreign government and research bodies.
Provider:
Monash University
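Dataset summary
Background overview
ClueWeb12 is a large-scale dataset of English web pages, containing about 870 million pages collected in 2012 and designed for information retrieval research. It is distributed on physical media, supports emerging research such as digital forensics, was developed by Carnegie Mellon University, and is a standard benchmark dataset in academia.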
The summary above was collected and generated by 遇见数据集.



