commoncrawl/gneissweb-annotation-host-testing-v1
Source: Hugging Face | Updated: 2025-12-11 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/commoncrawl/gneissweb-annotation-host-testing-v1
Description:
GneissWeb Annotations is a dataset of quality and category annotations applied to the Common Crawl corpus, powered by IBM Research's GneissWeb methodology. It enables precise filtering of web content across medical, educational, technology, and scientific domains, making it easier to build high-quality corpora for research projects, language models, and specialized applications. Annotations are provided at two levels of granularity: host-level (aggregate statistics for entire domains) and URL-level (classifications of individual URLs). The dataset is built with the GneissWeb Bloom filter made publicly available by IBM, IBM's Data Prep Kit (now a Linux Foundation AI & Data project), and the GneissWeb group's category classifiers.
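The relationship between the two granularities can be illustrated with a minimal sketch. It assumes hypothetical URL-level records with the fields `url`, `high_quality`, and `category`; these names and the sample values are illustrative only and are not the dataset's actual schema. The sketch rolls URL-level labels up into per-host counts and then selects hosts by the share of their URLs in a given category.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical URL-level records; field names and values are illustrative
# assumptions, not the dataset's actual schema.
url_annotations = [
    {"url": "https://example.org/health/flu", "high_quality": True, "category": "medical"},
    {"url": "https://example.org/blog/misc", "high_quality": False, "category": None},
    {"url": "https://example.edu/course/algebra", "high_quality": True, "category": "education"},
]


def aggregate_by_host(records):
    """Roll URL-level annotations up into per-host counts."""
    hosts = defaultdict(
        lambda: {"urls": 0, "high_quality": 0, "categories": defaultdict(int)}
    )
    for rec in records:
        host = urlsplit(rec["url"]).hostname
        stats = hosts[host]
        stats["urls"] += 1
        stats["high_quality"] += int(rec["high_quality"])
        if rec["category"] is not None:
            stats["categories"][rec["category"]] += 1
    return hosts


def hosts_matching(hosts, category, min_fraction):
    """Return hosts where at least min_fraction of URLs carry the category."""
    return sorted(
        host
        for host, s in hosts.items()
        if s["categories"].get(category, 0) / s["urls"] >= min_fraction
    )


host_stats = aggregate_by_host(url_annotations)
medical_hosts = hosts_matching(host_stats, "medical", 0.5)
```

In practice the host-level annotations already supply such aggregates, so a filter of this kind could be applied to them directly, with URL-level records used only when finer control is needed.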
Provided by:
commoncrawl



