Curlie Dataset - Language-agnostic Website Embedding and Classification
Source: DataCite Commons. Updated: 2023-01-24. Indexed: 2024-07-29.
Download link:
https://figshare.com/articles/dataset/Curlie_Dataset_-_Language-agnostic_Website_Embedding_and_Classification/19406693
Description:
**************** Full Curlie dataset ****************
Curlie.org presents itself as the largest human-edited directory of the Web. It contains more than 3 million multilingual webpages classified in a hierarchical taxonomy. The taxonomy is language-specific, but every language shares the same 14 top-level categories. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content, so we release our own dataset, obtained by an in-depth scraping of the Curlie website. The dataset contains webpage URLs along with the category path (label) under which each is referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter_Sports/Skiing/Associations. Category paths are language-specific, and we provide a mapping between English labels and their counterparts in other languages for alignment. The URLs have been filtered to keep only homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).
<strong>curlie.csv.gz</strong> > [url, uid, label, lang] x 2,275,150 samples
<strong>mapping.json.gz</strong> > [english_label, matchings] x 35,946 labels
**************** Processed Curlie dataset ****************
We provide here the ground-truth data used to train Homepage2Vec. The URLs have been filtered further: websites listed under the Regional top-level category are dropped, as are inaccessible websites. This filtering yields 933,416 valid entries. The labels are aligned across languages and reduced to the 14 top-level categories (classes). There are 885,582 distinct URLs, and the classes of each are represented as a binary class vector (a URL can belong to multiple classes). We provide the HTML content of each distinct URL. We also provide a visual encoding, obtained by passing a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.
<strong>curlie_filtered.csv.gz</strong> > [url, uid, label, lang] x 933,416 samples
<strong>class_vector.json.gz</strong> > [uid, class_vector] x 885,582 samples
<strong>html_content.json.gz</strong> > [uid, html] x 885,582 samples
<strong>visual_encoding.json.gz</strong> > [uid, visual_encoding] x 885,582 samples
<strong>class_names.txt</strong> > [class_name] x 14 classes
<strong>train_uid.txt</strong> > [uid] x 797,023 samples
<strong>test_uid.txt</strong> > [uid] x 88,559 samples
**************** Enriched Curlie dataset ****************
Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent space embedding (100 dimensions).
<strong>outputs.json.gz</strong> > [uid, url, score, embedding] x 885,582 samples
**************** Pretrained Homepage2Vec ****************
<strong>h2v_1000_100.zip</strong> > Model pretrained on all features
<strong>h2v_1000_100_text_only.zip</strong> > Model pretrained on textual features only (no visual features from screenshots)
**************** Notes ****************
The CSV files can be read with Python:
<em>import pandas as pd</em>
<em>df = pd.read_csv("curlie.csv.gz", index_col=0)</em>
The JSON files have one record per line and can be read with Python:
<em>import json</em>
<em>import gzip</em>
<em>with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:</em>
<em>    for line in file:</em>
<em>        data = json.loads(line)</em>
<em>        …</em>
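The reading pattern in the notes can be extended to turn the binary class vectors into readable labels for one split. The following is a minimal sketch, not the authors' code. It assumes each record in class_vector.json.gz is a JSON object with the keys "uid" and "class_vector" (inferred from the field lists above, not confirmed), that class_names.txt and train_uid.txt hold one value per line, and that the order of class_names.txt matches the positions of the class vector. All function names are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterator, List


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:
                yield json.loads(line)


def read_lines(path: str) -> List[str]:
    """Read a plain-text file holding one value per line (uids or class names)."""
    with open(path, encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]


def decode_classes(class_vector: List[int], class_names: List[str]) -> List[str]:
    """Return the names of the classes whose flag is set in a binary vector."""
    if len(class_vector) != len(class_names):
        raise ValueError("class vector and class names differ in length")
    return [name for flag, name in zip(class_vector, class_names) if flag]


def load_split_labels(vector_path: str, names_path: str, uid_path: str) -> Dict[str, List[str]]:
    """Map each uid of one split to its class names.

    Assumption: records carry the keys "uid" and "class_vector".
    Uids are compared as strings, since the text file yields strings
    while the JSON records may store integers.
    """
    class_names = read_lines(names_path)
    wanted = set(read_lines(uid_path))
    labels: Dict[str, List[str]] = {}
    for record in iter_json_lines(vector_path):
        uid = str(record["uid"])
        if uid in wanted:
            labels[uid] = decode_classes(record["class_vector"], class_names)
    return labels
```

With the files downloaded to the working directory, a call such as load_split_labels("class_vector.json.gz", "class_names.txt", "train_uid.txt") would return a dictionary keyed by uid. Streaming the file line by line keeps memory use low, which matters more for html_content.json.gz than for the class vectors.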
Provider:
figshare
Created:
2022-03-23



