
Curlie Dataset - Language-agnostic Website Embedding and Classification

Source: DataCite Commons | Updated: 2022-03-24 | Indexed: 2024-07-29
Download link:
https://figshare.com/articles/dataset/Curlie_Dataset_-_Language-agnostic_Website_Embedding_and_Classification/19406693/2
Description:
**************** Full Curlie dataset ****************

This dataset contains the URLs scraped from curlie.org along with their multilingual labels. Each label corresponds to the sub-category under which the URL was referenced in Curlie. We also provide a mapping between English labels and labels in other languages for alignment. The URLs have been filtered to contain only homepages. Each distinct URL is indexed with a unique identifier (uid).

curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
mapping.json.gz > [english_label, matchings] x 35,946 labels


**************** Processed Curlie dataset ****************

This is the data used to train Homepage2Vec. The URLs have been filtered further: websites listed under the Regional top-category were dropped, as were non-accessible websites. This filtering yields 1,018,207 valid URLs. The labels are aligned across languages and reduced to the 14 top-categories (classes).

Because a URL can belong to several classes, a binary class vector is used. Grouping the samples by URL yields 885,582 distinct URLs; for each of them we provide the HTML content. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet.

The training and testing sets are also given.

curlie_filtered.csv.gz > [url, uid, label, lang] x 1,018,207 samples

class_vector.json.gz > [uid, class_vector] x 885,582 samples
class_names.txt > [class_name] x 14 classes

html_content.json.gz > [uid, html] x 885,582 samples
visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples

train_uid.txt > [uid] x 797,023 samples
test_uid.txt > [uid] x 88,559 samples

**************** Pretrained Homepage2Vec ****************

h2v_1000_100.zip > Model pretrained on all features
h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)

**************** Enriched Curlie dataset ****************

Thanks to Homepage2Vec, we release an enriched version of Curlie. Each URL is associated with a class probability vector and with an embedding in the latent space.

outputs.json.gz > [url, uid, prediction, embedding] x 885,582 samples


******** Notes ********

JSON files have one record per line and can be read with Pandas, e.g. pandas.read_json(file, orient='records', lines=True, compression='gzip')
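The Notes say that the JSON files hold one record per line. For files that may be too large to load in one call (html_content.json.gz is a likely candidate), the records can also be streamed one at a time with the Python standard library. The sketch below is illustrative only. It assumes the JSON keys match the column names listed above (uid, class_vector) and that the entries of class_vector follow the order of class_names.txt; neither assumption is confirmed by the description. The helper names iter_records and decode_classes are hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per line from a gzip-compressed, one-record-per-line JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def decode_classes(class_vector: list, class_names: list) -> list:
    """Return the class names whose position in the binary vector is set.

    Assumption: class_vector[i] refers to class_names[i].
    """
    if len(class_vector) != len(class_names):
        raise ValueError("class_vector and class_names differ in length")
    return [name for flag, name in zip(class_vector, class_names) if flag]
```

A possible use, under the same assumptions: read class_names.txt into a list with one name per line, then loop over iter_records("class_vector.json.gz") and call decode_classes(record["class_vector"], class_names) for each record. Streaming keeps only one record in memory at a time, which is the main reason to prefer it over a single pandas.read_json call on the larger files.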
Provider:
figshare
Created:
2022-03-24