Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification

Source: NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/10413067

Description:

Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

This dataset contains two subsets of labeled website data, created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of the existing Homepage2Vec training data.

Key Features:

- LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
- Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, as evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
- Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

Dataset Composition:

- curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and one-shot
- curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

Intended Use:

- Fine-tuning and advancing Homepage2Vec or similar website classification models
- Research on LLM-generated datasets for text classification tasks
- Exploration of multilingual website classification

Additional Information:

- Project and report repository: https://github.com/CS-433/ml-project-2-mlp

Acknowledgments: This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
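
The macro F1 score cited in the description averages the per-label F1 scores, so rare topics count as much as common ones. The following is a minimal sketch of that metric for multi-label predictions, not the project's actual evaluation code. The function name, the three-label setup, and the toy label matrices are assumptions made up for illustration; they are not taken from the dataset.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro-averaged F1 for multi-label data given as 0/1 indicator rows.

    Each row is one website; each column is one topic label.
    """
    n_labels = len(y_true[0])
    per_label_f1 = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no positives and no predictions scores 0.
        per_label_f1.append(2 * tp / denom if denom else 0.0)
    # Unweighted mean over labels is what makes the average "macro".
    return sum(per_label_f1) / n_labels


# Hypothetical toy data: 4 websites, 3 topic labels (not from the dataset).
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

In this toy case the per-label F1 scores are 1, 2/3, and 2/3, so the macro average is 7/9. A label that the model never predicts correctly pulls the macro average down regardless of how rare it is, which is why the metric suits a setting with many topic categories of uneven frequency.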

Created:

2023-12-21
