mjbommar/SHELF
Source: Hugging Face. Updated 2025-12-14; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/SHELF
Resource overview:
SHELF is a synthetic benchmark for evaluating language model fitness on bibliographic classification, retrieval, and clustering tasks using Library of Congress taxonomies. The dataset contains 42,532 synthetic documents annotated with LCC (Library of Congress Classification), LCGFT (Library of Congress Genre/Form Terms), topic, geography, audience, and register labels. It supports document classification, document retrieval, document clustering, and pair classification. The dataset is organized into multiple configurations, such as default, same_lcc_pairs, and same_form_pairs, each with its own data fields and splits. The documents were generated with multiple frontier language models and then quality-filtered and annotated.
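As a rough illustration of how the configurations named above might be loaded, here is a minimal sketch using the Hugging Face `datasets` library. It assumes the repository ID `mjbommar/SHELF` (taken from the listing title) resolves on the Hub and that `datasets` is installed. The helper `load_shelf` and the constant names are hypothetical, and split and field names are not given in this listing, so the sketch does not assume any.

```python
# Sketch only: repository ID and configuration names come from the listing;
# load_shelf, REPO_ID, and NAMED_CONFIGS are hypothetical names.
REPO_ID = "mjbommar/SHELF"

# Configurations explicitly named in the overview. The listing says "etc.",
# so this tuple is not a complete list.
NAMED_CONFIGS = ("default", "same_lcc_pairs", "same_form_pairs")


def load_shelf(config="default"):
    """Load one SHELF configuration from the Hugging Face Hub.

    Returns whatever `datasets.load_dataset` returns for that configuration
    (typically a DatasetDict keyed by split name).
    """
    # Imported lazily so the module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config)


# Example call (requires network access):
# pairs = load_shelf("same_lcc_pairs")
# print(pairs)  # shows the splits and fields actually available
```

Printing the returned object is a reasonable first step, since it reveals the actual splits and columns for each configuration before any task-specific code is written.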
Provider:
mjbommar



