KRLabsOrg/verbatim-spans

Name: KRLabsOrg/verbatim-spans
Creator: KRLabsOrg
Published: 2026-04-24 17:38:54
License: Apache 2.0

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/KRLabsOrg/verbatim-spans

Description:

Verbatim Spans is a multi-domain training dataset for query-conditioned extractive evidence selection. Given a question and a passage, the task is to highlight the verbatim substrings of the passage that support the answer. The dataset combines three sources covering distinct domains and annotation conventions: ACL silver (NLP research papers), RAGBench (finance/medical/legal/general QA), and Squeez (code/SWE-bench tool outputs). The dataset is designed for training a generic span-highlighter encoder, intended for use with a ModernBERT token classifier. It includes two configs: canonical (one row per (question, chunk) pair with raw text) and encoder (pretokenized, ready for direct training). All labels are LLM-produced, not strictly human-annotated. The dataset is licensed under Apache 2.0.
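The step from verbatim spans to token-level labels can be sketched as follows. This is a minimal illustration only: the function names, the example passage, and the whitespace tokenizer are assumptions made for the sketch, not the dataset's actual schema or preprocessing. The encoder config presumably uses a subword tokenizer suited to ModernBERT.

```python
import re


def find_span_offsets(passage, spans):
    """Locate each verbatim span in the passage; return sorted (start, end) character offsets."""
    offsets = []
    for span in spans:
        start = passage.find(span)  # first occurrence only, for simplicity
        if start == -1:
            raise ValueError(f"span is not a verbatim substring: {span!r}")
        offsets.append((start, start + len(span)))
    return sorted(offsets)


def token_labels(passage, spans):
    """Label each whitespace-delimited token 1 if it overlaps any span, else 0.

    Whitespace tokenization is a stand-in (assumption) for a real subword tokenizer.
    """
    offsets = find_span_offsets(passage, spans)
    tokens, labels = [], []
    for match in re.finditer(r"\S+", passage):
        tokens.append(match.group())
        overlaps = any(
            match.start() < end and start < match.end() for start, end in offsets
        )
        labels.append(1 if overlaps else 0)
    return tokens, labels


# Hypothetical (question, chunk) pair, invented for illustration.
question = "How often is the cache flushed?"
passage = "The cache is flushed every 30 seconds. Logs rotate daily."
evidence = ["flushed every 30 seconds"]

tokens, labels = token_labels(passage, evidence)
for token, label in zip(tokens, labels):
    print(label, token)
```

A token is marked positive when it overlaps a span at all, so trailing punctuation attached to a token (such as the period in "seconds.") is highlighted along with it. A stricter rule, such as requiring full containment, is equally plausible; the source does not say which convention the dataset follows.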

Provider:

KRLabsOrg
