gliclass-v3-logic-dataset

Name: gliclass-v3-logic-dataset
Creator: maas
Published: 2025-12-04 16:42:12
License: No description available

ModelScope community. Updated 2025-12-04. Indexed 2025-07-26.

Download link:

https://modelscope.cn/datasets/knowledgator/gliclass-v3-logic-dataset

Resource description:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6405f62ba577649430be5124/I9RAQol7giilBHbbf2T7M.png)

# GLiClass‑V3 Logic Dataset

**Rows** 7,776 | **Split** train only | **Format** Parquet | **Language** EN | **License** Apache‑2.0

## What it is

A length‑balanced corpus of single‑sentence prompts built purely for inducing reasoning in language models.

## Why it helps

* Teaches symbolic‑logic patterns and multi‑label behaviour.
* Buckets cover 15 word‑length ranges (4 → 1,024 words) in equal proportions, exposing models to both tiny and very long inputs.
* Each example has **1‑50 true** and **1‑50 false** labels, forcing the model to cope with large, variable answer sets.

## Where the prompts come from

Re‑annotated snippets drawn from four public resources:

| Source dataset | Notes |
|----------------|-------|
| **FineWeb** (clean web crawl) | Plain sentences automatically filtered for quality, then labelled with an LLM. |
| **tau/CommonsenseQA** | Question stems only; each converted to a declarative premise and re‑labelled multi‑label style. |
| **GLiClass‑2k prototype** (`BioMike/formal‑logic‑reasoning‑gliclass‑2k`) | Earlier formal‑logic items. |
| **nyu‑mll/MultiNLI** | Premise/hypothesis pairs. |

## Data schema

| Column | Type | Notes |
|---------------|-----------------|------------------------------------------|
| `text` | string | Sentence or short passage. |
| `true_labels` | list\<string\> | All correct answers. |
| `all_labels` | list\<string\> | `true_labels` + distractors (shuffled). |

## Quick load

```python
from datasets import load_dataset

# The dataset ships a single "train" split.
ds = load_dataset("knowledgator/gliclass-v3-logic-dataset")["train"]
```

## Citation

```bibtex
@misc{stepanov2025gliclassgeneralistlightweightmodel,
  title={GLiClass: Generalist Lightweight Model for Sequence Classification Tasks},
  author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov and Alexander Yavorskyi and Mykyta Yaroshenko},
  year={2025},
  eprint={2508.07662},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.07662},
}
```
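
As an illustration of the data schema above, the following minimal sketch derives the distractor labels and a multi-hot target vector for one row. It assumes that every entry of `true_labels` also appears in `all_labels`, as the schema notes state. The function name `split_labels`, the output keys `false_labels` and `targets`, and the sample row are hypothetical and are not part of the dataset.

```python
from typing import Any, Dict, List


def split_labels(example: Dict[str, Any]) -> Dict[str, List]:
    """Derive distractor labels and a multi-hot target for one row.

    Assumes the documented schema: `true_labels` is a subset of
    `all_labels`. The returned keys are hypothetical names chosen
    for this sketch, not columns of the dataset.
    """
    true_set = set(example["true_labels"])
    missing = true_set - set(example["all_labels"])
    if missing:
        raise ValueError(f"true labels missing from all_labels: {sorted(missing)}")
    # Keep the (shuffled) order of all_labels so targets align with it.
    false_labels = [lab for lab in example["all_labels"] if lab not in true_set]
    targets = [1 if lab in true_set else 0 for lab in example["all_labels"]]
    return {"false_labels": false_labels, "targets": targets}


# Made-up row in the documented schema, for illustration only.
row = {
    "text": "All birds have feathers, and a sparrow is a bird.",
    "true_labels": ["a sparrow has feathers"],
    "all_labels": [
        "a sparrow is a fish",
        "a sparrow has feathers",
        "no bird has feathers",
    ],
}
result = split_labels(row)
print(result["false_labels"])
print(result["targets"])
```

With the `datasets` library, a function of this shape can typically be applied to every row via `ds.map(split_labels)`, which adds the returned keys as new columns.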

Provider:

maas

Created:

2025-07-19
