gliclass-v3-logic-dataset
Source: ModelScope community | Updated: 2025-12-04 | Indexed: 2025-07-26
Download link:
https://modelscope.cn/datasets/knowledgator/gliclass-v3-logic-dataset

# GLiClass‑V3 Logic Dataset
**Rows** 7 776 | **Split** train only | **Format** Parquet | **Language** EN | **License** Apache‑2.0
## What it is
A length‑balanced corpus of single‑sentence prompts built purely for inducing reasoning in language models.
## Why it helps
* Teaches symbolic‑logic patterns and multi‑label behaviour.
* Buckets cover 15 word‑length ranges (4 → 1,024) in equal proportions, exposing models to both tiny and very long inputs.
* Each example has **1‑50 true** and **1‑50 false** labels, forcing the model to cope with large, variable answer sets.
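
The length bucketing mentioned above can be illustrated with a small sketch. The card does not list the 15 bucket boundaries, so the `EXAMPLE_EDGES` below are purely hypothetical, and whitespace splitting is only an assumed way of counting words.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges: the dataset card does not publish the real ones.
EXAMPLE_EDGES = [4, 8, 16, 32, 64, 128, 256, 512, 1024]


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def bucket_index(n_words: int, edges: list[int]) -> int:
    """Index i of the half-open range [edges[i], edges[i + 1]) holding n_words.

    Returns -1 when n_words falls outside the span covered by the edges.
    """
    if n_words < edges[0] or n_words >= edges[-1]:
        return -1
    return bisect_right(edges, n_words) - 1


def bucket_histogram(texts: list[str], edges: list[int]) -> Counter:
    """Tally how many texts land in each bucket."""
    return Counter(bucket_index(word_count(t), edges) for t in texts)


# Made-up texts of 5, 3, and 40 words.
sample = ["All birds can fly .", "one two three", " ".join(["w"] * 40)]
hist = bucket_histogram(sample, EXAMPLE_EDGES)
```

Applied to the `text` column, a roughly flat histogram would be consistent with the "equal proportions" claim, provided the chosen edges match the ones actually used.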
## Where the prompts come from
Re‑annotated snippets drawn from four public resources:
| Source dataset | Notes |
|----------------|-------|
| **FineWeb** (clean web crawl) | Plain sentences automatically filtered for quality, then labelled with an LLM. |
| **tau/CommonsenseQA** | Question stems only; each converted to a declarative premise and re‑labelled multi‑label style. |
| **GLiClass‑2k prototype** (`BioMike/formal‑logic‑reasoning‑gliclass‑2k`) | Earlier formal‑logic items. |
| **nyu‑mll/MultiNLI** | Premise/hypothesis pairs. |
## Data schema
| Column | Type | Notes |
|---------------|-----------------|------------------------------------------|
| `text` | string | Sentence or short passage. |
| `true_labels` | list\<string\> | All correct answers. |
| `all_labels` | list\<string\> | `true_labels` + distractors (shuffled). |
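
Because `all_labels` is documented as `true_labels` plus distractors, a multi-hot target can be derived for each row. The sketch below assumes label strings are unique within a row. The helper names and the sample row are illustrative and are not part of the dataset.

```python
def multi_hot_targets(row: dict) -> list[int]:
    """Return 1 for each entry of all_labels that is a true label, else 0."""
    true_set = set(row["true_labels"])
    return [1 if label in true_set else 0 for label in row["all_labels"]]


def distractors(row: dict) -> list[str]:
    """Labels in all_labels that are not marked true, in their shuffled order."""
    true_set = set(row["true_labels"])
    return [label for label in row["all_labels"] if label not in true_set]


# Made-up row that follows the documented schema.
row = {
    "text": "If it rains, the street gets wet. It is raining.",
    "true_labels": ["the street is wet"],
    "all_labels": ["the street is dry", "the street is wet", "it is snowing"],
}

targets = multi_hot_targets(row)
false_labels = distractors(row)
```

The target vector is aligned with `all_labels`, so it can be compared position by position with per-label scores from a multi-label classifier.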
## Quick load
```python
from datasets import load_dataset
ds = load_dataset("knowledgator/gliclass-v3-logic-dataset")["train"]
```
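
Once loaded, rows can be sanity-checked against the properties stated in this card: 1 to 50 true labels, 1 to 50 false labels, and every true label present in `all_labels`. This is a minimal sketch. `check_row` is a hypothetical helper, and it is demonstrated on hand-written rows so that it runs without downloading the data.

```python
def check_row(row: dict, lo: int = 1, hi: int = 50) -> bool:
    """Check one row against the label-count ranges documented in the card."""
    true_set = set(row["true_labels"])
    all_set = set(row["all_labels"])
    n_true = len(true_set)
    n_false = len(all_set - true_set)  # distractors only
    return (
        isinstance(row["text"], str)
        and true_set <= all_set        # every true label must be a candidate
        and lo <= n_true <= hi
        and lo <= n_false <= hi
    )


# Made-up rows: the first follows the schema, the second has no distractors.
good_row = {
    "text": "Every square is a rectangle.",
    "true_labels": ["geometry"],
    "all_labels": ["cooking", "geometry"],
}
bad_row = {"text": "x", "true_labels": ["a"], "all_labels": ["a"]}

ok = check_row(good_row)
bad = check_row(bad_row)
```

With the split loaded as `ds`, an expression such as `sum(not check_row(r) for r in ds)` would count the rows that deviate from the documented ranges.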
## Citation
```bibtex
@misc{stepanov2025gliclassgeneralistlightweightmodel,
title={GLiClass: Generalist Lightweight Model for Sequence Classification Tasks},
author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov and Alexander Yavorskyi and Mykyta Yaroshenko},
year={2025},
eprint={2508.07662},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.07662},
}
```

Provider: maas
Created: 2025-07-19



