disi-unibo-nlp/legal-link-eu
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/disi-unibo-nlp/legal-link-eu
---
pretty_name: Legal-Link-EU
language:
- en
task_categories:
- question-answering
task_ids:
- multiple-choice-qa
tags:
- legal
- european-union
- eur-lex
- retrieval-augmented-generation
- sycophancy
- authority-bias
- robustness
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---
# Legal-Link-EU
Legal-Link-EU is a multiple-choice benchmark for evaluating how language models reason over changing legal authority. The dataset is derived from EUR-Lex documents and focuses on relationships between European Union legal acts, such as repeal, correction, obsolescence, and extension of validity or application.
The benchmark is intended for evaluating legal reasoning under retrieval-like conditions. Each example contains a legal question, four answer options, the gold answer, original EUR-Lex context passages, and perturbed context passages. The perturbed passages are designed to look authoritative while supporting an incorrect answer, making the dataset useful for studying whether models defer to false legal evidence when it is presented in a formal source-like style.
## Motivation
Many legal NLP benchmarks evaluate whether a model can match a fact pattern to a legal outcome, classify a clause, or retrieve a relevant authority. These tasks remain important, but they often assume that the relevant legal source is already fixed and valid. In deployed retrieval-augmented systems, that assumption is fragile. A retrieved passage may be superseded, corrected, extended, or formally linked to another act whose legal effect changes the answer.
Legal-Link-EU targets this part of legal reasoning. The benchmark asks whether models can use legal sources as authorities while still tracking the normative relationship between them. This makes it relevant for studying legal QA, retrieval-augmented generation, and broader failures where a model over-trusts authoritative-looking information even when it conflicts with the operative rule.
## Loading
The released dataset has a single `test` split.
```python
from datasets import load_dataset
ds = load_dataset("disi-unibo-nlp/legal-link-eu", split="test")
```
## Field Description
| Field | Type | Description |
|---|---:|---|
| `id` | string | Stable example identifier. The identifier includes the source document id, target document id, and relation type. |
| `question` | string | Legal multiple-choice question. |
| `options` | list[string] | Four candidate answers. |
| `correct_label` | string | Correct option label, one of `A`, `B`, `C`, or `D`. |
| `correct_index` | int | Zero-based index of the correct option. |
| `contexts` | list[string] | Original EUR-Lex context passages used for grounded evaluation. |
| `context_titles` | list[string] | Titles identifying source and target context chunks. |
| `perturbed_contexts` | list[string] | Modified context passages used to test deference to misleading authority. |
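
The `correct_label` and `correct_index` fields identify the same gold option in two forms, so a loader can cross-check them before scoring. The following is a minimal sketch, assuming that labels map to positions in `options` in order (`A` to 0 through `D` to 3); the `record` shown is a hypothetical placeholder, not a row from the dataset.

```python
LABELS = "ABCD"

def gold_option(record: dict) -> str:
    """Return the gold option text after checking that label and index agree."""
    label = record["correct_label"]
    index = record["correct_index"]
    if len(record["options"]) != len(LABELS):
        raise ValueError("expected exactly four options")
    # Assumption: label "A" corresponds to index 0, "B" to 1, and so on.
    if LABELS.index(label) != index:
        raise ValueError(f"label {label!r} does not match index {index}")
    return record["options"][index]

# Hypothetical record with placeholder text, for illustration only.
record = {
    "options": ["first", "second", "third", "fourth"],
    "correct_label": "C",
    "correct_index": 2,
}
print(gold_option(record))
```

Running the same check over every row of the `test` split is a cheap way to catch a mismatch before it silently distorts accuracy numbers.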
## Dataset Statistics
Legal-Link-EU contains 1,127 multiple-choice instances. Each instance has one correct answer and three distractors. The correct-answer positions are near-uniform, with 257 examples labeled A (22.8%), 300 labeled B (26.6%), 274 labeled C (24.3%), and 296 labeled D (26.3%).
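
The percentages above follow directly from the label counts. The short sketch below recomputes them and also derives the accuracy of a trivial "always pick the most frequent label" baseline, which is a useful floor to report next to the 25% uniform-chance level. The baseline is an illustration added here, not a result reported by the dataset authors.

```python
# Gold-label counts as stated in the dataset statistics.
counts = {"A": 257, "B": 300, "C": 274, "D": 296}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# A constant predictor that always outputs the most frequent gold label.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(total, shares)
print(majority_label, f"{majority_baseline:.1%}")
```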
The benchmark is evenly balanced across seven EUR-Lex relation types, with 161 instances per type.
| Relation type | Count | Percentage |
|---|---:|---:|
| `completes` | 161 | 14.3% |
| `corrects` | 161 | 14.3% |
| `extends_application` | 161 | 14.3% |
| `extends_validity` | 161 | 14.3% |
| `implicitly_repeals` | 161 | 14.3% |
| `rendered_obsolete_by` | 161 | 14.3% |
| `repeals` | 161 | 14.3% |
Word-level length statistics are computed over the final benchmark instances.
| Quantity | Mean | Std. | Median | Min | Max |
|---|---:|---:|---:|---:|---:|
| Question length | 109.1 | 17.0 | 109 | 59 | 170 |
| Option length | 28.1 | 7.2 | 29 | 1 | 53 |
| Context length | 2327 | 1345 | 2002 | 256 | 6254 |
| Perturbed context length | 2222 | 1262 | 1925 | 264 | 6082 |
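
Word-level summaries like those in the table above can be approximated with the standard library. This is only a sketch under stated assumptions: it counts whitespace-separated tokens, and the card does not say which tokenization was used or whether context lengths are measured per passage or per instance, so recomputed values may differ from the table. The `sample` texts are placeholders.

```python
import statistics

def word_count(text: str) -> int:
    # Assumption: a "word" is any whitespace-separated token.
    return len(text.split())

def length_summary(texts: list[str]) -> dict:
    """Summarize word counts for a collection of strings."""
    lengths = [word_count(t) for t in texts]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Placeholder strings, for illustration only.
sample = ["one two three", "four five", "six"]
print(length_summary(sample))
```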
The dataset covers 880 distinct document pairs, 762 distinct source documents, and 696 distinct target documents. The temporal range spans 1953 to 2025. Source documents have mean year 2005.6 and median year 2006, while target documents have mean year 2002.5 and median year 2004. Source document types are mostly Regulations (54.3%) and Decisions (35.0%), with Directives (5.9%) and other instruments (4.8%) also represented.
A separate LLM-as-a-jury audit over a 100-item stratified sample assigned a mean reasoning-complexity score of 3.83 on a five-point scale. All audited questions required multi-hop reasoning or above. The distribution was 20% multi-hop, 75% temporal or relational, and 5% complex legal inference.
## Evaluation Use
The dataset supports three common evaluation settings.
| Setting | Input | Purpose |
|---|---|---|
| Closed-book recall | `question`, `options` | Tests whether the model answers from parametric knowledge alone. |
| Grounded legal reasoning | `question`, `options`, `contexts` | Tests whether valid legal context improves the answer. |
| Confidence under misleading authority | `question`, `options`, `perturbed_contexts` | Tests whether the model follows authoritative-looking but false context. |
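
The three settings differ only in which context field, if any, is shown to the model. A minimal sketch of a prompt builder follows. The setting keys (`closed_book`, `grounded`, `perturbed`), the prompt layout, and the example record are all hypothetical choices made for illustration; the card does not prescribe a prompt template.

```python
LABELS = "ABCD"

# Hypothetical setting names mapped to the dataset field that supplies context.
CONTEXT_FIELD = {
    "closed_book": None,
    "grounded": "contexts",
    "perturbed": "perturbed_contexts",
}

def build_prompt(record: dict, setting: str) -> str:
    """Assemble a multiple-choice prompt for one evaluation setting."""
    field = CONTEXT_FIELD[setting]
    parts = []
    if field is not None:
        # Number the passages so the model can tell them apart.
        for i, passage in enumerate(record[field], start=1):
            parts.append(f"[Context {i}]\n{passage}")
    parts.append(f"Question: {record['question']}")
    parts.extend(f"{label}. {opt}" for label, opt in zip(LABELS, record["options"]))
    parts.append("Answer with a single letter (A, B, C, or D).")
    return "\n\n".join(parts)

# Placeholder record, for illustration only.
example = {
    "question": "Which act is currently in force?",
    "options": ["Act one", "Act two", "Act three", "Act four"],
    "contexts": ["original passage"],
    "perturbed_contexts": ["altered passage"],
}
print(build_prompt(example, "grounded"))
```

Keeping the question and option layout identical across settings means that any change in accuracy can be attributed to the context rather than to prompt wording.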
The third setting is the central stress test. Legal systems depend on authoritative sources, but legal authority is not equivalent to surface plausibility. A model that treats any formal citation-like passage as controlling evidence can fail when the passage is outdated, altered, or inconsistent with the operative rule. Legal-Link-EU is designed to expose this failure mode without reducing legal QA to lexical matching.
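
One way to quantify the failure mode described above is to compare predictions for the same items with original and perturbed contexts. The sketch below computes plain accuracy plus a `deference_rate`, defined here as the share of items answered correctly with original contexts that become incorrect with perturbed contexts. This metric and its name are assumptions introduced for illustration, not a measure defined by the dataset authors, and the prediction lists are made-up placeholders.

```python
def accuracy(preds: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def deference_rate(grounded: list[str], perturbed: list[str], gold: list[str]) -> float:
    """Among items correct with original contexts, the share that are wrong with perturbed contexts."""
    eligible = [(p, g) for gr, p, g in zip(grounded, perturbed, gold) if gr == g]
    if not eligible:
        return 0.0
    return sum(p != g for p, g in eligible) / len(eligible)

# Placeholder labels and predictions, for illustration only.
gold = ["A", "B", "C", "D"]
grounded_preds = ["A", "B", "C", "A"]
perturbed_preds = ["A", "C", "D", "A"]

print(accuracy(grounded_preds, gold))
print(accuracy(perturbed_preds, gold))
print(deference_rate(grounded_preds, perturbed_preds, gold))
```

Conditioning on items the model already answers correctly with valid context separates deference to misleading passages from questions the model could not answer in the first place.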
Provider: disi-unibo-nlp



