isaacus/echr-retrieval

Name: isaacus/echr-retrieval
Creator: isaacus
Published: 2026-02-20 07:58:53
License: not specified in this listing (the dataset card below states CC BY 4.0)

Source: Hugging Face. Updated: 2026-02-20. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/isaacus/echr-retrieval

Resource description:

---
pretty_name: ECHR Retrieval
task_categories:
- text-retrieval
- summarization
- text-ranking
tags:
- legal
- law
- judicial
- eu
source_datasets:
- HUDOC
language:
- en
annotations_creators:
- found
language_creators:
- found
license: cc-by-4.0
size_categories:
- n<1K
dataset_info:
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_examples: 200
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 200
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 200
configs:
- config_name: default
  data_files:
  - split: test
    path: default.jsonl
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.jsonl
- config_name: queries
  data_files:
  - split: queries
    path: queries.jsonl
---

# ECHR Retrieval 🏛️

**ECHR Retrieval** by [Isaacus](https://isaacus.com/) is a challenging legal information retrieval evaluation dataset consisting of 200 short summaries of findings of European Court of Human Rights decisions paired with the text of those decisions sourced from the [HUDOC](https://hudoc.echr.coe.int/) database.

This dataset is intended to stress test the ability of an information retrieval model to retrieve relevant court decisions given arbitrary legal holdings.

This dataset forms part of the [Massive Legal Embeddings Benchmark (MLEB)](https://isaacus.com/mleb), the largest, most diverse, and most comprehensive benchmark for legal text embedding models. ECHR Retrieval was added to MLEB on 20 February 2026.

## Structure 🗂️

As per the MTEB information retrieval dataset format, this dataset comprises three splits, `default`, `corpus` and `queries`.

The `default` split pairs summaries (`query-id`) with decisions (`corpus-id`), each pair having a `score` of 1.

The `corpus` split contains European Court of Human Rights decisions, with the text of decisions being stored in the `text` key and their ids being stored in the `_id` key. There is also a `title` column which is deliberately set to an empty string in all cases for compatibility with the [`mteb`](https://github.com/embeddings-benchmark/mteb) library.

The `queries` split contains summaries of the findings of decisions, with the text of summaries being stored in the `text` key and their ids being stored in the `_id` key.

## Methodology 🧪

This dataset was constructed by collecting all publicly available European Court of Human Rights decisions, cleaning them, and then sampling 200 summary-decision pairs for inclusion in this dataset.

## License 📜

This dataset is licensed under [CC BY 4.0](https://choosealicense.com/licenses/cc-by-4.0/) which allows for both non-commercial and commercial use of this dataset as long as appropriate attribution is made to it.

## Citation 🔖

If you use this dataset, please cite the [Massive Legal Embeddings Benchmark (MLEB)](https://arxiv.org/abs/2510.19365):

```bibtex
@misc{butler2025massivelegalembeddingbenchmark,
  title={The Massive Legal Embedding Benchmark (MLEB)},
  author={Umar Butler and Abdur-Rahman Butler and Adrian Lucas Malec},
  year={2025},
  eprint={2510.19365},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.19365},
}
```
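## Usage sketch

To make the three-table layout described under Structure concrete, the sketch below joins relevance rows from `default` into a query-to-decision mapping and scores a ranking against it. It is only an illustration under stated assumptions: the ids, texts, and rankings are hypothetical toy values, the helper names `build_qrels` and `hit_rate_at_k` are invented here, and hit rate is a stand-in metric that is not necessarily the one MLEB reports. Per the YAML metadata, `default`, `corpus` and `queries` are config names whose splits are `test`, `corpus` and `queries` respectively.

```python
from collections import defaultdict

# Toy rows mirroring the three tables; all ids and texts are hypothetical.
# Assumption: with the Hugging Face `datasets` library installed, the real rows
# could be loaded with calls such as
#   load_dataset("isaacus/echr-retrieval", "corpus", split="corpus")
corpus = [
    {"_id": "d1", "title": "", "text": "Decision text A"},
    {"_id": "d2", "title": "", "text": "Decision text B"},
    {"_id": "d3", "title": "", "text": "Decision text C"},
]
queries = [
    {"_id": "q1", "text": "Summary of findings for A"},
    {"_id": "q2", "text": "Summary of findings for B"},
]
default = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1.0},
    {"query-id": "q2", "corpus-id": "d2", "score": 1.0},
]


def build_qrels(rows):
    """Group relevance rows as {query_id: {corpus_id: score}}."""
    qrels = defaultdict(dict)
    for row in rows:
        qrels[row["query-id"]][row["corpus-id"]] = row["score"]
    return dict(qrels)


def hit_rate_at_k(qrels, rankings, k):
    """Fraction of queries with at least one relevant decision in the top k."""
    hits = 0
    for query_id, relevant in qrels.items():
        top_k = rankings.get(query_id, [])[:k]
        if any(doc_id in relevant for doc_id in top_k):
            hits += 1
    return hits / len(qrels)


qrels = build_qrels(default)

# Hypothetical model output: decision ids ordered by descending similarity.
rankings = {"q1": ["d1", "d3", "d2"], "q2": ["d3", "d2", "d1"]}

print(hit_rate_at_k(qrels, rankings, k=1))  # 0.5
print(hit_rate_at_k(qrels, rankings, k=2))  # 1.0
```

Because each summary is paired with exactly one decision, hit rate at k coincides with recall at k here; a rank-aware metric would additionally reward placing the matching decision higher in the list.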

Provider:

isaacus