
isaacus/echr-retrieval

Source: Hugging Face. Updated 2026-02-20. Indexed 2026-04-05.
Download link: https://hf-mirror.com/datasets/isaacus/echr-retrieval
Resource description:
---
pretty_name: ECHR Retrieval
task_categories:
- text-retrieval
- summarization
- text-ranking
tags:
- legal
- law
- judicial
- eu
source_datasets:
- HUDOC
language:
- en
annotations_creators:
- found
language_creators:
- found
license: cc-by-4.0
size_categories:
- n<1K
dataset_info:
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_examples: 200
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 200
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 200
configs:
- config_name: default
  data_files:
  - split: test
    path: default.jsonl
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.jsonl
- config_name: queries
  data_files:
  - split: queries
    path: queries.jsonl
---

# ECHR Retrieval 🏛️

**ECHR Retrieval** by [Isaacus](https://isaacus.com/) is a challenging legal information retrieval evaluation dataset consisting of 200 short summaries of findings of European Court of Human Rights decisions paired with the text of those decisions sourced from the [HUDOC](https://hudoc.echr.coe.int/) database.

This dataset is intended to stress test the ability of an information retrieval model to retrieve relevant court decisions given arbitrary legal holdings.

This dataset forms part of the [Massive Legal Embeddings Benchmark (MLEB)](https://isaacus.com/mleb), the largest, most diverse, and most comprehensive benchmark for legal text embedding models. ECHR Retrieval was added to MLEB on 20 February 2026.

## Structure 🗂️

As per the MTEB information retrieval dataset format, this dataset comprises three splits, `default`, `corpus` and `queries`.

The `default` split pairs summaries (`query-id`) with decisions (`corpus-id`), each pair having a `score` of 1.
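To make the pairing concrete, the sketch below groups rows shaped like the `default` split into a lookup from each query id to its relevant corpus ids. The sample rows and their ids are hypothetical placeholders, not values taken from the dataset, and the helper name `build_relevance_lookup` is an assumption made for illustration; only the field names (`query-id`, `corpus-id`, `score`) come from the dataset's declared features.

```python
from collections import defaultdict

# Hypothetical rows shaped like the `default` split; the ids are made up.
sample_qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1.0},
    {"query-id": "q2", "corpus-id": "d2", "score": 1.0},
]


def build_relevance_lookup(qrels):
    """Map each query id to {corpus id: score} for pairs with a positive score."""
    lookup = defaultdict(dict)
    for row in qrels:
        if row["score"] > 0:
            lookup[row["query-id"]][row["corpus-id"]] = row["score"]
    return dict(lookup)


relevance = build_relevance_lookup(sample_qrels)
print(relevance)
```

A nested dictionary of this kind is a common way to hand relevance judgments to retrieval evaluation code, which is why the sketch keeps the score rather than reducing each pair to a boolean.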
The `corpus` split contains European Court of Human Rights decisions, with the text of decisions being stored in the `text` key and their ids being stored in the `_id` key. There is also a `title` column which is deliberately set to an empty string in all cases for compatibility with the [`mteb`](https://github.com/embeddings-benchmark/mteb) library.

The `queries` split contains summaries of the findings of decisions, with the text of summaries being stored in the `text` key and their ids being stored in the `_id` key.

## Methodology 🧪

This dataset was constructed by collecting all publicly available European Court of Human Rights decisions, cleaning them, and then sampling 200 summary-decision pairs for inclusion in this dataset.

## License 📜

This dataset is licensed under [CC BY 4.0](https://choosealicense.com/licenses/cc-by-4.0/) which allows for both non-commercial and commercial use of this dataset as long as appropriate attribution is made to it.

## Citation 🔖

If you use this dataset, please cite the [Massive Legal Embeddings Benchmark (MLEB)](https://arxiv.org/abs/2510.19365):

```bibtex
@misc{butler2025massivelegalembeddingbenchmark,
      title={The Massive Legal Embedding Benchmark (MLEB)},
      author={Umar Butler and Abdur-Rahman Butler and Adrian Lucas Malec},
      year={2025},
      eprint={2510.19365},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.19365},
}
```
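Returning to the structure described earlier, the following is a minimal sketch of how the three configs might be loaded and joined into (summary, decision) text pairs. It assumes the Hugging Face `datasets` package is installed and that the repository id `isaacus/echr-retrieval` resolves on the Hub. The helper names `load_echr_retrieval` and `pair_texts` and the demo rows are hypothetical; the config, split, and field names are the ones declared in the dataset metadata.

```python
def load_echr_retrieval(repo_id="isaacus/echr-retrieval"):
    """Load the three configs; needs the `datasets` package and network access."""
    # Imported lazily so the offline helper below works without the package.
    from datasets import load_dataset

    qrels = load_dataset(repo_id, "default", split="test")
    corpus = load_dataset(repo_id, "corpus", split="corpus")
    queries = load_dataset(repo_id, "queries", split="queries")
    return qrels, corpus, queries


def pair_texts(qrels, corpus, queries):
    """Resolve the ids in the qrels to (summary text, decision text) tuples."""
    corpus_by_id = {row["_id"]: row["text"] for row in corpus}
    queries_by_id = {row["_id"]: row["text"] for row in queries}
    return [
        (queries_by_id[row["query-id"]], corpus_by_id[row["corpus-id"]])
        for row in qrels
    ]


# Offline demonstration with hypothetical rows in the same shape as the dataset.
demo_corpus = [{"_id": "d1", "title": "", "text": "Full text of a decision."}]
demo_queries = [{"_id": "q1", "text": "Summary of the finding."}]
demo_qrels = [{"query-id": "q1", "corpus-id": "d1", "score": 1.0}]
demo_pairs = pair_texts(demo_qrels, demo_corpus, demo_queries)
print(demo_pairs)
```

With network access, `pair_texts(*load_echr_retrieval())` should yield one pair per row of the `default` config. The `title` field is ignored here because the card states it is always an empty string.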
Provider: isaacus