isaacus/mleb-legal-rag-bench

Name: isaacus/mleb-legal-rag-bench
Creator: isaacus
Published: 2026-02-20 02:53:23
License: cc-by-nc-4.0

Source: Hugging Face. Updated 2026-02-20. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/isaacus/mleb-legal-rag-bench

Description:

---
pretty_name: Legal RAG Bench (MLEB version)
task_categories:
- text-retrieval
- question-answering
- text-ranking
tags:
- legal
- law
- australia
source_datasets:
- isaacus/legal-rag-bench
language:
- en
language_details: en-AU
annotations_creators:
- expert-generated
- found
language_creators:
- expert-generated
- found
license: cc-by-nc-4.0
size_categories:
- 1K<n<10K
dataset_info:
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_examples: 100
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 4876
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 100
configs:
- config_name: default
  data_files:
  - split: test
    path: default.jsonl
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.jsonl
- config_name: queries
  data_files:
  - split: queries
    path: queries.jsonl
---

# Legal RAG Bench (MLEB version)

This is the version of the [Legal RAG Bench](https://huggingface.co/datasets/isaacus/legal-rag-bench) evaluation dataset used in the [Massive Legal Embedding Benchmark (MLEB)](https://isaacus.com/mleb) by [Isaacus](https://isaacus.com/).

This dataset tests the ability of information retrieval models to retrieve passages relevant to complex, meaningfully challenging, reasoning-intensive questions about Victorian criminal law.

If you are looking for Legal RAG Bench proper, you may find it [here](https://huggingface.co/datasets/isaacus/legal-rag-bench).

## Structure 🗂️

As per the MTEB information retrieval dataset format, this dataset comprises three splits: `default`, `corpus` and `queries`.

The `default` split pairs questions (`query-id`) with relevant passages (`corpus-id`), each pair having a `score` of 1.

The `corpus` split contains passages, with the text of a passage being stored in the `text` key and its id being stored in the `_id` key. There is also a `title` column, which is deliberately set to an empty string in all cases for compatibility with the [`mteb`](https://github.com/embeddings-benchmark/mteb) library.

The `queries` split contains questions, with the text of a question being stored in the `text` key and its id being stored in the `_id` key.

## Methodology 🧪

To understand how Legal RAG Bench itself was created, refer to its [documentation](https://huggingface.co/datasets/isaacus/legal-rag-bench).

This dataset was formatted by taking the test split of Legal RAG Bench, treating questions as anchors and relevant passages as positive passages, and adding irrelevant passages to the global passage corpus.

## License 📜

This dataset is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en), which allows for non-commercial use of this dataset as long as appropriate attribution is made to it.

## Citation 🔖

If you use this dataset, please cite the [Massive Legal Embedding Benchmark (MLEB)](https://arxiv.org/abs/2510.19365) as well.

A preprint for Legal RAG Bench will be published before 27 February 2026 at the latest.

```
@misc{butler2026legalragbench,
  title={Legal RAG Bench},
  author={Abdur-Rahman Butler and Umar Butler},
  year={2026},
  note={Preprint forthcoming.}
}

@misc{butler2025massivelegalembeddingbenchmark,
  title={The Massive Legal Embedding Benchmark (MLEB)},
  author={Umar Butler and Abdur-Rahman Butler and Adrian Lucas Malec},
  year={2025},
  eprint={2510.19365},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.19365},
}
```
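The layout described under Structure can be exercised without any retrieval library. The following is a minimal sketch, not the MLEB evaluation code: it groups `default`-style rows (`query-id`, `corpus-id`, `score`) into a relevance map and scores a ranking against it. The helper names `build_qrels` and `recall_at_k`, the toy ids, and the choice of recall@k as the metric are all assumptions made for illustration.

```python
from collections import defaultdict


def build_qrels(pairs):
    """Group rows shaped like the `default` split into {query_id: {corpus_id: score}}."""
    qrels = defaultdict(dict)
    for row in pairs:
        qrels[row["query-id"]][row["corpus-id"]] = row["score"]
    return dict(qrels)


def recall_at_k(qrels, rankings, k):
    """Mean fraction of each query's relevant passages that appear in its top-k ranking."""
    per_query = []
    for query_id, relevant in qrels.items():
        positives = {cid for cid, score in relevant.items() if score > 0}
        if not positives:
            continue  # skip queries with no positive passages
        top_k = rankings.get(query_id, [])[:k]
        per_query.append(len(positives.intersection(top_k)) / len(positives))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Hypothetical toy rows; real ids come from the dataset's JSONL files.
toy_pairs = [
    {"query-id": "q1", "corpus-id": "p1", "score": 1.0},
    {"query-id": "q1", "corpus-id": "p3", "score": 1.0},
    {"query-id": "q2", "corpus-id": "p2", "score": 1.0},
]
# Hypothetical ranked passage ids that some retriever returned per query.
toy_rankings = {"q1": ["p1", "p2", "p3"], "q2": ["p3", "p1", "p2"]}

toy_qrels = build_qrels(toy_pairs)
print(recall_at_k(toy_qrels, toy_rankings, k=2))
```

With the Hugging Face `datasets` library installed, the real rows could presumably be obtained with a call such as `load_dataset("isaacus/mleb-legal-rag-bench", "default", split="test")` and passed to `build_qrels` in place of `toy_pairs`; the config and split names here are taken from the metadata above.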

Provider:

isaacus
