
ndp64/lex_files

Source: Hugging Face. Updated 2026-03-11. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ndp64/lex_files
Description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- extended
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: LexFiles
tags:
- legal
- law
---

# Dataset Card for "LexFiles"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Specifications](#dataset-specifications)

## Dataset Description

- **Homepage:** https://github.com/coastalcph/lexlms
- **Repository:** https://github.com/coastalcph/lexlms
- **Paper:** https://arxiv.org/abs/2305.07507
- **Point of Contact:** [Ilias Chalkidis](mailto:ilias.chalkidis@di.ku.dk)

### Dataset Summary

LeXFiles is a new, diverse English multinational legal corpus that we created. It includes 11 distinct sub-corpora that cover legislation and case law from 6 primarily English-speaking legal systems (EU, CoE, Canada, US, UK, India). The corpus contains approx. 19 billion tokens. In comparison, the "Pile of Law" corpus released by Henderson et al. (2022) comprises 32 billion tokens in total, where the majority (26/30) of sub-corpora come from the United States of America (USA); hence that corpus as a whole is biased to a significant extent towards the US legal system in general, and the federal or state jurisdiction in particular.

### Dataset Specifications

| Corpus                   | Corpus alias           | Documents | Tokens | Pct.  | Sampl. (a=0.5) [2] | Sampl. (a=0.2) [2] |
|--------------------------|------------------------|-----------|--------|-------|--------------------|--------------------|
| EU Legislation           | `eu-legislation`       | 93.7K     | 233.7M | 1.2%  | 5.0%               | 8.0%               |
| EU Court Decisions       | `eu-court-cases`       | 29.8K     | 178.5M | 0.9%  | 4.3%               | 7.6%               |
| ECtHR Decisions          | `ecthr-cases`          | 12.5K     | 78.5M  | 0.4%  | 2.9%               | 6.5%               |
| UK Legislation           | `uk-legislation`       | 52.5K     | 143.6M | 0.7%  | 3.9%               | 7.3%               |
| UK Court Decisions       | `uk-court-cases`       | 47K       | 368.4M | 1.9%  | 6.2%               | 8.8%               |
| Indian Court Decisions   | `indian-court-cases`   | 34.8K     | 111.6M | 0.6%  | 3.4%               | 6.9%               |
| Canadian Legislation     | `canadian-legislation` | 6K        | 33.5M  | 0.2%  | 1.9%               | 5.5%               |
| Canadian Court Decisions | `canadian-court-cases` | 11.3K     | 33.1M  | 0.2%  | 1.8%               | 5.4%               |
| U.S. Court Decisions [1] | `us-court-cases`       | 4.6M      | 11.4B  | 59.2% | 34.7%              | 17.5%              |
| U.S. Legislation         | `us-legislation`       | 518       | 1.4B   | 7.4%  | 12.3%              | 11.5%              |
| U.S. Contracts           | `us-contracts`         | 622K      | 5.3B   | 27.3% | 23.6%              | 15.0%              |
| Total                    | `lexlms/lex_files`     | 5.8M      | 18.8B  | 100%  | 100%               | 100%               |

[1] We consider only U.S. Court Decisions from 1965 onwards (cf. post Civil Rights Act), as a hard threshold for cases relying on severely out-dated and in many cases harmful law standards. The rest of the corpora include more recent documents.

[2] Sampling (Sampl.) ratios are computed following the exponential sampling introduced by Lample et al. (2019).

Additional corpora, not considered for pre-training since they do not represent factual legal knowledge:

| Corpus                  | Corpus alias | Documents | Tokens |
|-------------------------|--------------|-----------|--------|
| Legal web pages from C4 | `legal-c4`   | 284K      | 340M   |

### Usage

Load a specific sub-corpus by passing its corpus alias, as listed in the tables above.

```python
from datasets import load_dataset

# `name` selects the sub-corpus by its alias, e.g. 'us-court-cases'.
dataset = load_dataset('lexlms/lex_files', name='us-court-cases')
```

### Citation

[*Ilias Chalkidis\*, Nicolas Garneau\*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard.*
*LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.*
*2023. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.*](https://aclanthology.org/2023.acl-long.865/)

```
@inproceedings{chalkidis-etal-2023-lexfiles,
    title = "{L}e{XF}iles and {L}egal{LAMA}: Facilitating {E}nglish Multinational Legal Language Model Development",
    author = "Chalkidis, Ilias and Garneau, Nicolas and Goanta, Catalina and Katz, Daniel and S{\o}gaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.865",
    pages = "15513--15535",
}
```
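The sampling ratios in the Dataset Specifications table (footnote [2]) can be approximated with a short sketch. It assumes the usual form of exponential sampling, in which each sub-corpus share p_i is raised to the power a and the results are renormalised, i.e. q_i = p_i^a / sum_j p_j^a. The token counts are the rounded values from the table, so the output matches the published columns only approximately. The function name `exponential_sampling` and the dictionary `TOKENS_M` are illustrative and not part of the dataset or the `lexlms` code.

```python
# Token counts in millions, copied from the (rounded) Dataset Specifications table.
TOKENS_M = {
    "eu-legislation": 233.7,
    "eu-court-cases": 178.5,
    "ecthr-cases": 78.5,
    "uk-legislation": 143.6,
    "uk-court-cases": 368.4,
    "indian-court-cases": 111.6,
    "canadian-legislation": 33.5,
    "canadian-court-cases": 33.1,
    "us-court-cases": 11400.0,
    "us-legislation": 1400.0,
    "us-contracts": 5300.0,
}


def exponential_sampling(token_counts, a):
    """Return smoothed sampling ratios q_i = p_i**a / sum_j p_j**a.

    Assumption: this is the exponential-smoothing form referred to in
    footnote [2]; a=1 keeps the raw proportions, a=0 samples uniformly.
    """
    total = sum(token_counts.values())
    # Raw share of each sub-corpus in the full corpus.
    shares = {alias: count / total for alias, count in token_counts.items()}
    # Raise each share to the power a, then renormalise so the ratios sum to 1.
    smoothed = {alias: share ** a for alias, share in shares.items()}
    norm = sum(smoothed.values())
    return {alias: value / norm for alias, value in smoothed.items()}


if __name__ == "__main__":
    for a in (0.5, 0.2):
        ratios = exponential_sampling(TOKENS_M, a)
        print(f"a={a}")
        for alias, ratio in ratios.items():
            print(f"  {alias:<22} {ratio:6.1%}")
```

Lower values of a flatten the distribution: the share of `us-court-cases` drops from about 59% of raw tokens to roughly 35% at a=0.5 and roughly 17.5% at a=0.2, which is how the smaller non-US sub-corpora are up-weighted.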
Provider:
ndp64