Pile of Law

Source: arXiv; indexed 2025-09-30

Download link:

https://github.com/breakend/pileoflaw


Resource overview:

This dataset combines multiple source datasets, consists mainly of legal documents, and is used specifically to pre-train DALE. It is also used for unsupervised text denoising and legal language modeling. It contains approximately 4.1 million documents, totaling about 48 GB. Its target task is pre-training for generative data augmentation.
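As a rough sanity check on the scale figures above, the two quoted numbers imply an average document size on the order of 12 KB. The sketch below is only arithmetic on those approximate figures; the function name `average_document_size_kb` is a hypothetical helper, and whether "GB" here means decimal or binary units is an assumption, so both are shown.

```python
# Approximate figures quoted in the description; neither is exact.
NUM_DOCUMENTS = 4_100_000
TOTAL_SIZE_GB = 48


def average_document_size_kb(total_gb: float, num_docs: int, binary: bool = False) -> float:
    """Return the mean document size in kB (decimal) or KiB (binary)."""
    bytes_per_gb = 2**30 if binary else 10**9
    bytes_per_kb = 2**10 if binary else 10**3
    return total_gb * bytes_per_gb / num_docs / bytes_per_kb


# The source does not say which unit convention it uses, so compute both.
decimal_kb = average_document_size_kb(TOTAL_SIZE_GB, NUM_DOCUMENTS)
binary_kib = average_document_size_kb(TOTAL_SIZE_GB, NUM_DOCUMENTS, binary=True)

print(f"{decimal_kb:.1f} kB per document (decimal units)")
print(f"{binary_kib:.1f} KiB per document (binary units)")
```

Because both inputs are rounded, the result is an order-of-magnitude estimate rather than a measured statistic; individual legal documents are likely to vary widely around it.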
