Pile of Law
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/breakend/pileoflaw
Description:
This dataset combines multiple datasets, consists primarily of legal documents, and was assembled for pre-training DALE. It is also used for unsupervised text denoising and legal language modeling. It contains approximately 4.1 million documents, about 48 GB of data in total. Its intended task is pre-training for generative data augmentation.



