alea-institute/kl3m-data-sample-004-shuffled

Name: alea-institute/kl3m-data-sample-004-shuffled
Creator: alea-institute
Published: 2025-11-11 01:04:13
License: No description available

Hugging Face: updated 2025-11-11, indexed 2025-11-15

Download link:

https://hf-mirror.com/datasets/alea-institute/kl3m-data-sample-004-shuffled


Description:

KL3M Data Sample 004 (Shuffled): This dataset contains a shuffled sample of 10 million examples from the KL3M Data Project, suitable for training language models in the legal, regulatory, and government domains. It includes compressed documents from authoritative sources such as court opinions, government regulatory materials, corporate filings, intellectual property records, legislative texts, and general government publications, totaling approximately 28 TB. Each example has three features: a unique identifier, the MIME type of the source content, and the text content. The dataset is split into a training set with 100,000 examples. It is stored as Parquet files, approximately 460 GB uncompressed, with a compressed download size of approximately 182 GB.
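Given the download size, streaming a few examples may be more practical than fetching every Parquet file. The sketch below is one way to do that with the Hugging Face `datasets` library. The split name `train` and the field name `mime_type` are assumptions based on the description above, not confirmed column names, and the helper names `stream_examples` and `count_mime_types` are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator

DATASET_ID = "alea-institute/kl3m-data-sample-004-shuffled"


def stream_examples(dataset_id: str = DATASET_ID, split: str = "train") -> Iterator[dict]:
    """Lazily iterate over examples without downloading the full dataset.

    The split name "train" is an assumption; check the dataset card.
    """
    # Imported here so count_mime_types can be used without `datasets` installed.
    from datasets import load_dataset

    return iter(load_dataset(dataset_id, split=split, streaming=True))


def count_mime_types(
    examples: Iterable[dict], limit: int, field: str = "mime_type"
) -> Dict[str, int]:
    """Tally the MIME type of the first `limit` examples.

    The field name "mime_type" is an assumption; examples that lack it
    are counted under "unknown".
    """
    counts: Counter = Counter()
    for example in islice(examples, limit):
        counts[example.get(field, "unknown")] += 1
    return dict(counts)


if __name__ == "__main__":
    # Requires network access and `pip install datasets`.
    print(count_mime_types(stream_examples(), limit=1000))
```

If the actual column names differ, inspecting the keys of the first streamed example and passing the right name as `field` should be enough to adapt the sketch.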

Provider:

alea-institute
