alea-institute/kl3m-data-cap
Source: Hugging Face | Updated: 2025-04-11 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/alea-institute/kl3m-data-cap
Description:
The KL3M Dataset is a set of copyright-clean resources provided by the ALEA Institute for training large language models. It contains over 132 million documents and trillions of tokens drawn from 16 different sources, all of which comply with the strict copyright and licensing protocol detailed by the project. The dataset is stored as Parquet files and uses the kl3m-004-128k-cased tokenizer, which is optimized for legal, financial, and enterprise documents.
Provider:
alea-institute
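
Because the description says the data is stored as Parquet files, a quick way to inspect it is to stream a few records rather than download everything. The following is a minimal sketch, not an official loading recipe: it assumes the third-party `datasets` package is installed, that network access is available, and that the repository exposes a split named "train". The helper names `peek_records` and `describe` are hypothetical, and no column names are assumed.

```python
from itertools import islice

REPO_ID = "alea-institute/kl3m-data-cap"  # dataset name taken from this listing


def peek_records(repo_id=REPO_ID, n=3, split="train"):
    """Stream the first n records without downloading the whole dataset.

    Assumptions: the `datasets` package is installed, the network is
    reachable, and a split named "train" exists in the repository.
    """
    # Imported lazily so the helpers below work without the dependency.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


def describe(record):
    """Map each column name in a record to the Python type name of its value."""
    return {name: type(value).__name__ for name, value in record.items()}
```

Calling `peek_records()` and passing each result to `describe` prints nothing by itself but returns the column names and value types actually present, which is a safer starting point than guessing the schema.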



