alea-institute/kl3m-data-edgar-10-k
Source: Hugging Face. Updated: 2025-04-11. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/alea-institute/kl3m-data-edgar-10-k
Description:
The KL3M dataset, provided by the ALEA Institute, is a training resource for large language models. It contains over 132 million documents, each verified for copyright and license terms, and trillions of tokens. It is aimed at legal, financial, and enterprise domains; the data is stored as Parquet files and tokenized with the kl3m-004-128k-cased tokenizer.
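Because the description says the data is stored as Parquet and already tokenized, a short sketch of how one might peek at a few records could be useful. This is a minimal illustration, not the publisher's documented workflow: it assumes the Hugging Face `datasets` library is installed, that the repository exposes a `train` split, and that each record carries its token ids in a field named `tokens`. The split name and the field name are assumptions and are not stated in this listing.

```python
from itertools import islice

# Repository id taken from the listing above.
REPO_ID = "alea-institute/kl3m-data-edgar-10-k"


def token_count(record, tokens_field="tokens"):
    """Return the number of token ids in one record.

    The field name "tokens" is an assumption; a missing or empty
    field is counted as zero tokens.
    """
    return len(record.get(tokens_field) or [])


def preview(records, n=3, tokens_field="tokens"):
    """Return token counts for the first n records of any iterable of dicts."""
    return [token_count(r, tokens_field) for r in islice(records, n)]


def stream_records(repo_id=REPO_ID, split="train"):
    """Stream records from the Hugging Face Hub without a full download.

    Needs `pip install datasets` and network access. The split name
    "train" is an assumption. The import is done lazily so the helpers
    above can be used offline.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

With network access, `preview(stream_records())` would list the token counts of the first three records, provided the assumed split and field names hold. Decoding the ids back to text would additionally need the kl3m-004-128k-cased tokenizer, which is not shown here.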
Provider:
alea-institute



