alea-institute/kl3m-data-govinfo-hob
Source: Hugging Face | Updated: 2025-04-11 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/alea-institute/kl3m-data-govinfo-hob
Resource overview:
The KL3M Data Project, run by the ALEA Institute, provides copyright-clean training resources for large language models. The dataset comprises more than 132 million documents and trillions of tokens drawn from 16 sources, each verified against strict copyright and licensing protocols. It is stored as Parquet files containing document text and metadata, and is released under CC BY 4.0. It also includes the original document formats, extracted content in standardized formats, pre-tokenized document representations, and various training resources and tools.
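Because the data is published as Parquet files on Hugging Face, one way to inspect it is to stream a few records with the `datasets` library rather than download every shard. The following is a minimal sketch, not an official loader: the split name `train` is an assumption, the helper names `stream_records` and `peek` are hypothetical, and the column names are discovered at run time because this listing does not state them.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Repository ID taken from the listing title.
REPO_ID = "alea-institute/kl3m-data-govinfo-hob"


def stream_records(repo_id: str = REPO_ID, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily yield records from the Hub (hypothetical helper).

    The split name "train" is an assumption; check the dataset page.
    """
    # Imported here so that peek() below works without the library installed.
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split=split, streaming=True))


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, str]]:
    """Return column names and Python value types for the first n records."""
    return [
        {key: type(value).__name__ for key, value in record.items()}
        for record in islice(records, n)
    ]


if __name__ == "__main__":
    # Requires network access and `pip install datasets`.
    for schema in peek(stream_records(), n=1):
        print(schema)
```

Streaming keeps memory use small and avoids committing to a schema up front, which is useful when the fields holding the document text and metadata are not known in advance.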
Provider:
alea-institute



