
alea-institute/kl3m-data-uspto

Source: Hugging Face. Updated 2025-04-11; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/alea-institute/kl3m-data-uspto
Description:

The KL3M dataset is a set of copyright-clean training resources provided by the ALEA Institute for the training of large language models, containing over 132 million documents and trillions of tokens from 16 different sources. The dataset uses the Parquet format to store document text and metadata, and is licensed under CC BY 4.0. It includes source code for acquiring and processing documents, original document formats with metadata, standardized content, pre-tokenized documents, and various training resources for question-answering, summarization, conversion, drafting, classification, prediction, and conversational applications.
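Because the description says the documents are stored as Parquet on Hugging Face, one plausible way to inspect a few rows without downloading everything is to stream the dataset. The following is a minimal sketch, not the ALEA Institute's own tooling: it assumes the Hugging Face `datasets` library is installed, and the split name `"train"` and the helper names `peek` and `stream_kl3m` are illustrative assumptions rather than anything stated in the listing.

```python
from itertools import islice

# Repository name as given in this listing.
DATASET_ID = "alea-institute/kl3m-data-uspto"


def peek(records, n=3):
    """Return the first n rows from any iterable of records."""
    return list(islice(records, n))


def stream_kl3m(dataset_id=DATASET_ID, split="train"):
    """Open the dataset in streaming mode (hypothetical helper).

    The split name "train" is an assumption; check the dataset page
    for the splits that actually exist.
    """
    # Imported lazily so that peek() works without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the Parquet shards lazily instead of
    # downloading the whole dataset first.
    return load_dataset(dataset_id, split=split, streaming=True)
```

Calling `peek(stream_kl3m())` would then fetch only the first few rows over the network. The listing does not name the columns, so printing `row.keys()` for each returned row is a reasonable first step before relying on any particular field.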
Provider:
alea-institute