mlfoundations/MINT-1T-PDF-CC-2023-50
Hugging Face | Updated 2024-09-19 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/mlfoundations/MINT-1T-PDF-CC-2023-50
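A minimal sketch of how this subset might be read through the mirror above without downloading it in full. It assumes the Hugging Face `datasets` library is installed, that the repository is accessible without authentication, and that the data is exposed under a `train` split; none of these are confirmed by this listing. The helper names `dataset_url` and `stream_first_samples` are hypothetical and exist only for illustration.

```python
import os
from itertools import islice

REPO_ID = "mlfoundations/MINT-1T-PDF-CC-2023-50"
MIRROR = "https://hf-mirror.com"  # mirror host taken from the download link above


def dataset_url(repo_id: str, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a repo id on the given endpoint."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def stream_first_samples(repo_id: str = REPO_ID, n: int = 3, endpoint: str = MIRROR):
    """Stream the first n samples instead of downloading the whole dataset.

    Assumption: the split is named "train"; adjust if the repository differs.
    """
    # HF_ENDPOINT is read when huggingface_hub is first imported,
    # so it has to be set before `datasets` is imported.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    from datasets import load_dataset  # imported lazily on purpose

    ds = load_dataset(repo_id, split="train", streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    print(dataset_url(REPO_ID))
    # Uncomment to hit the network (requires `pip install datasets`):
    # for sample in stream_first_samples(n=1):
    #     print(sample.keys())
```

Streaming is used here because the full MINT-1T collection is very large; iterating lazily allows a few samples to be inspected before committing to a download.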
Overview:
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. It is intended to support research on multimodal pretraining and was created by a team from the University of Washington in collaboration with Salesforce Research and other academic institutions. Its sources are HTML documents from CommonCrawl, PDF documents, and ArXiv papers. The data has been extensively filtered and processed to improve relevance and quality. Because it is drawn mainly from the public web, it may still contain sensitive or personal information, so users should exercise caution.
Provided by:
mlfoundations



