mlfoundations/MINT-1T-ArXiv
Source: Hugging Face. Updated 2024-09-19. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/mlfoundations/MINT-1T-ArXiv
Description:
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, a 10x scale-up over existing open-source datasets. It is designed to facilitate research in multimodal pretraining and includes previously untapped sources such as PDFs and ArXiv papers: the HTML and PDF documents are extracted from CommonCrawl WARC and WAT files, and the papers come from the ArXiv repository. The dataset was created by a team from the University of Washington in collaboration with Salesforce Research and other academic institutions, including Stanford University, the University of Texas at Austin, and the University of California, Berkeley. Data collection and processing is multi-step, covering document extraction, filtering, image processing, and text processing. The data is filtered for relevance and quality, and effort was made to minimize personal and sensitive information, but users should be aware that the data may still contain such information. The dataset has potential biases, risks, and limitations; users are advised to apply additional filtering suited to their specific use case and to be mindful of inappropriate uses.
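As an illustration of the "additional filtering" the description recommends, the following is a minimal sketch that drops documents whose text contains an email-like string. The record layout (a dict with a `texts` list in which `None` marks an image position), the helper names, and the sample records are all assumptions made for this example, not the dataset's documented schema; the regular expression is a rough heuristic, not a complete detector of personal information.

```python
import re
from typing import Iterable, Iterator

# Rough heuristic for email-like strings; it will miss some addresses
# and may match some non-addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def has_email(doc: dict) -> bool:
    """Return True if any text segment of the document looks like it holds an email.

    Assumed (hypothetical) layout: doc["texts"] is a list of strings,
    with None wherever an image sits in the interleaved sequence.
    """
    return any(bool(t) and EMAIL_RE.search(t) is not None for t in doc.get("texts", []))


def drop_docs_with_emails(docs: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield only the documents with no email-like string."""
    for doc in docs:
        if not has_email(doc):
            yield doc


# Made-up sample records, for illustration only.
sample = [
    {"texts": ["We study interleaved data.", None, "Figure 1 shows the pipeline."]},
    {"texts": ["Contact: jane.doe@example.org", None]},
]

kept = list(drop_docs_with_emails(sample))
```

Because the filter is a generator, it can be applied to a streamed iterable of records without loading everything into memory. The field names would need to be adjusted to whatever the actual records contain.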
Provided by:
mlfoundations



