mlfoundations/MINT-1T-HTML
Source: Hugging Face | Updated: 2024-09-21 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/mlfoundations/MINT-1T-HTML
Description:
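MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, intended to advance research on multimodal pretraining. It was created by the University of Washington in collaboration with Salesforce Research and other institutions, and draws on several sources, including HTML, PDF, and ArXiv documents. The dataset has undergone rigorous filtering and processing to ensure content quality and safety. Although it is aimed primarily at research, users should be mindful of potential biases, risks, and ethical issues.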
MINT-1T is an open-source multimodal dataset containing 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. It includes previously untapped sources such as PDFs and ArXiv papers. MINT-1T is designed to facilitate research in multimodal pretraining and was created by a team from the University of Washington in collaboration with Salesforce Research and other academic institutions. Its multimodal documents come from various sources, including HTML documents, PDF documents, and ArXiv documents. The dataset documentation covers the data collection, filtering, and processing steps, and identifies potential biases, risks, and limitations. MINT-1T is released under a CC-BY-4.0 license and is intended primarily for research purposes.
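Given the dataset's size, streaming is likely more practical than a full download. The following is a minimal sketch, not an official loader: `stream_documents` relies on the third-party `datasets` package and its `load_dataset(..., streaming=True)` option, and the split name `"train"` is an assumption. The `iter_interleaved` helper and the `texts`/`images` field layout are hypothetical, used only to illustrate what walking an interleaved document might look like; check the dataset card for the actual schema.

```python
from typing import Any, Dict, Iterator, Tuple

REPO_ID = "mlfoundations/MINT-1T-HTML"  # dataset name taken from this listing


def stream_documents(repo_id: str = REPO_ID, split: str = "train"):
    """Lazily stream documents instead of downloading the whole dataset.

    Requires the third-party `datasets` package and network access.
    ASSUMPTION: the split is named "train"; verify on the dataset card.
    """
    # Imported here so the helper below can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def iter_interleaved(doc: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield ("text", value) or ("image", value) items in document order.

    ASSUMPTION (hypothetical layout): `doc` holds two parallel lists,
    "texts" and "images", where each position carries either a text
    string or an image reference and None in the other list.
    """
    for text, image in zip(doc.get("texts", []), doc.get("images", [])):
        if text is not None:
            yield ("text", text)
        elif image is not None:
            yield ("image", image)


# Hand-made record for illustration only (not real data from the dataset).
example = {
    "texts": ["Intro paragraph.", None, "Caption."],
    "images": [None, "https://example.com/a.jpg", None],
}
items = list(iter_interleaved(example))
```

With network access, `for doc in stream_documents(): ...` would iterate over records one at a time; whether `iter_interleaved` applies to them unchanged depends on the real field names.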
Provider:
mlfoundations



