mlfoundations/MINT-1T-PDF-CC-2023-40

Name: mlfoundations/MINT-1T-PDF-CC-2023-40
Creator: mlfoundations
Published: 2024-09-19 21:06:59
License: No description available

Source: Hugging Face. Updated 2024-09-19; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40
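
The link above points at a mirror of the Hugging Face Hub, so the dataset can presumably be streamed through the `datasets` library rather than downloaded in full. The following is a minimal sketch under stated assumptions, not an official loading recipe: the helper names `dataset_page_url` and `peek` are hypothetical, the `train` split name is an assumption, and the record schema is not documented on this page, so the sketch returns raw records for inspection instead of relying on any field names.

```python
import itertools
import os

REPO_ID = "mlfoundations/MINT-1T-PDF-CC-2023-40"
MIRROR = "https://hf-mirror.com"  # mirror host taken from the download link


def dataset_page_url(repo_id: str = REPO_ID, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given Hub endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def peek(n: int = 2, endpoint: str = MIRROR) -> list:
    """Stream the first n records without downloading the whole dataset."""
    # HF_ENDPOINT is read when the Hugging Face libraries are first imported,
    # so it is set before the import below. It has no effect if those
    # libraries were already imported earlier in the same process.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    from datasets import load_dataset  # third-party: pip install datasets

    # Assumption: the repository exposes a "train" split.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(itertools.islice(stream, n))
```

Calling `peek()` requires network access and the `datasets` package. Printing `sorted(record.keys())` for each returned record is a reasonable first step for checking the actual schema before writing any processing code.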

Overview:

MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, about 10 times the scale of existing open-source datasets. It draws on sources that were previously underused, including PDFs, HTML, and ArXiv papers: the HTML and PDF documents come from CommonCrawl and the papers from ArXiv. MINT-1T was created to facilitate research in multimodal pretraining by a team from the University of Washington in collaboration with Salesforce Research and other academic institutions. Data processing covers document extraction, filtering, image processing, text processing, and PDF-specific processing, with the aim of ensuring data quality and relevance. The dataset is intended for pretraining multimodal models; it is not recommended for processing or generating personally identifiable information or for military applications. Potential risks include data bias, content risks, image availability, and PDF parsing limitations.

Provider:

mlfoundations
