MINT-1T-ArXiv

ModelScope Community, updated 2025-12-05, indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/mlfoundations/MINT-1T-ArXiv
Resource description:
<h1 align="center"> 🍃 MINT-1T:<br>Scaling Open-Source Multimodal Data by 10x:<br> A Multimodal Dataset with One Trillion Tokens </h1>

🍃 MINT-1T is an open-source **M**ultimodal **INT**erleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in collaboration with Salesforce Research and other academic institutions, including Stanford University, the University of Texas at Austin, and the University of California, Berkeley.

You are currently viewing the ArXiv subset of 🍃 MINT-1T. For HTML and PDF subsets, please refer to the [🍃 MINT-1T collection](https://huggingface.co/collections/mlfoundations/mint-1t-6690216ca4d0df7e518dde1c).

![Examples](interleaved-example-twitter.png)

## Dataset Details

### Dataset Sources

- **Repository:** https://github.com/mlfoundations/MINT-1T
- **Paper:** https://arxiv.org/abs/2406.11271
- **Blog:** https://blog.salesforceairesearch.com/mint-1t/

## Uses

### Direct Use

🍃 MINT-1T is designed to facilitate research in multimodal pretraining. The dataset can be used for training multimodal models that can reason about interleaved text and image sequences, such as [Idefics2](https://huggingface.co/HuggingFaceM4/idefics2-8b), [XGen-MM](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1), and [Chameleon](https://huggingface.co/facebook/chameleon-30b).

### Out-of-Scope Use

🍃 MINT-1T was built to make research into large multimodal models more accessible. Training models that ingest or generate personally identifying information (such as images of people's faces and other sensitive content) and military applications are both inappropriate use cases of 🍃 MINT-1T.

## Dataset Creation

### Curation Rationale

🍃 MINT-1T was created to address a significant gap in the open-source domain by providing a large-scale multimodal interleaved dataset for pre-training large multimodal models. This dataset aims to be a valuable resource for the research community, facilitating open science in multimodal pretraining.

### Source Data

The dataset is a comprehensive collection of multimodal documents from various sources:

- HTML documents: Filtered from CommonCrawl WARC dumps spanning from 2017 to 2024
- PDF documents: Extracted from CommonCrawl WAT dumps covering 2023 to 2024
- ArXiv documents: A subset of papers from the ArXiv repository

In total, 🍃 MINT-1T contains 1056.8 million documents, broken down as follows:

- 1029.4 million HTML documents
- 24.0 million PDF documents
- 0.6 million ArXiv documents

#### Data Collection and Processing

The data collection and processing involved several steps:

1. Document Extraction:
   - HTML documents were parsed from CommonCrawl WARC files
   - PDF documents were extracted from CommonCrawl WAT files
   - ArXiv papers were directly sourced from ArXiv S3 buckets
2. Filtering Process:
   - Applied text quality filters to ensure content relevance and readability
   - Removed duplicate content at both paragraph and document levels
   - Filtered out undesirable content based on predefined criteria
   - Verified image availability and quality for HTML documents
   - Limited PDF size to 50MB and 50 pages to manage dataset size and quality
3. Image Processing:
   - Used NSFW image detection to remove pornographic or otherwise undesirable images
   - Removed images smaller than 150 pixels or larger than 20,000 pixels
   - Adjusted aspect ratio thresholds for HTML (2:1) and PDF (3:1) to preserve scientific figures
4. Text Processing:
   - Used fasttext for language identification, focusing on English content
   - Masked personally identifiable information such as email addresses and IP addresses
   - Applied paragraph and document-level deduplication using Bloom filters
5. PDF Specific Processing:
   - Used PyMuPDF for parsing PDFs and extracting reading order
   - Clustered text blocks based on columns and ordered from top left to bottom right
6. ArXiv Specific Processing:
   - Used TexSoup to parse LaTeX source code and interleave images with text
   - Cleaned up LaTeX code by removing imports, bibliography, tables, and citation tags

Various open-source tools were utilized in this process, including fasttext, [PyMuPDF](https://github.com/pymupdf/PyMuPDF), and, for deduplication and content filtering, [DCLM](https://www.datacomp.ai/dclm/) and [bff](https://github.com/revbucket/bff).

#### Personal and Sensitive Information

Despite sourcing from public web data, significant efforts were made to minimize the inclusion of personal and sensitive information:

- Email addresses and IP addresses were masked to protect privacy
- An NSFW image classifier was used to remove inappropriate visual content
- URLs containing substrings associated with undesirable or sensitive content were filtered out

However, users should be aware that as the data originates from the public web, it may still contain some sensitive or personal information. The dataset creators acknowledge this limitation and advise users to exercise caution and potentially apply additional filtering based on their specific use cases.

## Bias, Risks, and Limitations

Several potential biases, risks, and limitations have been identified:

1. Data Bias: As the dataset is sourced from web crawls, it may inherit biases present in online content.
2. Content Risks: Despite extensive filtering, there's a possibility that some offensive, insensitive, or inappropriate content may remain in the dataset.
3. Image Availability: The dataset relies on external image URLs, which may become unavailable over time due to link rot, potentially affecting the dataset's long-term usability.
4. PDF Parsing Limitations: The current method for extracting reading order from PDFs may not always accurately capture the intended flow, especially for documents with complex layouts.
5. Potential Legal and Ethical Concerns: While efforts were made to respect robots.txt files and remove sensitive information, there may still be content that individuals did not explicitly consent to include.

### Recommendations

Given these considerations, the following recommendations are provided:

1. Additional Filtering: Users are strongly encouraged to apply additional filtering based on their specific use case and ethical considerations.
2. Inappropriate Use Cases: The dataset is not recommended for applications involving the processing or generation of personally identifying information, nor for military applications.
3. Legal Compliance: Users should independently verify compliance with applicable laws before employing MINT-1T for commercial purposes.
4. Bias Awareness: Researchers and developers should be cognizant of potential biases in the dataset and consider their impact on model training and outputs.

## License

We release 🍃 MINT-1T under a CC-BY-4.0 license, designating it primarily as a research artifact. While the dataset is freely available, users are responsible for ensuring its legal use in commercial settings. Users must independently verify compliance with applicable laws before employing MINT-1T for commercial purposes.

## Citation

```
@article{awadalla2024mint1t,
  title={MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens},
  author={Anas Awadalla and Le Xue and Oscar Lo and Manli Shu and Hannah Lee and Etash Kumar Guha and Matt Jordan and Sheng Shen and Mohamed Awadalla and Silvio Savarese and Caiming Xiong and Ran Xu and Yejin Choi and Ludwig Schmidt},
  year={2024}
}
```
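
The image size and aspect-ratio rules listed under Image Processing can be illustrated with a short sketch. This is not the authors' code: the function name `keep_image` is hypothetical, and the sketch assumes that the pixel limits apply to each side of the image and that the aspect ratio is the long side divided by the short side, neither of which the card specifies.

```python
# Thresholds taken from the card; how they are applied is an assumption.
MIN_SIDE_PX = 150        # images smaller than 150 pixels are removed
MAX_SIDE_PX = 20_000     # images larger than 20,000 pixels are removed
MAX_ASPECT = {"html": 2.0, "pdf": 3.0}  # 2:1 for HTML, 3:1 for PDF


def keep_image(width: int, height: int, source: str) -> bool:
    """Return True if an image passes the size and aspect-ratio checks.

    Assumption: limits are per side, aspect ratio is long side / short side.
    `source` must be "html" or "pdf", the two cases the card names.
    """
    short_side, long_side = min(width, height), max(width, height)
    if short_side < MIN_SIDE_PX or long_side > MAX_SIDE_PX:
        return False
    return long_side / short_side <= MAX_ASPECT[source]
```

Under these assumptions a 900x300 figure would be dropped from an HTML document but kept in a PDF, which matches the stated goal of preserving wide scientific figures.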
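
For users who want to apply the additional filtering recommended above, the masking of email addresses and IP addresses described under Text Processing might look roughly like the following. The regular expressions, the placeholder tokens `<EMAIL>` and `<IP>`, and the function name `mask_pii` are all illustrative assumptions; the card does not say which patterns or replacement strings the authors used, and only IPv4 addresses are covered here.

```python
import re

# Simplified patterns for illustration; real-world email and IP matching
# has many more edge cases than these expressions handle.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def mask_pii(text: str, email_token: str = "<EMAIL>", ip_token: str = "<IP>") -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub(email_token, text)
    return IPV4_RE.sub(ip_token, text)


print(mask_pii("Contact alice@example.com from 192.168.0.1."))
# prints: Contact <EMAIL> from <IP>.
```

Note that any four dot-separated numbers in the 0 to 255 range match the IPv4 pattern, so some four-part version strings would also be masked.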
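
The paragraph-level deduplication with Bloom filters mentioned under Text Processing can be sketched in plain Python. The authors used the bff tool for this; the code below is an independent toy illustration of the general technique, not bff's implementation. The class `BloomFilter`, the function `dedup_paragraphs`, the filter size, the hash count, and the whitespace and case normalization are all assumptions.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Minimal Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedup_paragraphs(documents: Iterable[str], bloom: BloomFilter) -> Iterator[str]:
    """Yield each document with paragraphs seen earlier in the stream removed."""
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            key = " ".join(para.lower().split())  # assumed normalization
            if not key or key in bloom:
                continue  # empty, or (probably) already seen
            bloom.add(key)
            kept.append(para)
        yield "\n\n".join(kept)
```

A Bloom filter never misses a true duplicate but can report false positives, so a small fraction of unique paragraphs may be dropped; the filter size and number of hashes control that trade-off against memory.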

Provider:
maas
Created:
2025-10-03