MINT-1T-HTML
ModelScope community · Updated 2026-01-08 · Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/MINT-1T-HTML
Resource description:
<h1 align="center">
🍃 MINT-1T:<br>Scaling Open-Source Multimodal Data by 10x:<br> A Multimodal Dataset with One Trillion Tokens
</h1>
🍃 MINT-1T is an open-source **M**ultimodal **INT**erleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. It also includes previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. It was created by a team from the University of Washington in collaboration with Salesforce Research and other academic institutions, including Stanford University, the University of Texas at Austin, and the University of California, Berkeley.
You are currently viewing the HTML subset of 🍃 MINT-1T. For PDF and ArXiv subsets, please refer to the [🍃 MINT-1T collection](https://huggingface.co/collections/mlfoundations/mint-1t-6690216ca4d0df7e518dde1c).

## Updates
### 9/7/24
We have improved MINT-1T (HTML) by removing boilerplate from the header and footer of each document. This new version of the data can be found in directory `data_v1_1` and contains 742B text tokens. The previous version of the data can be found in directory `data_v1_0`.
### 8/8/24
We have updated MINT-1T (HTML) with fixed document URL filtering and additional image safety filtering. As we prioritize safety, we have decided to only release the HTML data from MINT-1T that passes a rigorous image filtering pipeline; we run an additional image safety classifier, the one created by [Datacomp](https://www.datacomp.ai/dcclip/index.html#home), on data already filtered by our [original NSFW image classifier](https://github.com/GantMan/nsfw_model). The newly released MINT-1T (HTML) contains 792B text tokens and 905M documents.
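
As a quick sanity check on the figures quoted in these two updates, the arithmetic below derives the average document length and the share of tokens removed by the boilerplate cleanup. It assumes that `data_v1_0` corresponds to the 8/8/24 release (792B tokens); the function and constant names are illustrative only.

```python
# Figures quoted in the update notes above.
TOKENS_AUG_RELEASE = 792e9  # text tokens in the 8/8/24 release
DOCS_AUG_RELEASE = 905e6    # documents in the 8/8/24 release
TOKENS_V1_1 = 742e9         # text tokens in data_v1_1 (9/7/24)


def tokens_per_document(tokens: float, documents: float) -> float:
    """Average number of text tokens per document."""
    return tokens / documents


def relative_reduction(before: float, after: float) -> float:
    """Fraction of `before` that was removed to reach `after`."""
    return (before - after) / before


avg_tokens = tokens_per_document(TOKENS_AUG_RELEASE, DOCS_AUG_RELEASE)
# Assumption: data_v1_0 is the 8/8/24 release, so the difference is boilerplate.
removed = relative_reduction(TOKENS_AUG_RELEASE, TOKENS_V1_1)

print(f"average tokens per document (8/8/24 release): {avg_tokens:.0f}")  # 875
print(f"share of tokens removed in data_v1_1: {removed:.1%}")  # 6.3%
```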
## Dataset Details
### Dataset Sources
- **Repository:** https://github.com/mlfoundations/MINT-1T
- **Paper:** https://arxiv.org/abs/2406.11271
- **Blog:** https://blog.salesforceairesearch.com/mint-1t/
## Uses
### Direct Use
🍃 MINT-1T is designed to facilitate research in multimodal pretraining. The dataset can be used to train multimodal models that reason about interleaved sequences of text and images, such as [Idefics2](https://huggingface.co/HuggingFaceM4/idefics2-8b), [XGen-MM](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1), and [Chameleon](https://huggingface.co/facebook/chameleon-30b).
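
To make "interleaved" concrete, here is a minimal sketch of one way such a document could be held in memory and flattened into a training sequence. The class, its field names, and the `<image>` placeholder are hypothetical and are not taken from the dataset's actual schema; consult the repository for the real format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InterleavedDocument:
    """Hypothetical container: position i holds either text or an image URL."""

    texts: List[Optional[str]]
    image_urls: List[Optional[str]]

    def __post_init__(self) -> None:
        if len(self.texts) != len(self.image_urls):
            raise ValueError("texts and image_urls must have the same length")
        for text, url in zip(self.texts, self.image_urls):
            # Exactly one of the two slots must be filled at each position.
            if (text is None) == (url is None):
                raise ValueError("each position needs exactly one of text or image")

    def to_sequence(self, image_token: str = "<image>") -> str:
        """Flatten to a single string, marking image positions with a placeholder."""
        parts = [text if text is not None else image_token for text in self.texts]
        return " ".join(parts)


doc = InterleavedDocument(
    texts=["A photo of the lab:", None, "The setup is shown above."],
    image_urls=[None, "https://example.com/lab.jpg", None],
)
print(doc.to_sequence())  # A photo of the lab: <image> The setup is shown above.
```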
### Out-of-Scope Use
🍃 MINT-1T was built to make research into large multimodal models more accessible. Using the dataset to train models that ingest or generate personally identifying information (such as images of people's faces and other sensitive content), or using it for military applications, is an inappropriate use of 🍃 MINT-1T.
## Dataset Creation
### Curation Rationale
🍃 MINT-1T was created to address a significant gap in the open-source domain by providing a large-scale multimodal interleaved dataset for pre-training large multimodal models. This dataset aims to be a valuable resource for the research community, facilitating open science in multimodal pretraining.
### Source Data
The dataset is a comprehensive collection of multimodal documents from various sources:
- HTML documents: Filtered from CommonCrawl WARC dumps spanning from 2017 to 2024
- PDF documents: Extracted from CommonCrawl WAT dumps covering 2023 to 2024
- ArXiv documents: A subset of papers from the ArXiv repository
In total, 🍃 MINT-1T contains 1056.8 million documents, broken down as follows:
- 1029.4 million HTML documents
- 24.0 million PDF documents
- 0.6 million ArXiv documents
#### Data Collection and Processing
The data collection and processing involved several steps:
1. Document Extraction:
   - HTML documents were parsed from CommonCrawl WARC files
   - PDF documents were extracted from CommonCrawl WAT files
   - ArXiv papers were directly sourced from ArXiv S3 buckets
2. Filtering Process:
   - Applied text quality filters to ensure content relevance and readability
   - Removed duplicate content at both paragraph and document levels
   - Filtered out undesirable content based on predefined criteria
   - Verified image availability and quality for HTML documents
   - Limited PDF size to 50MB and 50 pages to manage dataset size and quality
3. Image Processing:
   - Used NSFW image detection to remove pornographic or otherwise undesirable images
   - Removed images smaller than 150 pixels or larger than 20,000 pixels
   - Adjusted aspect ratio thresholds for HTML (2:1) and PDF (3:1) to preserve scientific figures
4. Text Processing:
   - Used fasttext for language identification, focusing on English content
   - Masked personally identifiable information such as email addresses and IP addresses
   - Applied paragraph and document-level deduplication using Bloom filters
5. PDF Specific Processing:
   - Used PyMuPDF for parsing PDFs and extracting reading order
   - Clustered text blocks based on columns and ordered from top left to bottom right
6. ArXiv Specific Processing:
   - Used TexSoup to parse LaTeX source code and interleave images with text
   - Cleaned up LaTeX code by removing imports, bibliography, tables, and citation tags
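
The size and aspect-ratio rules in step 3 can be sketched as a single predicate. This is an illustration under stated assumptions, not the released pipeline: it assumes the pixel limits apply to each side of the image, that the aspect ratio is the longer side divided by the shorter side, and that images above the threshold are dropped. The function and constant names are hypothetical.

```python
MIN_SIDE_PX = 150      # images with a side below this are removed
MAX_SIDE_PX = 20_000   # images with a side above this are removed
MAX_ASPECT_RATIO = {"html": 2.0, "pdf": 3.0}  # thresholds quoted in step 3


def keep_image(width: int, height: int, source: str) -> bool:
    """Return True if an image passes the size and aspect-ratio checks.

    Assumptions: limits apply per side; ratio = longer side / shorter side.
    `source` must be "html" or "pdf".
    """
    if width <= 0 or height <= 0:
        return False
    shorter, longer = min(width, height), max(width, height)
    if shorter < MIN_SIDE_PX or longer > MAX_SIDE_PX:
        return False
    return longer / shorter <= MAX_ASPECT_RATIO[source]


print(keep_image(800, 600, "html"))  # True
print(keep_image(900, 350, "html"))  # False: ratio is about 2.57
print(keep_image(900, 350, "pdf"))   # True: the PDF threshold is looser
```

The looser PDF threshold is what lets wide figures such as plots and diagrams survive filtering.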
Various open-source tools were used in this process, including fasttext and [PyMuPDF](https://github.com/pymupdf/PyMuPDF), as well as [DCLM](https://www.datacomp.ai/dclm/) and [bff](https://github.com/revbucket/bff) for deduplication and content filtering.
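
For readers unfamiliar with Bloom-filter deduplication, the sketch below shows the basic idea at paragraph level: a paragraph is dropped if the filter reports it as already seen. It is a deliberately simplified, exact-match illustration written from scratch; it does not reproduce the algorithm or parameters used by bff, and the class, function names, and sizes are assumptions.

```python
import hashlib
from typing import Iterable, List


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedup_paragraphs(paragraphs: Iterable[str], seen: BloomFilter) -> List[str]:
    """Keep only paragraphs whose normalized text the filter has not seen."""
    kept = []
    for paragraph in paragraphs:
        key = " ".join(paragraph.lower().split())  # case and whitespace folding
        if key in seen:
            continue
        seen.add(key)
        kept.append(paragraph)
    return kept


seen = BloomFilter()
doc_a = ["Welcome to our site.", "MINT-1T has one trillion tokens."]
doc_b = ["welcome to  our site.", "A new paragraph."]
first = dedup_paragraphs(doc_a, seen)   # both paragraphs are new
second = dedup_paragraphs(doc_b, seen)  # the repeated greeting is dropped
print(first)
print(second)
```

Sharing one filter across documents is what removes paragraphs repeated between pages. The trade-off is that a false positive occasionally discards a paragraph that was in fact unique.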
#### Personal and Sensitive Information
Despite sourcing from public web data, significant efforts were made to minimize the inclusion of personal and sensitive information:
- Email addresses and IP addresses were masked to protect privacy
- An NSFW image classifier was used to remove inappropriate visual content
- URLs containing substrings associated with undesirable or sensitive content were filtered out
However, users should be aware that as the data originates from the public web, it may still contain some sensitive or personal information. The dataset creators acknowledge this limitation and advise users to exercise caution and potentially apply additional filtering based on their specific use cases.
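
For users who want to apply additional filtering of this kind, the following is a minimal sketch of email/IP masking and URL substring filtering using regular expressions. The patterns, replacement tokens, and blocklist entries are illustrative assumptions and are not the ones used to build the dataset; the IP pattern covers IPv4 only.

```python
import re
from typing import Sequence

# Simplified patterns for illustration; real-world PII detection needs more care.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)

# Hypothetical blocklist; the dataset's actual substring list is not given here.
BLOCKED_SUBSTRINGS = ("casino", "adult")


def mask_pii(text: str, email_token: str = "<EMAIL>", ip_token: str = "<IP>") -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub(email_token, text)
    return IPV4_RE.sub(ip_token, text)


def url_allowed(url: str, blocked: Sequence[str] = BLOCKED_SUBSTRINGS) -> bool:
    """Reject URLs that contain any blocked substring (case-insensitive)."""
    lowered = url.lower()
    return not any(substring in lowered for substring in blocked)


print(mask_pii("Contact bob@example.com from 192.168.0.1."))
# Contact <EMAIL> from <IP>.
print(url_allowed("https://example.com/paper.html"))     # True
print(url_allowed("https://example.com/Adult-content"))  # False
```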
## Bias, Risks, and Limitations
Several potential biases, risks, and limitations have been identified:
1. Data Bias: As the dataset is sourced from web crawls, it may inherit biases present in online content.
2. Content Risks: Despite extensive filtering, there's a possibility that some offensive, insensitive, or inappropriate content may remain in the dataset.
3. Image Availability: The dataset relies on external image URLs, which may become unavailable over time due to link rot, potentially affecting the dataset's long-term usability.
4. PDF Parsing Limitations: The current method for extracting reading order from PDFs may not always accurately capture the intended flow, especially for documents with complex layouts.
5. Potential Legal and Ethical Concerns: While efforts were made to respect robots.txt files and remove sensitive information, there may still be content that individuals did not explicitly consent to include.
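
To illustrate the PDF parsing limitation in item 4, here is a minimal sketch of a column-based reading-order heuristic: blocks are grouped into columns by their left edge, then read column by column from top to bottom. It is not the authors' implementation. The tuple layout, the `column_gap` threshold, and the function name are assumptions, and coordinates are taken to have their origin at the top-left of the page.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text) with the origin at the top-left corner of the page.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], column_gap: float = 50.0) -> List[str]:
    """Order text blocks column by column, top to bottom within each column.

    Blocks whose left edges are within `column_gap` of the previous block's
    left edge (after sorting by x0) are placed in the same column.
    """
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b[0]):
        if columns and block[0] - columns[-1][-1][0] <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: List[str] = []
    for column in columns:  # columns are already ordered left to right
        for block in sorted(column, key=lambda b: b[1]):  # top to bottom
            ordered.append(block[4])
    return ordered


page = [
    (50, 100, 280, 380, "left top"),
    (320, 100, 550, 380, "right top"),
    (50, 400, 280, 700, "left bottom"),
    (320, 400, 550, 700, "right bottom"),
]
print(reading_order(page))
# ['left top', 'left bottom', 'right top', 'right bottom']
```

A heuristic like this assigns a block that spans both columns, such as a wide figure caption, to whichever column its left edge falls in, which is one way the extracted order can diverge from the intended flow.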
### Recommendations
Given these considerations, the following recommendations are provided:
1. Additional Filtering: Users are strongly encouraged to apply additional filtering based on their specific use case and ethical considerations.
2. Inappropriate Use Cases: The dataset is not recommended for applications involving the processing or generation of personally identifying information, nor for military applications.
3. Legal Compliance: Users should independently verify compliance with applicable laws before employing MINT-1T for commercial purposes.
4. Bias Awareness: Researchers and developers should be cognizant of potential biases in the dataset and consider their impact on model training and outputs.
## License
We release 🍃 MINT-1T under a CC-BY-4.0 license, designating it primarily as a research artifact. While the dataset is freely available, users are responsible for ensuring its legal use in commercial settings and must independently verify compliance with applicable laws before employing 🍃 MINT-1T for commercial purposes.
## Citation
```
@article{awadalla2024mint1t,
title={MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens},
author={Anas Awadalla and Le Xue and Oscar Lo and Manli Shu and Hannah Lee and Etash Kumar Guha and Matt Jordan and Sheng Shen and Mohamed Awadalla and Silvio Savarese and Caiming Xiong and Ran Xu and Yejin Choi and Ludwig Schmidt},
year={2024}
}
```
Provider:
maas
Created:
2024-07-27



