OBELICS

ModelScope community · updated 2026-01-09 · indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/HuggingFaceM4/OBELICS
Resource description:
# Dataset Card for OBELICS

## Dataset Description

- **Visualization of OBELICS web documents:** https://huggingface.co/spaces/HuggingFaceM4/obelics_visualization
- **Paper:** [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://arxiv.org/abs/2306.16527)
- **Repository:** https://github.com/huggingface/OBELICS
- **Point of Contact:** hugo@huggingface.co

`OBELICS` is an open, massive, and curated collection of interleaved image-text web documents, containing 141M English documents, 115B text tokens, and 353M images, extracted from Common Crawl dumps between February 2020 and February 2023. The collection and filtering steps are described in our [paper](https://huggingface.co/papers/2306.16527).

Interleaved image-text web documents are a succession of text paragraphs interleaved by images, such as web pages that contain images. Models trained on these web documents outperform vision and language models trained solely on image-text pairs on various benchmarks. They can also generate long and coherent text about a set of multiple images. As an example, we trained [IDEFICS](https://huggingface.co/HuggingFaceM4/idefics-80b), a visual language model that accepts arbitrary sequences of image and text inputs and produces text outputs.

We provide an [interactive visualization](https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f) of OBELICS that allows exploring the content of OBELICS. The map shows a subset of 11M of the 141M documents.
[![OBELICS Nomic map](assets/nomic_map.png)](https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f)

## Data Fields

An example of a sample looks as follows:

```
# The example has been cropped
{
    'images': [
        'https://cdn.motor1.com/images/mgl/oRKO0/s1/lamborghini-urus-original-carbon-fiber-accessories.jpg',
        None
    ],
    'metadata': '[{"document_url": "https://lamborghinichat.com/forum/news/vw-group-allegedly-receives-offer-to-sell-lamborghini-for-9-2-billion.728/", "unformatted_src": "https://cdn.motor1.com/images/mgl/oRKO0/s1/lamborghini-urus-original-carbon-fiber-accessories.jpg", "src": "https://cdn.motor1.com/images/mgl/oRKO0/s1/lamborghini-urus-original-carbon-fiber-accessories.jpg", "formatted_filename": "lamborghini urus original carbon fiber accessories", "alt_text": "VW Group Allegedly Receives Offer To Sell Lamborghini For $9.2 Billion", "original_width": 1920, "original_height": 1080, "format": "jpeg"}, null]',
    'general_metadata': '{"url": "https://lamborghinichat.com/forum/news/vw-group-allegedly-receives-offer-to-sell-lamborghini-for-9-2-billion.728/", "warc_filename": "crawl-data/CC-MAIN-2021-25/segments/1623488528979.69/warc/CC-MAIN-20210623011557-20210623041557-00312.warc.gz", "warc_record_offset": 322560850, "warc_record_length": 17143}',
    'texts': [
        None,
        'The buyer would get everything, including Lambo\'s headquarters.\n\nThe investment groupQuantum Group AG has submitted a€7.5 billion ($9.2 billion at current exchange rates) offer to purchase Lamborghini from Volkswagen Group, Autocar reports. There\'s no info yet about whether VW intends to accept the offer or further negotiate the deal.\n\nQuantum ... Group Chief Executive Herbert Diess said at the time.'
    ]
}
```

Each sample is composed of the same 4 fields: `images`, `texts`, `metadata`, and `general_metadata`. `images` and `texts` are two lists of the same size, where for each index, one element and only one is not `None`.
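The field layout can be walked with a few lines of Python. The sketch below is illustrative rather than part of the official tooling: the helper name `iter_interleaved` and the toy sample with `example.com` URLs are hypothetical, and it assumes that the `metadata` string parses as JSON, which is how it appears in the cropped sample above.

```python
import json
from typing import Any, Dict, Iterator, Optional, Tuple

Item = Tuple[str, str, Optional[Dict[str, Any]]]


def iter_interleaved(sample: Dict[str, Any]) -> Iterator[Item]:
    """Yield ("image", url, image_metadata) or ("text", paragraph, None) in document order.

    Hypothetical helper: assumes `metadata` is a JSON-encoded list aligned
    with `images` and `texts`, as in the cropped sample shown above.
    """
    images, texts = sample["images"], sample["texts"]
    metadata = json.loads(sample["metadata"])
    if not (len(images) == len(texts) == len(metadata)):
        raise ValueError("images, texts and metadata must have the same length")
    for image_url, text, meta in zip(images, texts, metadata):
        # Exactly one of the two slots should be filled at each index.
        if (image_url is None) == (text is None):
            raise ValueError("expected exactly one of image/text per position")
        if image_url is not None:
            yield ("image", image_url, meta)
        else:
            yield ("text", text, None)


# Toy sample with made-up values, shaped like an OBELICS record.
toy_sample = {
    "images": ["https://example.com/a.jpg", None],
    "texts": [None, "A paragraph."],
    "metadata": '[{"alt_text": "toy"}, null]',
    "general_metadata": '{"url": "https://example.com/page"}',
}
items = list(iter_interleaved(toy_sample))
document_url = json.loads(toy_sample["general_metadata"])["url"]
```

Raising on a violated invariant, instead of silently skipping the position, is a design choice of this sketch; a large-scale pipeline might prefer to log and continue.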
As a concrete case of the `images`/`texts` layout, for the interleaved web document `<image_1>text<image_2>`, we would find `[image_1, None, image_2]` in `images` and `[None, text, None]` in `texts`.

The images are replaced by their URLs, and users need to download the images themselves, for instance with the library [img2dataset](https://github.com/rom1504/img2dataset).

`metadata` is the string representation of a list containing information about each of the images. It has the same length as `texts` and `images` and logs, for each image, relevant information such as the original source document, the unformatted source, the alternative text if present, etc.

`general_metadata` is the string representation of a dictionary containing the URL of the document and information regarding the extraction from Common Crawl snapshots.

## Size and Data Splits

There is only one split, `train`, that contains 141,047,697 documents.

`OBELICS` with images replaced by their URLs weighs 666.6 GB (😈) in arrow format and 377 GB in the uploaded `parquet` format.

## Considerations for Using the Data

### Discussion of Biases

A ~50k subset of the `train` split was evaluated using the Data Measurements Tool, with a particular focus on the nPMI metric:

> nPMI scores for a word help to identify potentially problematic associations, ranked by how close the association is.
> nPMI bias scores for paired words help to identify how word associations are skewed between the selected words (Aka et al., 2021).
> You can select from gender and sexual orientation identity terms that appear in the dataset at least 10 times.
> The resulting ranked words are those that co-occur with both identity terms.
> The more positive the score, the more associated the word is with the first identity term. The more negative the score, the more associated the word is with the second identity term.
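To make the quoted description concrete, here is a minimal sketch of an nPMI bias score on a toy corpus. It is not the Data Measurements Tool's code: it assumes the standard definition nPMI(w, t) = PMI(w, t) / -log p(w, t), document-level co-occurrence counts, and a bias score taken as the difference between the nPMI of a word with the first and with the second identity term. The function names and the four toy documents are hypothetical.

```python
import math
from typing import List, Set


def npmi(word: str, term: str, docs: List[Set[str]]) -> float:
    """Normalized PMI of `word` and `term`, from document-level co-occurrence."""
    n = len(docs)
    p_word = sum(word in d for d in docs) / n
    p_term = sum(term in d for d in docs) / n
    p_joint = sum((word in d) and (term in d) for d in docs) / n
    if p_joint == 0.0:
        return -1.0  # limit value when the two never co-occur
    if p_joint == 1.0:
        return 0.0  # degenerate case (0/0); treated as "no information" in this sketch
    pmi = math.log(p_joint / (p_word * p_term))
    return pmi / -math.log(p_joint)


def npmi_bias(word: str, term_a: str, term_b: str, docs: List[Set[str]]) -> float:
    """Positive: `word` leans towards `term_a`; negative: towards `term_b`."""
    return npmi(word, term_a, docs) - npmi(word, term_b, docs)


# Toy corpus: each document is reduced to its set of tokens.
toy_docs = [
    {"she", "jobs"},
    {"she", "jobs"},
    {"he", "jobs"},
    {"he", "car"},
]
score_she = npmi("jobs", "she", toy_docs)
bias = npmi_bias("jobs", "she", "he", toy_docs)
```

In this toy corpus `jobs` co-occurs with `she` more often than with `he`, so the bias score is positive, matching the sign convention in the quoted text.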
While there was a positive skew of words relating to occupations, e.g. _`government`_, _`jobs`_, towards she, her, and similar attributions of the masculine and feminine words to they and them, more harmful word attributions such as _`escort`_ and even _`colour`_ presented with greater attributions to she, her and him, his, respectively.

![Data Measurement Tool Associations Eval](assets/DMT_eval.png)

We welcome users to explore the [Data Measurements nPMI Visualizations for OBELICS](https://huggingface.co/spaces/HuggingFaceM4/IDEFICS_Data_Measurement_Tool) further and to see the [idefics-9b model card](https://huggingface.co/HuggingFaceM4/idefics-9b) for further bias considerations.

## Opted-out content

To respect the preferences of content creators, we removed from OBELICS all images for which creators explicitly opted out of AI model training. We used the [Spawning API](https://api.spawning.ai/spawning-api) to verify that the images in the dataset respect the original copyright owners’ choices.

However, due to an error on our side, we did not remove entire documents (i.e., URLs) that opted out of AI model training. As of July 12, 2023, these documents represent 4.25% of the totality of OBELICS. The config `opt_out_docs_removed_2023_07_12` applies the correct filtering at the web document level as of July 2023: `ds = load_dataset("HuggingFaceM4/OBELICS", "opt_out_docs_removed_2023_07_12")`.

We recommend that users of OBELICS regularly check every document against the API.

## Content warnings

Despite our efforts in filtering, OBELICS contains a small proportion of documents that are not suitable for all audiences. For instance, while navigating the interactive map, you might find the cluster named "Sex", which predominantly contains descriptions of pornographic movies along with pornographic images. Other clusters contain advertising for sex workers or reports of violent shootings. In our experience, these documents represent a small proportion of all the documents.
## Terms of Use

By using the dataset, you agree to comply with the original licenses of the source content as well as the dataset license (CC-BY-4.0). Additionally, if you use this dataset to train a Machine Learning model, you agree to disclose your use of the dataset when releasing the model or an ML application using the model.

### Licensing Information

License CC-BY-4.0.

### Citation Information

If you are using this dataset, please cite

```
@misc{laurencon2023obelics,
      title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
      author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
      year={2023},
      eprint={2306.16527},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```

Provider: maas
Created: 2025-08-01
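Background overview
OBELICS is a large-scale dataset of interleaved image-text web documents, containing 141 million documents and 353 million images, intended for training vision-language models. The dataset has been rigorously filtered and provides detailed per-image and per-document metadata.
The summary above was collected and generated by 遇见数据集.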