WAON

Name: WAON
Creator: maas
Published: 2025-11-27 16:55:46
License: no description provided

ModelScope community. Updated 2025-11-27; indexed 2025-11-29.

Download link:

https://modelscope.cn/datasets/llm-jp/WAON


Resource description:

<div align="center" style="line-height: 1;">
<h1>WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models</h1>
| <a href="https://huggingface.co/collections/llm-jp/waon" target="_blank">🤗 HuggingFace</a>  | <a href="https://arxiv.org/abs/2510.22276" target="_blank">📄 Paper</a>  | <a href="https://github.com/llm-jp/WAON" target="_blank">🧑‍💻 Code</a>  |
<br/>
<img src="validation_top1_accuracy.svg" width="50%"/>
</div>

## Introduction

WAON is a Japanese (image, text) pair dataset containing approximately 155M examples, crawled from Common Crawl. It is built from the snapshots 2025-18, 2025-08, 2024-51, 2024-42, 2024-33, and 2024-26.

The dataset is high-quality and diverse, constructed through a sophisticated data processing pipeline. We apply filtering based on image size and SigLIP scores, and perform deduplication using URLs, captions, and perceptual hashes (pHash).

## How to Use

Clone the repository:

```bash
git clone https://gitlab.llm-jp.nii.ac.jp/datasets/waon.git
cd waon
```

Load the dataset using the `datasets` library:

```python
from datasets import load_dataset

# Read the Parquet files from the "data" directory of the cloned repository.
ds = load_dataset("parquet", data_dir="data")
```

### Format

- `url`: URL of the image
- `caption`: Caption associated with the image
- `page_title`: Title of the page containing the image
- `page_url`: URL of the page
- `punsafe`: Probability that the image is unsafe
- `quality`: The quality of the text in the text column
- `width`: Width (in pixels) of the resized image used for computing pHash
- `height`: Height (in pixels) of the resized image used for computing pHash
- `original_width`: Original width of the image
- `original_height`: Original height of the image
- `sha256`: SHA-256 hash of the original image file
- `phash`: Perceptual hash (pHash) computed from the resized image

## Dataset Construction Pipeline

We construct the WAON dataset through the following steps. The numbers in parentheses indicate the number of examples remaining after each processing step, based on the 2025-18 snapshot:

<div align="center">
<img src="waon-pipeline.svg" width="50%"/>
</div>

## LICENSE

This dataset (not including the images themselves) is licensed under the Apache License 2.0 and governed by Japanese law. Its use is limited to “information analysis” as defined in Article 30-4 of the Japanese Copyright Act.

## Citation

```bibtex
@misc{sugiura2025waonlargescalehighqualityjapanese,
  title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models},
  author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
  year={2025},
  eprint={2510.22276},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.22276},
}
```
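
## Example: Filtering Records by Metadata

As a supplement to the Format section above, the following is a minimal sketch of how the `punsafe`, `original_width`, and `original_height` fields might be used to select a subset of records. The thresholds (`MAX_PUNSAFE`, `MIN_SIDE`), the helper names (`keep_record`, `filter_records`), and the toy rows are assumptions made for illustration; the dataset card does not recommend specific cut-off values.

```python
from typing import Iterable, Iterator

# Hypothetical thresholds for illustration only; not taken from the dataset card.
MAX_PUNSAFE = 0.5   # drop rows whose unsafe probability exceeds this value
MIN_SIDE = 64       # drop rows whose shorter original side is below this many pixels


def keep_record(row: dict, max_punsafe: float = MAX_PUNSAFE, min_side: int = MIN_SIDE) -> bool:
    """Return True if the row passes the illustrative safety and size checks."""
    if row["punsafe"] > max_punsafe:
        return False
    return min(row["original_width"], row["original_height"]) >= min_side


def filter_records(rows: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield only the rows accepted by keep_record."""
    return (row for row in rows if keep_record(row))


# Toy rows that reuse field names from the Format section; all values are made up.
sample = [
    {"url": "https://example.com/a.jpg", "caption": "猫の写真",
     "punsafe": 0.01, "original_width": 640, "original_height": 480},
    {"url": "https://example.com/b.jpg", "caption": "小さなアイコン",
     "punsafe": 0.02, "original_width": 32, "original_height": 32},
    {"url": "https://example.com/c.jpg", "caption": "不適切な画像",
     "punsafe": 0.97, "original_width": 800, "original_height": 600},
]

kept = list(filter_records(sample))
print([row["url"] for row in kept])
```

With a dataset loaded through the `datasets` library, the same predicate could likely be passed to its `filter` method (for example `ds.filter(keep_record)`), assuming the loaded columns match the field names listed in the Format section.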


Provider:

maas

Created:

2025-11-24
