AnyWord-3M
Source: ModelScope community. Updated: 2026-05-21. Indexed: 2024-05-15.
Download link:
https://modelscope.cn/datasets/iic/AnyWord-3M
Resource overview:
## Dataset Introduction
Currently, public datasets for text generation tasks, especially those involving non-Latin languages, are relatively scarce. Therefore, we propose a large-scale multilingual dataset, AnyWord-3M.
The images in the dataset are sourced from Noah-Wukong, LAION-400M, and datasets built for OCR tasks, such as ArT, COCO-Text, RCTW, LSVT, MLT, MTWI, and ReCTS. They cover a wide range of scenes that contain text, including street views, book covers, advertisements, posters, and movie frames.
The OCR datasets are used with their existing annotations; all other images are processed with the PP-OCR detection and recognition models. Text descriptions are then generated with BLIP-2. After strict filtering and careful post-processing, we obtained a total of 3,034,486 images containing over 9 million lines of text and more than 20 million characters or Latin words.
Additionally, we randomly selected 1,000 images from the Wukong and LAION subsets to create the evaluation set [AnyText-benchmark](https://modelscope.cn/datasets/iic/AnyText-benchmark/summary), which is specifically designed to evaluate the accuracy and quality of Chinese and English text generation. The remaining images constitute the AnyWord-3M training set: approximately 1.6 million are Chinese, 1.39 million are English, and an additional 10,000 images contain other languages including Japanese, Korean, Arabic, Bengali, and Hindi.
For detailed statistics and randomly selected sample images, please refer to our paper [AnyText](https://arxiv.org/abs/2311.03054). (Note: this open-source release is version V1.1.)
*Note: The LAION part was previously distributed as a multi-volume archive, which was inconvenient to extract. It is now split into 5 zip packages, each of which can be extracted independently. Extract the images from all of laion_p[1-5].zip into the imgs folder.*
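
The extraction step in the note above can be sketched in Python with the standard library. This is a minimal sketch, assuming the five archives sit in one directory under the names laion_p1.zip through laion_p5.zip. The function name `extract_laion_parts` and the parameters `archive_dir` and `imgs_dir` are hypothetical, and the internal layout of the archives is not described here, so members are extracted with whatever relative paths they carry.

```python
import zipfile
from pathlib import Path


def extract_laion_parts(archive_dir, imgs_dir, parts=range(1, 6)):
    """Extract laion_p1.zip ... laion_p5.zip into imgs_dir.

    Each archive is opened on its own, so a missing part does not block
    the others. Returns the names of the archives that were not found.
    """
    archive_dir = Path(archive_dir)
    imgs_dir = Path(imgs_dir)
    imgs_dir.mkdir(parents=True, exist_ok=True)

    missing = []
    for i in parts:
        archive = archive_dir / f"laion_p{i}.zip"
        if not archive.is_file():
            missing.append(archive.name)
            continue
        with zipfile.ZipFile(archive) as zf:
            # Members keep the relative paths they have inside the archive.
            zf.extractall(imgs_dir)
    return missing
```

A call such as `extract_laion_parts("downloads", "imgs")` returns an empty list when all five parts were found; a non-empty list names the parts that still need to be downloaded.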
Update (2025-02-28): The training annotations used for AnyText2 have been added; the image data is unchanged. Extract anytext2_json_files.zip and replace the existing JSON files with the latest ones it contains. For details, see the [paper](https://arxiv.org/abs/2411.15245) and [code](https://github.com/tyxsspa/AnyText2).
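
One way to apply the AnyText2 JSON update is sketched below. The layout of anytext2_json_files.zip is not described here, so this sketch rests on a loud assumption: the archive mirrors the relative paths of the JSON files it is meant to replace. The function name `replace_json_files` and the parameters `zip_path` and `dataset_root` are hypothetical.

```python
import zipfile
from pathlib import Path


def replace_json_files(zip_path, dataset_root):
    """Extract the .json members of zip_path over dataset_root.

    Assumption: the archive mirrors the relative paths of the JSON files
    it replaces. Returns (replaced, added) lists of relative paths.
    """
    dataset_root = Path(dataset_root)
    replaced, added = [], []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".json"):
                continue
            target = dataset_root / info.filename
            (replaced if target.exists() else added).append(info.filename)
            zf.extract(info, dataset_root)  # overwrites an existing file
    return replaced, added
```

If `added` comes back non-empty, some files did not match anything already on disk, which suggests the layout assumption does not hold and the files should be placed by hand.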
#### Download Methods
The dataset can be downloaded with the ModelScope SDK or with git from https://modelscope.cn/datasets/iic/AnyWord-3M.
Provider:
maas
Created:
2024-04-15
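Summary
Background overview
AnyWord-3M is a large-scale multilingual dataset for text generation. It contains about 3.03 million images covering Chinese, English, and several other languages, and is designed specifically to support text generation in non-Latin scripts. The data come from several public datasets and OCR tasks and have been strictly filtered and processed. The dataset is suited to training and evaluating text generation models, particularly for applications such as image-text fusion and Stable Diffusion.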