Download link:

https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer



Resource description:

# LLaVA pretrain -- LCS-558k (refined by Data-Juicer)

A refined version of the LLaVA pretrain dataset (LCS-558k), produced with [Data-Juicer](https://github.com/alibaba/data-juicer).

Some "bad" samples were removed from the original dataset to improve its quality.

This dataset is typically used to pretrain a Multimodal Large Language Model.

**Notice**: Only a small subset is provided here for previewing. The whole dataset is available [here](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) (about 115MB).

## Dataset Information

- Number of samples: 500,380 (~89.65% of the original dataset is kept)

## Refining Recipe

```yaml
project_name: 'llava-1.5-pretrain-dataset-refine-recipe'
dataset_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption.jsonl'  # converted LLaVA pretrain dataset in Data-Juicer format with only_keep_caption is True. See tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py
export_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption_refined.jsonl'

np: 42                                # number of subprocess to process your dataset
text_keys: 'text'                     # the key name of field where the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...

# for multimodal data processing
image_key: 'images'                   # Key name of field to store the list of sample image paths.
image_special_token: '<image>'        # The special token that represents an image in the text. For LLaVA, it's "<image>". Should be aligned with the args when running conversion tools.
eoc_special_token: '<|__dj__eoc|>'    # The special token that represents the end of a chunk in the text. In default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset. Should be aligned with the args when running conversion tools.

open_tracer: true

# process schedule: a list of several process operators with their arguments
process:
  - fix_unicode_mapper:                 # fix unicode errors in text.
  - punctuation_normalization_mapper:   # normalize unicode punctuations to English punctuations.

  # 558128
  # Filter ops
  - alphanumeric_filter:  #558087       # filter text with alphabet/numeric ratio out of specific range.
      tokenization: false               # Whether to count the ratio of alphanumeric to the total number of tokens.
      min_ratio: 0.60                   # the min ratio of filter range
  - character_repetition_filter:  #546105  # filter text with the character repetition ratio out of specific range
      rep_len: 10                       # repetition length for char-level n-gram
      max_ratio: 0.09373663             # the max ratio of filter range
  - flagged_words_filter:  #543960      # filter text with the flagged-word ratio larger than a specific max value
      lang: en                          # consider flagged words in what language
      tokenization: false               # whether to use model to tokenize documents
      max_ratio: 0.0                    # the max ratio to filter text
  - perplexity_filter:  #532029         # filter text with perplexity score out of specific range
      lang: en                          # compute perplexity in what language
      max_ppl: 14435.5806               # the max perplexity score to filter text
  - special_characters_filter:  #531968 # filter text with special-char ratio out of specific range
      min_ratio: 0.16534802             # the min ratio of filter range
      max_ratio: 0.42023757             # the max ratio of filter range
  - word_repetition_filter:  # 530773   # filter text with the word repetition ratio out of specific range
      lang: en                          # sample in which language
      tokenization: false               # whether to use model to tokenize documents
      rep_len: 10                       # repetition length for word-level n-gram
      max_ratio: 0.03085751             # the max ratio of filter range

  - image_aspect_ratio_filter:  #542389 # filter samples according to the aspect ratios of images (a fraction of width by height, r=w/h) in them
      min_ratio: 0.333                  # the min aspect ratio of filter range
      max_ratio: 3.0                    # the max aspect ratio of filter range
      any_or_all: any                   # keep this sample when any/all images meet the filter condition
  - image_shape_filter:  #533966        # filter samples according to the widths and heights of images in them
      max_width: 727.8798422276         # the max width of width filter range
      max_height: 606.2421072264        # the max height of height filter range
      any_or_all: any                   # keep this sample when any/all images meet the filter condition
  - image_size_filter:  # 533966        # filter samples according to the size of images (in bytes) within them
      max_size: "124KB"                 # the max size of filter range
      any_or_all: any                   # keep this sample when any/all images meet the filter condition

  - image_text_similarity_filter:  #544202  # filter samples according to the similarity between text and images.
      hf_clip: openai/clip-vit-base-patch32  # name of used Hugging Face clip
      min_score: 0.20315419             # the min similarity of filter range
  - image_text_matching_filter:         # filter samples according to the matching score between image and text.
      hf_blip: Salesforce/blip-itm-base-coco  # name of used Hugging Face blip
      min_score: 0.44930778             # the min matching score of filter range
```
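The `any_or_all` option used by the image filters in the recipe is easiest to see in a small example. The following is a minimal sketch of that semantics for the aspect-ratio case, not Data-Juicer's actual implementation. The function names `aspect_ratio_ok` and `keep_sample` are hypothetical, the default thresholds are copied from the recipe, and keeping samples that contain no images is an assumption of this sketch.

```python
from typing import Iterable, Tuple


def aspect_ratio_ok(width: int, height: int,
                    min_ratio: float = 0.333,
                    max_ratio: float = 3.0) -> bool:
    """Check whether r = w / h lies inside [min_ratio, max_ratio]."""
    if height <= 0:
        raise ValueError("height must be positive")
    ratio = width / height
    return min_ratio <= ratio <= max_ratio


def keep_sample(image_sizes: Iterable[Tuple[int, int]],
                any_or_all: str = "any",
                min_ratio: float = 0.333,
                max_ratio: float = 3.0) -> bool:
    """Decide whether to keep a sample given the (width, height) of its images.

    'any': keep when at least one image passes the check.
    'all': keep only when every image passes the check.
    """
    flags = [aspect_ratio_ok(w, h, min_ratio, max_ratio) for w, h in image_sizes]
    if not flags:
        # Assumption of this sketch: a sample without images is kept.
        return True
    if any_or_all == "any":
        return any(flags)
    if any_or_all == "all":
        return all(flags)
    raise ValueError("any_or_all must be 'any' or 'all'")
```

For a sample with a single image, as is usual for LLaVA pretrain captions, `any` and `all` give the same result. The two modes differ only when a sample references several images.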

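The retention figure stated under Dataset Information can be checked with a few lines of arithmetic. This sketch assumes that the `# 558128` comment in the recipe is the sample count of the original dataset. The helper `count_samples` additionally assumes that the downloaded JSON file has a list at its top level, which this page does not confirm. Both function names are hypothetical.

```python
import json
from pathlib import Path

# Counts taken from the dataset card. Treating 558,128 as the original
# sample count is an assumption based on the recipe comment.
ORIGINAL_SAMPLES = 558_128
REFINED_SAMPLES = 500_380


def retention_percent(kept: int, total: int) -> float:
    """Return the percentage of samples kept, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * kept / total, 2)


def count_samples(json_path: str) -> int:
    """Count entries in a JSON file whose top level is a list (assumed layout)."""
    with Path(json_path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise TypeError("expected a JSON array at the top level")
    return len(data)
```

Calling `retention_percent(REFINED_SAMPLES, ORIGINAL_SAMPLES)` gives a value that matches the ~89.65% stated on the card. Running `count_samples` on the downloaded file is one way to confirm the 500,380 figure locally.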

Use cases: