REFINESUMM
REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset
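Dataset Overview
- Name: REFINESUMM
- Type: multimodal summarization dataset
- Purpose: training and evaluating vision-language models on image-text multimodal summarization
- Contents: triples of text, its associated image, and a summary, built from Wikipedia articles and their accompanying images
- Generation model: summaries are generated automatically by a multimodal large language model (LLaVA-v1.6-Mistral-7B) and then improved through a self-refinement process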
Dataset Download
- Download location: Hugging Face
Data Loading
- Steps:
  - Download the WikiWeb2M test split: wget https://storage.googleapis.com/gresearch/wit/wikiweb2m/wikiweb2m-test.tfrecord.gz
  - Place the downloaded file in the data/ directory
  - Set the split (train, val, or test) on line 12 of update_data_from_wikiweb2m.py
  - Run the following command: python update_data_from_wikiweb2m.py
  - The dataset is saved to the data/ directory with the columns txt (article), img (image), and summary (summary)
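The first two steps above (download the WikiWeb2M test split and place it in data/) can also be done from Python instead of wget. The following is a minimal sketch, not part of the REFINESUMM repository: the helper names `local_path_for` and `download_if_missing` are hypothetical, and only the test-split URL quoted in the steps is assumed to exist.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URL copied from the download step; URLs for other splits are not assumed here.
WIKIWEB2M_TEST_URL = (
    "https://storage.googleapis.com/gresearch/wit/wikiweb2m/"
    "wikiweb2m-test.tfrecord.gz"
)


def local_path_for(url: str, data_dir: str = "data") -> Path:
    """Return the path under data_dir where the file at url should be stored."""
    filename = Path(urlparse(url).path).name
    return Path(data_dir) / filename


def download_if_missing(url: str, data_dir: str = "data", fetch=urlretrieve) -> Path:
    """Download url into data_dir unless the file is already present.

    `fetch` is any callable taking (url, destination_path); it defaults to
    urllib's urlretrieve and can be swapped out for testing.
    """
    target = local_path_for(url, data_dir)
    target.parent.mkdir(parents=True, exist_ok=True)  # create data/ if needed
    if not target.exists():
        fetch(url, str(target))
    return target


# Example usage (performs a large network download):
#   path = download_if_missing(WIKIWEB2M_TEST_URL)
#   print(f"WikiWeb2M test split available at {path}")
```

After the file is in place, the remaining steps are unchanged: set the split in update_data_from_wikiweb2m.py and run the script.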
Citation
- Citation format:
@inproceedings{patil-etal-2024-refinesumm,
    title = "{REFINESUMM}: Self-Refining {MLLM} for Generating a Multimodal Summarization Dataset",
    author = "Patil, Vaidehi and Ribeiro, Leonardo and Liu, Mengwen and Bansal, Mohit and Dreyer, Markus",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.743",
    pages = "13773--13786",
    abstract = "Multimodal Large Language Models (MLLMs) excel at synthesizing key information from diverse sources. However, generating accurate and faithful multimodal summaries is challenging, primarily due to the lack of appropriate multimodal datasets for fine-tuning that meaningfully integrate textual and visual modalities. To address this gap, we present a new dataset designed specifically for image-text multimodal summarization, harnessing the capabilities of state-of-the-art MLLMs. We generate summaries from Wikipedia sections and corresponding images and evaluate them across text-based, visual and multimodal dimensions, employing reference-free metrics. To refine the dataset, we: (1) Filter the MLLM-generated summaries by training a critic model on human annotations and using its predictions to remove low-quality summaries; (2) Fine-tune the MLLM with the filtered high-quality summaries; (3) Use the fine-tuned model in turn to regenerate the summaries. This self-refinement process significantly improves summary quality, as measured by human judgements and automatic multimodal metrics, resulting in a valuable dataset for multimodal summarization research. The dataset is publicly available at https://github.com/amazon-science/refinesumm.",
}




