Align-Anything-TI2T-Instruction-100K

Name: Align-Anything-TI2T-Instruction-100K
Creator: maas
Published: 2025-11-17 16:16:05
License: 暂无描述

魔搭社区2025-11-17 更新2025-02-08 收录

下载链接：

https://modelscope.cn/datasets/PKU-Alignment/Align-Anything-TI2T-Instruction-100K

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for Align-Anything : Text-Image-to-Text Instruction-Following Subset ## Text+Image → Text Instruction-Following Dataset [[🏠 Homepage](https://github.com/PKU-Alignment/align-anything)] [[🤗 Align-Anything Datasets](https://huggingface.co/datasets/PKU-Alignment/align-anything/)] [[🦫 Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B)] ### Highlights - **Input & Output Modalities:** Input: Text + Image; Output: Text - **100K QA Pairs:** Through refined construction based on constitutions, we obtained 103,012 QA pairs, with answers generated by GPT-4o. - **Beaver-Vision-11B:** Leveraging our high-quality TI2T Instruction-Following dataset, we trained a more powerful multimodal foundation model, [Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B), which outperforms [LLaMA-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). ### 1. Dataset Summary `Align-Anything` is a comprehensive multi-modal dataset covering text, image, video, audio, and more. It includes input-output pairs involving single or mixed modalities, designed for the development of all-modality foundation models and for research in the multi-modal alignment community. This portion of the dataset focuses primarily on the `Text-Image-to-Text Instruction-Following` subset. It contains **103012** finely annotated QA-pairs. The images used as input are sourced from various well-known datasets, including: - [LAION-5B](https://laion.ai/blog/laion-5b/) - [Lexica](https://lexica.art/) - [LLaVA-Instruct-150K](https://laion.ai/blog/laion-5b/) - [COCO](https://huggingface.co/datasets/detection-datasets/coco) - [VQAv2](https://visualqa.org/) - [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) - [ShareGPT4V-textvqa](https://github.com/ShareGPT4Omni/ShareGPT4V) - [UnsafeBench](https://huggingface.co/datasets/yiting/UnsafeBench) Based on carefully crafted prompts, we generated new questions across a range of tasks, including visual question answering (VQA) and complex reasoning, aimed at **enhancing the instruction-following capabilities of multimodal models.** These fine-tuned prompts were used to generate responses from GPT-4o. Through iterative annotation of language feedback and rigorous filtering, we finalized a dataset that can be used for supervised fine-tuning to strengthen multimodal foundation models. Some examples of the task categories included are: - **Simple Dialogue Communication:** General conversational questions designed to initiate engagement with the image. These questions are usually broad and aim to elicit a general response from the model about the image. - **Asking About Image-Related Details:** Specific questions targeting elements or details visible in the image, requiring close observation and precise description. These questions help test the model's ability to identify and describe particular parts of the image. - **Complex Reasoning Based on the Image:** Questions that involve reasoning or logical inference based on the content of the image. These tasks encourage the model to think critically, analyze the image, and draw logical conclusions or infer new information. - **Creative Generation Based on the Image:** Creative tasks prompting the model to generate new ideas or imaginative content inspired by the image. These questions are designed to enhance the model’s creative thinking and its ability to extend or reimagine content in different contexts. - ... ### 2. Usage If you want to use load_dataset(), you can directly use as follows: ```python from datasets import load_dataset train_dataset = load_dataset('PKU-Alignment/Align-Anything-TI2T-Instruction-100K',split='train') ``` ### 3. Dataset Analysis ### 3.1 Prompt Distribution We visualized the prompt distribution within the dataset and observed that the prompts are distributed uniformly and widely across various categories. ![Prompt Distribution](./analysis/prompt_distribution.png) ### 3.2 Model Performance Leveraging our high-quality TI2T Instruction-Following dataset, we trained a more powerful multimodal foundation model, [Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B), which outperforms [LLaMA-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). Further performance and training details can be refer to [Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B). ### 4. Data Fields | Idx | Key | Description | |-----|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------| | 0 | `prompt` | The initial question or topic prompting the generation of responses. | | 1 | `image` | The input image of prompt. | | 2 | `response` | The output text response. | | 3 | `prompt_type` | The input prompt's category. | | 4 | `ori_dataset` | The input image source. | ### 5. Dataset Instance ### 5.1 Type: Instruction-Following (29964) **Image** <img src="./analysis/examples/instruction-following.png" alt="instruction-following" style="width:50%;"> **Data** ``` 'ori_dataset': 'created-instruction', 'prompt': 'How do the varying styles of hand sketches in the image influence the overall artistic expression?', 'prompt_type': 'instruction-following', 'response': "The varying styles of hand sketches in the image create a rich tapestry of artistic expression through multiple dimensions:\n\n1. **Technique and Medium**: Each sketch varies in technique and medium, such as pencil shading, cross-hatching, and line drawing. This diversity showcases different skills and approaches, adding depth and complexity to the overall artistic presentation.\n\n2. **Perspective and Composition**: Different angles and positions of the hands offer a dynamic range of perspectives, from realistic to more abstract forms. These varied compositions provide a multifaceted view of the human hand, emphasizing the artist's versatility.\n\n3. **Light and Shadow**: The use of light and shadow varies across the sketches, from high-contrast, dramatic lighting to softer, more gradual shading. This variation helps in highlighting the texture, form, and depth, contributing to a more engaging and lifelike portrayal.\n\n4. **Emotive Quality**: The portrayal of hands in different gestures and poses can evoke distinct emotions and narratives. Some sketches might convey strength or tension, while others might suggest delicacy or relaxation, enriching the interpretative potential of the artwork.\n\n5. **Detail and Simplification**: The degree of detail varies, with some sketches meticulously capturing intricate lines and skin textures, while others use more simplified outlines and forms. This balance between detail and abstraction enhances the visual interest and invites viewers to appreciate both the meticulous craftsmanship and the broader artistic vision.\n\nOverall, the interplay of these varying styles contributes to a complex and compelling artistic expression, showcasing the versatility and breadth of the artist’s skill and creativity." ``` ### 5.2 Type: Complex Reasoning (18347) **Image** <img src="./analysis/examples/complex-reasoning.png" alt="complex-reasoning" style="width:50%;"> **Data** ``` 'ori_dataset': 'sharegpt4v-web-landmark', 'prompt': 'Why do you think the structure was built so close to the water?', 'prompt_type': 'Complex reasoning based on the image', 'response': 'The structure appears to be a fort or defensive building. Such structures were often built close to the water for several reasons:\n\n1. **Defense**: Proximity to water allowed for better defense against naval attacks and provided a strategic vantage point.\n\n2. **Trade and Transportation**: Being near the water facilitated trade and transportation, allowing easy access for ships to dock and unload goods.\n\n3. **Surveillance**: It provided a clear view of incoming ships, helping in monitoring and controlling maritime activities.\n\n4. **Resource Access**: Access to water was crucial for daily needs and could be used as a natural barrier against enemies.\n\nThese factors made waterfront locations ideal for such constructions.' ``` ### 5.3 Type: Simple Dialogue Communication (18285) **Image** <img src="./analysis/examples/simple-dialogue-communication.png" alt="simple-dialogue-communication" style="width:50%;"> **Data** ``` 'ori_dataset': 'COCO', 'prompt': 'What do you think about the outfits of the teddy bears in the image?', 'prompt_type': 'Simple dialogue communication', 'response': 'The teddy bears in the image are dressed quite charmingly. The bear on the left is wearing a black dress with white polka dots and a white lace collar, giving it a classic and elegant look. The bear on the right is dressed in a more formal outfit, featuring a yellow vest, white shirt, and a black jacket, which makes it look quite dapper. Both outfits are well-coordinated and add a lot of character to the teddy bears.', ``` ### 5.4 Type: Creative Generation (18225) **Image** <img src="./analysis/examples/creative-generation.png" alt="creative-generation" style="width:50%;"> **Data** ``` 'ori_dataset': 'sharegpt4v-textvqa', 'prompt': 'If you were to create a modern version of this advertisement, what elements would you include?', 'prompt_type': 'Creative generation based on the image', 'response': "To create a modern version of this advertisement, you could incorporate the following elements:\n\n1. **Modern Design Aesthetics**:\n - Use clean, minimalist design with contemporary fonts and a cohesive color scheme.\n - Incorporate high-quality images and graphics.\n\n2. **Digital Integration**:\n - Include QR codes that link to the store's website or social media pages.\n - Mention online shopping options and home delivery services.\n\n3. **Updated Visuals**:\n - Use a modern photograph or digital rendering of the store.\n - Include images of the interior, showcasing modern amenities and product displays.\n\n4. **Current Offers and Promotions**:\n - Highlight current promotions, discounts, or loyalty programs.\n - Mention any special events or grand opening activities.\n\n5. **Contact Information and Social Media**:\n - Provide updated contact information, including website, email, and phone number.\n - Include social media handles and encourage following for updates and promotions.\n\n6. **Sustainability and Community Focus**:\n - Emphasize any eco-friendly practices or community involvement.\n - Highlight local products or partnerships with local businesses.\n\n7. **Accessibility and Convenience**:\n - Mention features like ample parking, accessibility for people with disabilities, and convenient location.\n - Highlight any additional services like curbside pickup or extended hours.\n\n8. **Engaging Tagline**:\n - Create a catchy and relevant tagline that resonates with modern consumers.\n\nHere’s a conceptual example of how the modern advertisement might look:\n\n---\n\n**Discover the New Dominion Store!**\n\n**Grand Opening!**\n\n**Modern Shopping Experience with Free Parking**\n\n**Visit Us at 4670 Boulevard Décarie**\n\n**Shop Online at www.dominionstore.com**\n\n**Follow Us on Social Media: @DominionStore**\n\n**Special Offers and Promotions Inside!**\n\n**Eco-Friendly Practices and Local Products**\n\n**Join Our Loyalty Program for Exclusive Discounts**\n\n**Open Daily: 9 AM - 9 PM**\n\n**Contact Us: (123) 456-7890 | info@dominionstore.com**\n\n---\n\nThis approach combines traditional advertising elements with modern design and digital integration to appeal to today's consumers.", ``` ### 5.5 Type: Asking about Image-related Details (VQA) (18191) **Image** <img src="./analysis/examples/VQA.png" alt="VQA" style="width:50%;"> **Data** ``` 'ori_dataset': 'COCO', 'prompt': "What food is left on the boy's plate?", 'prompt_type': 'Asking about image-related details', 'response': 'The plate has some pieces of pancake left on it.' ``` ### 6. Citation Please cite our work if you use the data or model in your paper. ``` @misc{align_anything, author = {PKU-Alignment Team}, title = {Align Anything: training all modality models to follow instructions with unified language feedback}, year = {2024}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/PKU-Alignment/align-anything}}, } ```

# Align-Anything：文本-图像至文本指令遵循子集数据集卡片 ## 文本+图像→文本指令遵循数据集 [[🏠 主页](https://github.com/PKU-Alignment/align-anything)] [[🤗 Align-Anything 数据集](https://huggingface.co/datasets/PKU-Alignment/align-anything/)] [[🦫 Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B)] ### 核心亮点 - **输入输出模态**：输入为文本+图像；输出为文本 - **10万级问答对**：通过基于严谨框架的精细化构建，我们共获取103012个问答对，答案均由GPT-4o生成 - **Beaver-Vision-11B**：依托本高质量TI2T指令遵循数据集，我们训练了一款更强大的多模态基础模型[Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B)，其性能优于[LLaMA-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) ### 1. 数据集概览 `Align-Anything`是一款涵盖文本、图像、视频、音频等多模态的综合数据集，包含单模态或混合模态的输入输出对，旨在助力全模态基础模型的开发以及多模态对齐领域的研究。本数据集的该子集专注于**文本-图像至文本指令遵循**分支，共包含103012个经过精细标注的问答对。作为输入的图像源自多个知名数据集，包括： - [LAION-5B](https://laion.ai/blog/laion-5b/) - [Lexica](https://lexica.art/) - [LLaVA-Instruct-150K](https://laion.ai/blog/laion-5b/) - [COCO](https://huggingface.co/datasets/detection-datasets/coco) - [VQAv2](https://visualqa.org/) - [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) - [ShareGPT4V-textvqa](https://github.com/ShareGPT4Omni/ShareGPT4V) - [UnsafeBench](https://huggingface.co/datasets/yiting/UnsafeBench) 我们基于精心设计的提示词（prompt）生成了涵盖多种任务的全新问题，包括视觉问答（Visual Question Answering, VQA）与复杂推理任务，旨在**提升多模态模型的指令遵循能力**。我们使用这些经过微调的提示词由GPT-4o生成回复，并通过迭代的语言反馈标注与严格的筛选流程，最终得到可用于监督微调以增强多模态基础模型性能的高质量数据集。本数据集包含的任务类别示例如下： - **简单对话交互**：旨在与图像展开互动的通用会话类问题，通常表述宽泛，旨在引导模型针对图像给出一般性回复。 - **查询图像相关细节**：针对图像中可见元素或细节的精准问题，需要模型进行细致观察与精确描述，用于检验模型识别并描述图像特定部分的能力。 - **基于图像的复杂推理**：需要基于图像内容进行推理或逻辑推演的问题，旨在促使模型进行批判性思考、分析图像内容并得出逻辑结论或推断新信息。 - **基于图像的创意生成**：引导模型基于图像生成全新创意或想象性内容的任务，旨在提升模型的创意思维能力，以及在不同语境下拓展或重构内容的能力。 - ... ### 2. 使用方法若您希望使用`load_dataset()`加载本数据集，可直接采用如下方式： python from datasets import load_dataset train_dataset = load_dataset('PKU-Alignment/Align-Anything-TI2T-Instruction-100K',split='train') ### 3. 数据集分析 ### 3.1 提示词分布我们对数据集内的提示词分布进行了可视化，发现其在各类别间分布均匀且覆盖广泛。 ![提示词分布](./analysis/prompt_distribution.png) ### 3.2 模型性能依托本高质量TI2T指令遵循数据集，我们训练了一款更强大的多模态基础模型[Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B)，其性能优于[LLaMA-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)。更多性能与训练细节可参考[Beaver-Vision-11B](https://huggingface.co/PKU-Alignment/Beaver-Vision-11B)。 ### 4. 数据字段 | 序号 | 字段名 | 描述 | |-----|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------| | 0 | `prompt` | 用于引导生成回复的初始问题或主题。 | | 1 | `image` | 与提示词对应的输入图像。 | | 2 | `response` | 输出的文本回复。 | | 3 | `prompt_type` | 输入提示词的任务类别。 | | 4 | `ori_dataset` | 输入图像的来源数据集。 | ### 5. 数据集示例 ### 5.1 类型：指令遵循（29964条） **图像示例** <img src="./analysis/examples/instruction-following.png" alt="instruction-following" style="width:50%;"> **数据示例** 'ori_dataset': 'created-instruction', 'prompt': "How do the varying styles of hand sketches in the image influence the overall artistic expression?", 'prompt_type': 'instruction-following', 'response': "The varying styles of hand sketches in the image create a rich tapestry of artistic expression through multiple dimensions: 1. **Technique and Medium**: Each sketch varies in technique and medium, such as pencil shading, cross-hatching, and line drawing. This diversity showcases different skills and approaches, adding depth and complexity to the overall artistic presentation. 2. **Perspective and Composition**: Different angles and positions of the hands offer a dynamic range of perspectives, from realistic to more abstract forms. These varied compositions provide a multifaceted view of the human hand, emphasizing the artist's versatility. 3. **Light and Shadow**: The use of light and shadow varies across the sketches, from high-contrast, dramatic lighting to softer, more gradual shading. This variation helps in highlighting the texture, form, and depth, contributing to a more engaging and lifelike portrayal. 4. **Emotive Quality**: The portrayal of hands in different gestures and poses can evoke distinct emotions and narratives. Some sketches might convey strength or tension, while others might suggest delicacy or relaxation, enriching the interpretative potential of the artwork. 5. **Detail and Simplification**: The degree of detail varies, with some sketches meticulously capturing intricate lines and skin textures, while others use more simplified outlines and forms. This balance between detail and abstraction enhances the visual interest and invites viewers to appreciate both the meticulous craftsmanship and the broader artistic vision. Overall, the interplay of these varying styles contributes to a complex and compelling artistic expression, showcasing the versatility and breadth of the artist’s skill and creativity." ### 5.2 类型：复杂推理（18347条） **图像示例** <img src="./analysis/examples/complex-reasoning.png" alt="complex-reasoning" style="width:50%;"> **数据示例** 'ori_dataset': 'sharegpt4v-web-landmark', 'prompt': "Why do you think the structure was built so close to the water?", 'prompt_type': 'Complex reasoning based on the image', 'response': 'The structure appears to be a fort or defensive building. Such structures were often built close to the water for several reasons: 1. **Defense**: Proximity to water allowed for better defense against naval attacks and provided a strategic vantage point. 2. **Trade and Transportation**: Being near the water facilitated trade and transportation, allowing easy access for ships to dock and unload goods. 3. **Surveillance**: It provided a clear view of incoming ships, helping in monitoring and controlling maritime activities. 4. **Resource Access**: Access to water was crucial for daily needs and could be used as a natural barrier against enemies. These factors made waterfront locations ideal for such constructions.' ### 5.3 类型：简单对话交互（18285条） **图像示例** <img src="./analysis/examples/simple-dialogue-communication.png" alt="simple-dialogue-communication" style="width:50%;"> **数据示例** 'ori_dataset': 'COCO', 'prompt': "What do you think about the outfits of the teddy bears in the image?", 'prompt_type': 'Simple dialogue communication', 'response': 'The teddy bears in the image are dressed quite charmingly. The bear on the left is wearing a black dress with white polka dots and a white lace collar, giving it a classic and elegant look. The bear on the right is dressed in a more formal outfit, featuring a yellow vest, white shirt, and a black jacket, which makes it look quite dapper. Both outfits are well-coordinated and add a lot of character to the teddy bears.', ### 5.4 类型：创意生成（18225条） **图像示例** <img src="./analysis/examples/creative-generation.png" alt="creative-generation" style="width:50%;"> **数据示例** 'ori_dataset': 'sharegpt4v-textvqa', 'prompt': "If you were to create a modern version of this advertisement, what elements would you include?", 'prompt_type': 'Creative generation based on the image', 'response": "To create a modern version of this advertisement, you could incorporate the following elements: 1. **Modern Design Aesthetics**: - Use clean, minimalist design with contemporary fonts and a cohesive color scheme. - Incorporate high-quality images and graphics. 2. **Digital Integration**: - Include QR codes that link to the store's website or social media pages. - Mention online shopping options and home delivery services. 3. **Updated Visuals**: - Use a modern photograph or digital rendering of the store. - Include images of the interior, showcasing modern amenities and product displays. 4. **Current Offers and Promotions**: - Highlight current promotions, discounts, or loyalty programs. - Mention any special events or grand opening activities. 5. **Contact Information and Social Media**: - Provide updated contact information, including website, email, and phone number. - Include social media handles and encourage following for updates and promotions. 6. **Sustainability and Community Focus**: - Emphasize any eco-friendly practices or community involvement. - Highlight local products or partnerships with local businesses. 7. **Accessibility and Convenience**: - Mention features like ample parking, accessibility for people with disabilities, and convenient location. - Highlight any additional services like curbside pickup or extended hours. 8. **Engaging Tagline**: - Create a catchy and relevant tagline that resonates with modern consumers. Here’s a conceptual example of how the modern advertisement might look: --- **Discover the New Dominion Store!** **Grand Opening!** **Modern Shopping Experience with Free Parking** **Visit Us at 4670 Boulevard Décarie** **Shop Online at www.dominionstore.com** **Follow Us on Social Media: @DominionStore** **Special Offers and Promotions Inside!** **Eco-Friendly Practices and Local Products** **Join Our Loyalty Program for Exclusive Discounts** **Open Daily: 9 AM - 9 PM** **Contact Us: (123) 456-7890 | info@dominionstore.com** --- This approach combines traditional advertising elements with modern design and digital integration to appeal to today's consumers.", ### 5.5 类型：查询图像相关细节（VQA）（18191条） **图像示例** <img src="./analysis/examples/VQA.png" alt="VQA" style="width:50%;"> **数据示例** 'ori_dataset': 'COCO', 'prompt': "What food is left on the boy's plate?", 'prompt_type': 'Asking about image-related details', 'response': 'The plate has some pieces of pancake left on it.' ### 6. 引用声明若您在论文中使用本数据集或模型，请引用我们的工作： @misc{align_anything, author = {PKU-Alignment Team}, title = {Align Anything: training all modality models to follow instructions with unified language feedback}, year = {2024}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {url{https://github.com/PKU-Alignment/align-anything}}, }

提供机构：

maas

创建时间：

2025-02-07

5,000+

优质数据集

54 个

任务类型

进入经典数据集