likaixin/MMCode

Name: likaixin/MMCode
Creator: likaixin
Published: 2024-04-16 02:36:49
License: No description provided

Source: Hugging Face. Updated 2024-04-16; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/likaixin/MMCode

Resource overview:

MMCode is a multimodal code generation dataset for evaluating how well code language models solve problems in visually rich contexts. It contains 3,548 problems paired with 6,620 images, sourced from 10 coding competition websites, and provides Python solutions and test cases. The dataset emphasizes the heavy demand on reasoning ability, the interleaving of textual and visual content, and the presence of problems with multiple images. Its primary language is English, with some problems translated from Japanese, and the programming language is Python. Each problem includes a problem statement, associated images, Python solutions, test cases, and other metadata. Key features are multimodal challenges, rich diversity, detailed annotations, and an automated testing framework. The test split consists of 263 problems for efficient evaluation. Data collection involved gathering problems and images from the 10 coding platforms, ensuring data quality through automated filtering and manual review, and annotating the images by category. Detailed lists of data sources and licenses are also provided.

Provider:

likaixin

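Summary of original information

Dataset description

MMCode is a multimodal code generation dataset designed to evaluate the problem-solving ability of code language models in visually rich contexts (that is, contexts containing images). The dataset contains 3,548 problems and 6,620 images, sourced from 10 coding competition websites, and provides Python solutions and tests. It emphasizes the heavy demand on reasoning ability, the interleaving of textual and visual content, and the presence of problems with multiple images.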

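Languages

The primary language of the dataset content is English; some problems are translated from Japanese (the original data is provided for reference).
The programming language is Python.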

Dataset structure

Each problem includes a programming problem statement, associated images, Python solutions, test cases, and other metadata.

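Data fields

Field	Type	Description
problem_id	string	Unique identifier of the problem, e.g. cf_1_A.
name	string	Descriptive title of the programming problem, crawled from the problem page.
source	string	Platform the problem originates from.
url	string	Link to the original problem on the source website.
question	string	Textual description of the programming challenge, outlining the problem background, goal, and constraints. Images are represented as Markdown tags.
raw_problem	string	Raw HTML of the problem description, including the formatting and embedded image tags of the online version.
solutions	list[string]	List of strings, each a solution to the problem.
input_output	dict[string, list]	Dictionary containing test inputs and expected outputs.
images	list[string]	Array of base64-encoded images associated with the problem.
picture_num	int	Number of images in the problem.
image_tags	list	List of categories assigned to the problem's images.
starter_code*	string	Starter code for the problem; an empty string if none is provided.
difficulty*	string	Difficulty level of the problem.
raw_tags*	list	Raw tags associated with the problem.
tags*	list	List of strings naming the categories or concepts the problem involves.
skill_types*	list	List of strings classifying the problem by the skills or knowledge areas needed to solve it.
Expected Time Complexity*	string	Expected time complexity of the solution, if applicable.
Expected Auxiliary Space*	string	Auxiliary space requirement of the solution, if applicable.
time_limit*^	string	Maximum allowed execution time for a solution.
memory_limit*^	string	Maximum allowed memory usage for a solution.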

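* These fields are inherited from the TACO dataset.

^ These limits are not taken into account by the testing framework because of hardware differences.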
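The field table lists `solutions` and `input_output` as structured values, while the data example on this page shows them as JSON-encoded strings. The sketch below is one hedged way to handle both cases and to turn the base64 `images` entries into raw bytes. The helper names `_maybe_json` and `decode_record` are hypothetical, and the record used here is synthetic, not taken from the dataset.

```python
import base64
import json


def _maybe_json(value):
    """Parse a JSON-encoded string; pass through values that are already decoded."""
    if isinstance(value, str):
        return json.loads(value)
    return value


def decode_record(record):
    """Return solutions, test cases, and raw image bytes for one problem record.

    Assumption: `solutions`, `input_output`, and `images` are either already
    decoded or stored as JSON strings, as the example record suggests.
    """
    solutions = _maybe_json(record["solutions"])
    tests = _maybe_json(record["input_output"])
    images = [base64.b64decode(img) for img in _maybe_json(record["images"])]
    return {
        "solutions": solutions,
        "inputs": tests["inputs"],
        "outputs": tests["outputs"],
        "images": images,
    }


# Synthetic record for illustration only (not real dataset content).
record = {
    "solutions": json.dumps(["print(1)"]),
    "input_output": json.dumps({"inputs": ["13 3 1\n"], "outputs": ["3\n"]}),
    "images": [base64.b64encode(b"fake-image-bytes").decode("ascii")],
}
decoded = decode_record(record)
print(len(decoded["solutions"]), len(decoded["images"]))
```

The decoded image bytes can then be handed to any image library; that step is left out to keep the sketch dependency-free.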

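Key features

Multimodal challenges: MMCode is the first code generation work to combine textual and visual information, requiring models to interpret and integrate both modalities to solve problems.
Rich diversity: It contains 3,548 problems and 6,620 images from 10 different coding competition websites, offering a diverse set of real-world programming challenges.
Detailed annotations: The dataset includes detailed image annotations that classify images as "Linear Data Structure", "Tree", "Graph", "2D Geometry", "3D Geometry", "Chessboard", "Map", "Patterns", "Math", "Table", "Pseudocode", and "Others", which allows fine-grained analysis of model performance across types of visual information.
Automated testing framework: Each problem comes with input-output pairs that serve as automated test cases, rigorously checking the correctness of model-generated solutions.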
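As a rough illustration of how input-output pairs can act as automated test cases, the sketch below runs a candidate Python program on each test input and compares its standard output with the expected output. This is not the dataset's own testing framework: the function names `run_solution` and `passes_all`, the whitespace-trimmed comparison, and the 10-second timeout are assumptions made for the example. The candidate program and the three test pairs are written to match the `input_output` values shown in the data example on this page.

```python
import subprocess
import sys


def run_solution(code, stdin_text, timeout=10.0):
    """Run `code` as a standalone Python program and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


def passes_all(code, inputs, outputs):
    """True if the program matches every expected output (whitespace-trimmed)."""
    for stdin_text, expected in zip(inputs, outputs):
        try:
            actual = run_solution(code, stdin_text)
        except subprocess.TimeoutExpired:
            return False
        if actual.strip() != expected.strip():
            return False
    return True


# Hypothetical candidate solution for the seat problem in the data example.
candidate = "X, Y, Z = map(int, input().split())\nprint((X - Z) // (Y + Z))\n"
inputs = ["13 3 1\n", "12 3 1\n", "100000 1 1\n"]
outputs = ["3\n", "2\n", "49999\n"]
print(passes_all(candidate, inputs, outputs))
```

Model-generated code is untrusted, so a real harness would run it inside a sandbox; this sketch also does not enforce the per-problem time or memory limits.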

Data splits

The test set consists of 263 problems, which allows efficient evaluation.

Data example

{
  "solutions": "["(X, Y, Z) = map(int, input().split( )) ... "]",
  "starter_code": "",
  "input_output": "{"inputs": ["13 3 1\n", "12 3 1\n", "100000 1 1\n"], "outputs": ["3\n", "2\n", "49999\n"]}",
  "difficulty": "EASY",
  "raw_tags": "[]",
  "name": "AtCoder Beginner Contest 078 - ISU",
  "source": "atcoder",
  "tags": "[]",
  "skill_types": "[]",
  "url": "https://atcoder.jp/contests/abc078/tasks/abc078_b",
  "Expected Auxiliary Space": null,
  "time_limit": "2.0 seconds",
  "memory_limit": "256.0 megabytes",
  "Expected Time Complexity": null,
  "raw_problem": "<div id="task-statement"> <span class="lang">...There is just enough room for three, as shown below:</p> <div style="text-align: center;"> <img src="https://img.atcoder.jp/abc078/4a35302937c3cbc2f625156e7834d27f.png"> <p>Figure</p> </img>...</div>",
  "question": "We have a long seat of width $X$ centimeters. There are many people who wants to sit here. A person sitting on the seat will always occupy an interval of length $Y$ centimeters. We would like to seat as many people as possible, but they are all very shy, and there must be a gap of length at least $Z$ centimeters between two people, and between the end of the seat and a person. At most how many people can sit on the seat?

Constraints

All input values are integers.
$1 \leq X, Y, Z \leq 10^5$
$Y+2Z \leq X$

Input Input is given from Standard Input in the following format: $X$ $Y$ $Z$

Output Print the answer.

Sample Input 1 13 3 1

Sample Output 1 3

There is just enough room for three, as shown below:

Figure

Sample Input 2 12 3 1

Sample Output 2 2

Sample Input 3 100000 1 1

Sample Output 3 49999

Sample Input 4 64146 123 456

Sample Output 4 110

Sample Input 5 64145 123 456

Sample Output 5 109",
  "images": "[<base 64 image>]",
  "picture_num": 1,
  "image_tags": [ "Demonstration" ],
  "data_split": "core_test"
}
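For readers who want to check the sample outputs in this record by hand: seating n people needs n intervals of length Y plus n + 1 gaps of length Z, so n·Y + (n + 1)·Z ≤ X, which gives n = floor((X − Z) / (Y + Z)). The function below is a sketch of that reasoning, not the solution stored in the record's `solutions` field (which is elided above); the name `max_seated` is hypothetical.

```python
def max_seated(x, y, z):
    """Largest n such that n*y + (n+1)*z <= x."""
    return (x - z) // (y + z)


# The five samples quoted in the problem statement above.
samples = [
    ((13, 3, 1), 3),
    ((12, 3, 1), 2),
    ((100000, 1, 1), 49999),
    ((64146, 123, 456), 110),
    ((64145, 123, 456), 109),
]
for args, expected in samples:
    assert max_seated(*args) == expected
```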

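Collected summary

Dataset introduction

Background and challenges

Background overview

MMCode is a multimodal code generation dataset of 3,548 problems paired with 6,620 images, sourced from 10 coding competition websites, with Python solutions and tests. It emphasizes reasoning ability and the interleaving of textual and visual content, and is suited to evaluating model performance in multimodal settings.

The content above was collected and summarized by 遇见数据集.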