codeforces-submissions

ModelScope community; updated 2026-01-06; indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/open-r1/codeforces-submissions
Resource description:
# Dataset Card for CodeForces-Submissions

## Dataset description

[CodeForces](https://codeforces.com/) is one of the most popular websites among competitive programmers, hosting regular contests where participants must solve challenging algorithmic optimization problems. The challenging nature of these problems makes them an interesting dataset to improve and test models’ code reasoning capabilities.

This dataset includes millions of real user (human) code submissions to the CodeForces website.

## Subsets

Different subsets are available:

- `default`: all available submissions (covers 9906 problems)
- `selected_accepted`: a subset of `default`, with submissions that we were able to execute and that passed all the public tests in [`open-r1/codeforces`](https://huggingface.co/datasets/open-r1/codeforces) (covers 9183 problems)
- `selected_incorrect`: a subset of `default`, with submissions that, while not having an Accepted verdict, passed at least one of the public test cases from [`open-r1/codeforces`](https://huggingface.co/datasets/open-r1/codeforces). We selected up to 10 per problem, choosing the submissions that passed the most tests (covers 2385 problems)

## Data fields

- `submission_id` (str): unique submission ID
- `source` (str): source code for this submission. Potentially modified to run on newer C++/Python versions.
- `contestId` (str): the ID of the contest this solution belongs to
- `problem_index` (str): usually a letter, or a letter followed by digit(s), indicating the problem index in a contest
- `problem_id` (str): in the format of `contestId/problem_index`.
  We account for duplicated problems/aliases, so this will match a problem in `open-r1/codeforces`
- `programmingLanguage` (str): one of `Ada`, `Befunge`, `C# 8`, `C++14 (GCC 6-32)`, `C++17 (GCC 7-32)`, `C++17 (GCC 9-64)`, `C++20 (GCC 11-64)`, `C++20 (GCC 13-64)`, `Clang++17 Diagnostics`, `Clang++20 Diagnostics`, `Cobol`, `D`, `Delphi`, `F#`, `FALSE`, `FPC`, `Factor`, `GNU C`, `GNU C++`, `GNU C++0x`, `GNU C++11`, `GNU C++17 Diagnostics`, `GNU C11`, `Go`, `Haskell`, `Io`, `J`, `Java 11`, `Java 21`, `Java 6`, `Java 7`, `Java 8`, `JavaScript`, `Kotlin 1.4`, `Kotlin 1.5`, `Kotlin 1.6`, `Kotlin 1.7`, `Kotlin 1.9`, `MS C#`, `MS C++`, `MS C++ 2017`, `Mono C#`, `Mysterious Language`, `Node.js`, `OCaml`, `PHP`, `PascalABC.NET`, `Perl`, `Picat`, `Pike`, `PyPy 2`, `PyPy 3`, `PyPy 3-64`, `Python 2`, `Python 3`, `Python 3 + libs`, `Q#`, `Roco`, `Ruby`, `Ruby 3`, `Rust`, `Rust 2021`, `Scala`, `Secret 2021`, `Secret_171`, `Tcl`, `Text`, `Unknown`, `UnknownX`, `null`
- `verdict` (str): one of `CHALLENGED`, `COMPILATION_ERROR`, `CRASHED`, `FAILED`, `IDLENESS_LIMIT_EXCEEDED`, `MEMORY_LIMIT_EXCEEDED`, `OK`, `PARTIAL`, `REJECTED`, `RUNTIME_ERROR`, `SKIPPED`, `TESTING`, `TIME_LIMIT_EXCEEDED`, `WRONG_ANSWER`
- `testset` (str): the test set used for judging the submission. Can be one of `CHALLENGES`, `PRETESTS`, `TESTS`, or `TESTSxx`
- `passedTestCount` (int): the number of test cases this submission passed when it was submitted
- `timeConsumedMillis` (int): the maximum time, in milliseconds, consumed by the solution on a single test
- `memoryConsumedBytes` (int): the maximum memory, in bytes, consumed by the solution on a single test
- `creationTimeSeconds` (int): the time when the submission was created, as a Unix timestamp
- `original_code` (str): can differ from `source` if it was modified to run on newer C++/Python versions
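
The field descriptions above can be turned into a small helper that derives human-readable values from one record. This is only a sketch: `describe_submission` and `example_row` are hypothetical names introduced here, the record is made up for illustration, and the code assumes every field it reads is present and non-null.

```python
from datetime import datetime, timezone


def describe_submission(row: dict) -> dict:
    """Derive a few convenience values from one submission record."""
    # `problem_id` is documented as `contestId/problem_index`; because of
    # alias handling its contest part may differ from the `contestId` field.
    contest_id, _, problem_index = row["problem_id"].partition("/")
    return {
        "contest_id": contest_id,
        "problem_index": problem_index,
        # `OK` is the verdict value for an accepted submission.
        "accepted": row["verdict"] == "OK",
        # `creationTimeSeconds` is a Unix timestamp; interpret it as UTC.
        "created_at": datetime.fromtimestamp(
            row["creationTimeSeconds"], tz=timezone.utc
        ),
        "time_seconds": row["timeConsumedMillis"] / 1000,
        "memory_mib": row["memoryConsumedBytes"] / (1024 * 1024),
        # `source` may have been rewritten to run on newer C++/Python versions.
        "source_was_modified": row["source"] != row["original_code"],
    }


# Hypothetical record, made up for illustration only (not taken from the dataset).
example_row = {
    "submission_id": "123456",
    "source": "print(sum(map(int, input().split())))",
    "original_code": "print sum(map(int, raw_input().split()))",
    "contestId": "1000",
    "problem_index": "A",
    "problem_id": "1000/A",
    "programmingLanguage": "Python 3",
    "verdict": "OK",
    "testset": "TESTS",
    "passedTestCount": 12,
    "timeConsumedMillis": 46,
    "memoryConsumedBytes": 2097152,
    "creationTimeSeconds": 1700000000,
}

info = describe_submission(example_row)
```

Because the helper takes a plain `dict`, it should also work on rows obtained by iterating over a loaded subset, provided the columns match the field list above.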
## Data sources

We compiled real user (human) submissions to the CodeForces website from multiple sources:

- [`agrigorev/codeforces-code` (kaggle)](https://www.kaggle.com/datasets/agrigorev/codeforces-code)
- [`yeoyunsianggeremie/codeforces-code-dataset` (kaggle)](https://www.kaggle.com/datasets/yeoyunsianggeremie/codeforces-code-dataset/data)
- [`MatrixStudio/Codeforces-Python-Submissions`](https://hf.co/datasets/MatrixStudio/Codeforces-Python-Submissions)
- [`Jur1cek/codeforces-dataset` (github)](https://github.com/Jur1cek/codeforces-dataset/tree/main)
- [miningprogcodeforces](https://sites.google.com/site/miningprogcodeforces/home/dataset?authuser=0)
- [itshared.org](https://www.itshared.org/2015/12/codeforces-submissions-dataset-for.html)
- [`ethancaballero/description2code` (github)](https://github.com/ethancaballero/description2code)
- our own crawling (mostly for recent problems)

## Using the dataset

You can load the dataset as follows:

```python
from datasets import load_dataset

# Load the `default` subset (all available submissions).
ds = load_dataset("open-r1/codeforces-submissions", split="train")

# Or load a specific subset by name.
ds = load_dataset("open-r1/codeforces-submissions", split="train", name="selected_accepted")
```

See other CodeForces related datasets in [this collection](https://huggingface.co/collections/open-r1/codeforces-68234ed24aa9d65720663bd2).

## License

The dataset is licensed under the Open Data Commons Attribution License (ODC-By) 4.0.

## Citation

If you find CodeForces useful in your work, please consider citing it as:

```
@misc{penedo2025codeforces,
    title={CodeForces},
    author={Guilherme Penedo and Anton Lozhkov and Hynek Kydlíček and Loubna Ben Allal and Edward Beeching and Agustín Piqueres Lajarín and Quentin Gallouédec and Nathan Habib and Lewis Tunstall and Leandro von Werra},
    year={2025},
    publisher = {Hugging Face},
    journal = {Hugging Face repository},
    howpublished = {\url{https://huggingface.co/datasets/open-r1/codeforces}}
}
```

Provider:
maas
Created:
2025-04-24
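Background and challenges
Background overview
This dataset contains millions of real user code submissions from the CodeForces competition platform and is intended to improve and test models' code reasoning capabilities. It provides several subsets covering submissions with different test outcomes. Its data fields record the submission ID, source code, programming language, test results, and related information, which makes it suitable for algorithmic code analysis and machine learning research.
The summary above was collected and generated by 遇见数据集.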