codeforces
Source: ModelScope community (updated 2026-04-27, indexed 2025-03-15)
Download link: https://modelscope.cn/datasets/AI-ModelScope/codeforces
# Dataset Card for CodeForces
## Dataset description
[CodeForces](https://codeforces.com/) is one of the most popular websites among competitive programmers, hosting regular contests where participants must solve challenging algorithmic optimization problems. The challenging nature of these problems makes them an interesting dataset to improve and test models’ code reasoning capabilities. This dataset includes more than **10k unique problems** covering the very first contests all the way to 2025. Additionally, we generated and extensively validated custom checker code for problems with multiple solutions, and generated additional challenging test cases.
## Why most datasets are actually not "verifiable"
Competitive programming problems require solutions to solve any valid test case within the input constraints within certain time and memory limits. Usually, the most challenging test cases requiring truly optimized (and correct) solutions consist of **very large test inputs**.
While contest platforms like CodeForces let users see test cases, they **truncate** them to at most \~400 characters. This means that the problems in "verifiable" datasets often contain only the simplest test cases, which an easy brute-force solution can pass.
Additionally, many problems (~30%) allow multiple correct answers and therefore require a special program, a "checker", to validate whether a user-submitted answer is correct.
Fully verifiable problems thus require two things: being able to properly validate test cases (checkers), and having challenging test cases that ensure solutions are correct.
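
To make the checker requirement concrete, here is a minimal sketch built around a hypothetical problem that is not taken from the dataset: given `n`, print any two positive integers `a` and `b` with `a + b = n`. The function names `diff_check` and `checker` are illustrative only. A plain comparison against the stored output rejects a valid alternative answer, while a checker that re-verifies the condition accepts it.

```python
def diff_check(expected_output: str, submitted_output: str) -> bool:
    """Naive validation: token-by-token comparison with the stored output."""
    return expected_output.split() == submitted_output.split()


def checker(problem_input: str, submitted_output: str) -> bool:
    """Hypothetical checker: accept any positive pair (a, b) with a + b == n."""
    n = int(problem_input.split()[0])
    tokens = submitted_output.split()
    if len(tokens) != 2:
        return False
    try:
        a, b = int(tokens[0]), int(tokens[1])
    except ValueError:
        return False
    return a > 0 and b > 0 and a + b == n


problem_input = "10\n"
reference = "1 9\n"    # the single answer stored as the "correct" output
alternative = "4 6\n"  # a different, equally valid answer

print(diff_check(reference, alternative))   # the diff rejects the valid answer
print(checker(problem_input, alternative))  # the checker accepts it
```

The checkers shipped with the dataset follow a file-based command line convention instead (see `generated_checker` under "Data fields").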
## What sets this dataset apart
Besides covering more problems than previous efforts, this dataset:
- **editorials**: includes *editorials*, explanations of the correct solution written by the contest organizers (60% of problems)
- **latex images**: has proper text versions of LaTeX-rendered equations (images), OCR'd with `Qwen/Qwen2.5-VL-7B-Instruct`
- **checkers**: to make sure we could evaluate problems with multiple correct answers, we:
    1. Used real human contestant solutions to make sure problems were executable
    2. With those solutions, determined which problems might require a custom checker (when some correct solutions produce a result that does not match the "correct" output)
    3. Generated 20-100 custom checkers using DeepSeek-R1 (depending on the problem)
    4. Tested them until one properly validated the real human correct solutions (judged their outputs to be correct)
- **additional test cases**: to ensure proper verifiability, we generated additional test cases for problems whose public test cases had been truncated:
    1. Had DeepSeek-R1 create test case generators covering tricky edge cases and making full use of the input limits. Coming up with test cases is considerably easier than fully solving the problem
    2. Used one of the correct human contestant solutions to obtain the "correct output" of each test case
    3. Ran multiple correct human solutions on each test case input and output, and removed test cases where the correct solutions did not agree on the result
    4. Using some incorrect solutions (when available) that passed multiple public tests (and sometimes even all public test cases), ordered the generated test cases by how hard they are (e.g., if all incorrect solutions pass a test, it is quite easy) and also by input size, under the assumption that larger inputs correspond to harder test cases
    5. Selected the hardest test cases (according to our heuristic), as well as some randomly sampled test cases of differing difficulties, to ensure intermediate rewards as model solutions improve
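
The filtering, ordering, and selection steps for additional test cases can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the record layout (`correct_outputs`, `incorrect_passed`), the exact-match agreement test, the treatment of tests with no incorrect solutions, and the function names are all hypothetical.

```python
import random


def filter_consistent(tests):
    """Keep only tests on which every correct solution produced the same output.

    Assumption: agreement is judged by exact match after stripping whitespace.
    """
    return [t for t in tests if len({o.strip() for o in t["correct_outputs"]}) == 1]


def rank_by_difficulty(tests):
    """Order tests hardest first: fewer incorrect solutions pass, then larger input."""
    def key(t):
        passes = t["incorrect_passed"]
        # Assumption: with no incorrect solutions available, fall back to input size only.
        pass_rate = sum(passes) / len(passes) if passes else 1.0
        return (pass_rate, -len(t["input"]))
    return sorted(tests, key=key)


def select_tests(ranked, top_k, extra, seed=0):
    """Take the top_k hardest tests plus a random sample of the remaining ones."""
    rest = ranked[top_k:]
    sampled = random.Random(seed).sample(rest, min(extra, len(rest)))
    return ranked[:top_k] + sampled


tests = [
    {"name": "small", "input": "1\n", "correct_outputs": ["1", "1"], "incorrect_passed": [True, True]},
    {"name": "large", "input": "9" * 1000, "correct_outputs": ["7", "7"], "incorrect_passed": [False, True]},
    {"name": "ambiguous", "input": "5\n", "correct_outputs": ["2", "3"], "incorrect_passed": [False, False]},
    {"name": "medium", "input": "9" * 100, "correct_outputs": ["4", "4"], "incorrect_passed": [True, True]},
]
ranked = rank_by_difficulty(filter_consistent(tests))
print([t["name"] for t in ranked])  # ['large', 'medium', 'small']
```

Sampling some easier tests alongside the hardest ones is what gives partially correct solutions a non-zero score.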
## Subsets
We have split the dataset into three subsets:
- `default`: all problems (~10k problems)
- `verifiable`: only problems that are `executable` AND that either have `official_tests_complete` or some `generated_tests` available (8760 problems)
- `verifiable-prompts`: same as `verifiable` but with 2 generation prompts per problem (one for Python and one for C++). See columns `language` and `prompt`
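
The `verifiable` criterion stated above can be written as a simple predicate over a row of the `default` subset. This is a sketch that mirrors the description, not the script used to build the subset; the function name `is_verifiable` is an assumption, and `generated_tests` is treated as the integer count described under "Data fields".

```python
def is_verifiable(problem: dict) -> bool:
    """True if the problem is executable and has complete official or generated tests."""
    has_generated = (problem.get("generated_tests") or 0) > 0
    return bool(problem.get("executable")) and (
        bool(problem.get("official_tests_complete")) or has_generated
    )


rows = [
    {"executable": True, "official_tests_complete": True, "generated_tests": 0},
    {"executable": True, "official_tests_complete": False, "generated_tests": 12},
    {"executable": True, "official_tests_complete": False, "generated_tests": 0},
    {"executable": False, "official_tests_complete": True, "generated_tests": 5},
]
print([is_verifiable(r) for r in rows])  # [True, True, False, False]
```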
## Downloading generated tests
Due to their large size (~110GB), you need to download the generated test cases separately:
```bash
pip install -U huggingface_hub[cli,hf_xet]
# change PATH_TO_SAVE_TESTCASES. Increase --max-workers according to your machine's capacity
huggingface-cli download open-r1/codeforces --repo-type=dataset --include='generated_tests/*.parquet' --max-workers=8 --local-dir PATH_TO_SAVE_TESTCASES
```
Test cases are split per contest and named `test_cases_XXXX.parquet`, where `XXXX` is the contest ID. Each parquet file has 4 columns: `problem_id` (to be matched with `id` in the main dataset), `input` (str), `output` (str) and `test_case_i` (int).
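
Once downloaded, the rows of a per-contest file can be grouped by `problem_id` to match them against the main dataset. The sketch below assumes `pandas` with a parquet engine (such as `pyarrow`) is installed; the helper names and the example IDs are hypothetical.

```python
from collections import defaultdict


def group_tests_by_problem(rows):
    """Group test-case rows by problem_id, ordered by test_case_i within each problem."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["problem_id"]].append(row)
    for tests in grouped.values():
        tests.sort(key=lambda r: r["test_case_i"])
    return dict(grouped)


def load_contest_tests(parquet_path):
    """Read one test_cases_XXXX.parquet file and group its rows by problem."""
    import pandas as pd  # imported lazily so the grouping helper works without pandas

    return group_tests_by_problem(pd.read_parquet(parquet_path).to_dict("records"))


# Hypothetical rows with the four documented columns.
rows = [
    {"problem_id": "P1", "input": "2\n", "output": "4\n", "test_case_i": 1},
    {"problem_id": "P2", "input": "7\n", "output": "1\n", "test_case_i": 0},
    {"problem_id": "P1", "input": "1\n", "output": "2\n", "test_case_i": 0},
]
grouped = group_tests_by_problem(rows)
print({pid: [t["test_case_i"] for t in tests] for pid, tests in grouped.items()})
```

The generated tests for a dataset row are then `grouped.get(problem_data["id"], [])`.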
## Splits
We provide a `test` split with problems from late 2024 and early 2025. Please avoid training on these.
## Data fields
### General problem/contest fields
- `id` (str): unique problem ID
- `aliases` (list[str]): list of other problem IDs that are copies of this one (quite common between Div. 1 and Div. 2 contests, where some problems are in both)
- `index` (str): usually a letter, indicating the index of this problem in the contest
- `contest_id` (str): the ID of the contest this problem belongs to
- `contest_name` (str): name of the contest this problem belongs to
- `contest_type` (str): 'ICPC', 'CF' or 'IOI'
- `contest_start` (int): contest start time as a Unix timestamp
- `contest_start_year` (int): year of the contest
### Problem statement fields
- `time_limit` (float): execution time limit for each test case, in seconds
- `memory_limit` (float): execution memory limit for each test case, in megabytes
- `title` (str): problem title
- `description` (str): main problem description
- `input_format` (str): explanation of how test case inputs will be structured
- `output_format` (str): explanation of how test case outputs should be structured
- `interaction_format` (str): for interactive problems only (which our execution setup currently does not support); describes how the program should interact with the grader (which we do not have)
- `note` (str): short explanation of how the examples are solved
- `examples` (list[{input: str, output: str}]): example test cases that are shown in the problem statement
### Problem metadata
- `editorial` (str): explanation of the solution from the original problem authors (when available)
- `rating` (str): problem rating (difficulty).
- `tags` (list[str]): problem tags
### Test and execution data
- `testset_size` (int): number of tests in the full testset. This is obtained from the `passedTestCount` value of correct solutions to this problem
- `official_tests` (list[{input: str, output: str}]): all test cases from CodeForces that are not **truncated** (< 400 chars)
- `official_tests_complete` (bool): whether we have all the official test cases for this problem (no test case was truncated)
- `input_mode` (str): 'stdio' or 'file'. How the input and output should be sent to the program. stdio=standard input. file=read from 'input.txt' and write to 'output.txt'
- `generated_checker` (str): python program (checker.py) that should be run with `python checker.py input.txt correct_output.txt solution_output.txt` to validate the solution, when there are multiple possible answers to this problem (prints 0-1 or 0-100 to stdout)
- `executable` (bool): whether we have been able to run at least 3 human created solutions to this problem and had them pass the `official_tests` (submissions are in `open-r1/codeforces-submissions`)
- `generated_tests` (int): number of generated and validated additional tests created for improved verifiability. See "Downloading generated tests" above.
### RL/generation related fields
For the `verifiable-prompts` subset only:
- `language`: `python` or `cpp`. Each problem has a row with `python` and another with `cpp`
- `prompt`: fully formatted prompt asking the model to generate a solution that solves the problem within the given time and memory constraints, in language `language`. Ready to use for RL
## Loading the dataset
```python
from datasets import load_dataset
ds = load_dataset("open-r1/codeforces", split="train", name="default")
# or, for the verifiable subset:
ds = load_dataset("open-r1/codeforces", split="train", name="verifiable")
```
See other CodeForces related datasets in [this collection](https://huggingface.co/collections/open-r1/codeforces-68234ed24aa9d65720663bd2).
## Verifying problems
We recommend using our [**compile**](https://github.com/guipenedo/piston/blob/master/packages/codeforces/1.0.0/compile) and [**run**](https://github.com/guipenedo/piston/blob/master/packages/codeforces/1.0.0/run) scripts developed specifically for this dataset.
Here's an example of the payload to evaluate a problem using [piston](https://github.com/huggingface/open-r1/blob/main/slurm/piston/README.md), which runs these two scripts under the hood:
```python
import requests

source_code = "..."  # source_code is the model-generated code
endpoint = "http://piston_endpoint:piston_port"
extension, piston_language = "cpp", "cf_c++17"
# problem_data is a row from this dataset
test_case = problem_data['official_tests'][0]  # if this problem also has generated_tests, you should run those too
payload = {
    "language": piston_language,
    "version": "*",
    "files": [
        {
            "name": f"main.{extension}",
            "content": source_code
        },
        {
            "name": "input.txt",
            "content": test_case['input']
        },
        {
            "name": "correct_output.txt",
            "content": test_case['output']
        },
        # only include a checker when the problem has one
        *([{"name": "checker.py", "content": problem_data['generated_checker']}] if problem_data['generated_checker'] else []),
        {
            "name": "grader_config",
            "content": "\n".join(
                f"{key}={value}" for key, value in {
                    "TIME_LIMIT": problem_data['time_limit'],
                    "MEMORY_LIMIT": problem_data['memory_limit'],
                    "INPUT_MODE": problem_data['input_mode']
                }.items()
            )
        }
    ]
}
# the response body is JSON; parse it so the fields shown below can be indexed
result = requests.post(f"{endpoint}/execute", json=payload, headers={"Content-Type": "application/json"}).json()
# example correct result:
# {
# "compile": {
# "stdout": "",
# "stderr": "",
# "code": 0,
# "signal": null,
# "output": ""
# },
# "run": {
# "stdout": "1\n", <- this is the actual score. 0=wrong/TLE/MLE; 1=correct
# "stderr": "Output is correct (diff)\n",
# "code": 0,
# "signal": null,
# "output": "1\nOutput is correct (diff)\n"
# },
# "language": "c++",
# "version": "1.0.0"
# }
# example incorrect solution:
# {
# "compile": {
# "stdout": "Skipping compile - python3\n",
# "stderr": "",
# "code": 0,
# "signal": null,
# "output": "Skipping compile - python3\n"
# },
# "run": {
# "stdout": "0\n",
# "stderr": "Output isn't correct (checker)\n",
# "code": 0,
# "signal": null,
# "output": "0\nOutput isn't correct (checker)\n"
# },
# "language": "python3",
# "version": "1.0.0"
# }
# stdout is "1\n" for a correct answer, so strip whitespace before comparing
result_is_correct = bool(result) and 'compile' in result and result['compile']['code'] == 0 and result['run']['code'] == 0 and result['run']['stdout'].strip() == '1'
```
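
The final check can also be wrapped in a small helper that tolerates missing fields. This is a sketch, and the name `is_accepted` is an assumption; it treats only a score of `1` as accepted, as the snippet above does, and is exercised here on trimmed copies of the two example responses.

```python
def is_accepted(result: dict) -> bool:
    """Return True only if compilation and execution succeeded and the score is 1."""
    compile_info = result.get("compile") or {}
    run_info = result.get("run") or {}
    if compile_info.get("code") != 0 or run_info.get("code") != 0:
        return False
    return (run_info.get("stdout") or "").strip() == "1"


correct_result = {
    "compile": {"stdout": "", "stderr": "", "code": 0, "signal": None, "output": ""},
    "run": {"stdout": "1\n", "stderr": "Output is correct (diff)\n", "code": 0, "signal": None},
}
incorrect_result = {
    "compile": {"stdout": "Skipping compile - python3\n", "stderr": "", "code": 0, "signal": None},
    "run": {"stdout": "0\n", "stderr": "Output isn't correct (checker)\n", "code": 0, "signal": None},
}
print(is_accepted(correct_result), is_accepted(incorrect_result))  # True False
```

A solution would then count as solving a problem only if `is_accepted` holds for the responses of all of its official and generated test cases.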
## License
The dataset is licensed under the Open Data Commons Attribution License (ODC-By) 4.0.
## Citation
If you find CodeForces useful in your work, please consider citing it as:
```
@misc{penedo2025codeforces,
title={CodeForces},
author={Guilherme Penedo and Anton Lozhkov and Hynek Kydlíček and Loubna Ben Allal and Edward Beeching and Agustín Piqueres Lajarín and Quentin Gallouédec and Nathan Habib and Lewis Tunstall and Leandro von Werra},
year={2025},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/datasets/open-r1/codeforces}}
}
```
Provider: maas
Created: 2025-03-18