collinear-ai/yc-bench
Source: Hugging Face. Updated 2026-03-23. Indexed 2026-04-05.
Download link: https://hf-mirror.com/datasets/collinear-ai/yc-bench
Resource description:
---
pretty_name: "YC-Bench"
language:
- en
license: apache-2.0
size_categories:
- n<1K
task_categories:
- text-generation
tags:
- benchmark
- agents
- long-horizon
- simulation
- evaluation
citation: |
  @misc{collinear-ai2025ycbench,
    author = {{Collinear AI}},
    title = {{YC-Bench}: Your Company Bench — A Long-Horizon Coherence Benchmark for {LLM} Agents},
    year = {2025},
    howpublished = {\url{https://github.com/collinear-ai/yc-bench}}
  }
---
# YC-Bench
YC-Bench is a long-horizon agent benchmark. The LLM plays the CEO of an AI startup for one simulated year, using CLI tools against a deterministic discrete-event simulation.
It tests employee allocation, prestige specialization, cash-flow management, deadline risk, and adversarial client detection, all sustained over hundreds of turns.
Source: [github.com/collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench)
## Evaluation
Download `run_yc_bench_job.py` from this repo, then:
```bash
hf jobs uv run run_yc_bench_job.py \
--flavor cpu-basic --timeout 3h \
--secrets OPENAI_API_KEY \
-- openai/gpt-5.4
```
Or run locally: `uv run run_yc_bench_job.py openai/gpt-5.4`
The script runs the medium preset on seeds 1-3 and reports the average final funds. Pass the `--secrets` flag that matches your provider (`ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, etc.). Any [LiteLLM-compatible](https://docs.litellm.ai/docs/providers) model string works.
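As a rough illustration of how the secret name lines up with the model string, the sketch below assembles the same `hf jobs` command for a given provider. The helper `build_job_command` and the prefix-to-secret mapping are assumptions made for this example and are not part of the repo; only the three key names and the command layout come from the text above. `anthropic/some-model` is a placeholder model string.

```python
import shlex

# Assumed pairing of LiteLLM provider prefix with the secret name to pass.
# The key names appear in this README; the prefix mapping is an assumption.
PROVIDER_SECRETS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def build_job_command(model: str) -> list[str]:
    """Return the `hf jobs` argv for a LiteLLM-style model string (hypothetical helper)."""
    provider = model.split("/", 1)[0]
    if provider not in PROVIDER_SECRETS:
        raise ValueError(f"no known secret for provider {provider!r}")
    return [
        "hf", "jobs", "uv", "run", "run_yc_bench_job.py",
        "--flavor", "cpu-basic", "--timeout", "3h",
        "--secrets", PROVIDER_SECRETS[provider],
        "--", model,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_job_command("anthropic/some-model")))
```

Printing the command rather than executing it keeps the sketch side-effect free; for providers outside this small mapping, consult the LiteLLM provider list for the expected key name.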
## Scoring
**Average final funds (USD) across seeds 1, 2, and 3.** A run that ends bankrupt scores $0.
```
score = average(max(0, final_funds_cents / 100) for each seed)
```
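The formula can be read as a small Python function. This is a minimal sketch, not the benchmark's own scoring code: the function name `yc_bench_score` and the example balances are hypothetical, and it assumes final funds are reported in cents per seed, as the `final_funds_cents` name suggests.

```python
from statistics import mean


def yc_bench_score(final_funds_cents_by_seed: dict[int, int]) -> float:
    """Average final funds in USD across seeds; negative balances are clamped to $0."""
    return mean(max(0.0, cents / 100) for cents in final_funds_cents_by_seed.values())


# Hypothetical final balances in cents for seeds 1, 2, 3; seed 2 ended bankrupt.
example = {1: 12_500_000, 2: -340_000, 3: 4_000_000}
print(yc_bench_score(example))  # 55000.0
```

Because the clamp is applied per seed before averaging, one bankrupt run pulls the score down but cannot make it negative.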
## Submitting to leaderboard
Open a PR on the model's Hugging Face repo that adds `.eval_results/yc-bench.yaml`. See [`sample_eval_result.yaml`](sample_eval_result.yaml) in this repo for the format.
## License
Apache 2.0
Provider: collinear-ai



