collinear-ai/yc-bench

Name: collinear-ai/yc-bench
Creator: collinear-ai
Published: 2026-03-23 18:17:49
License: Apache 2.0

Source: Hugging Face. Updated 2026-03-23; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/collinear-ai/yc-bench


Description:

---
pretty_name: "YC-Bench"
language:
- en
license: apache-2.0
size_categories:
- n<1K
task_categories:
- text-generation
tags:
- benchmark
- agents
- long-horizon
- simulation
- evaluation
citation: |
  @misc{collinear-ai2025ycbench,
    author = {{Collinear AI}},
    title = {{YC-Bench}: Your Company Bench — A Long-Horizon Coherence Benchmark for {LLM} Agents},
    year = {2025},
    howpublished = {\url{https://github.com/collinear-ai/yc-bench}}}
---

# YC-Bench

Long-horizon agent benchmark. The LLM plays CEO of an AI startup for 1 simulated year via CLI tool use against a deterministic discrete-event simulation. Tests: employee allocation, prestige specialization, cash flow, deadline risk, and adversarial client detection, all sustained over hundreds of turns.

Source: [github.com/collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench)

## Evaluation

Download `run_yc_bench_job.py` from this repo, then:

```bash
hf jobs uv run run_yc_bench_job.py \
  --flavor cpu-basic --timeout 3h \
  --secrets OPENAI_API_KEY \
  -- openai/gpt-5.4
```

Or run locally: `uv run run_yc_bench_job.py openai/gpt-5.4`

Runs medium preset on seeds 1-3 and reports average final funds. Pass the appropriate `--secrets` flag for your provider (`ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, etc). Any [LiteLLM-compatible](https://docs.litellm.ai/docs/providers) model string works.

## Scoring

**Average final funds (USD) across seeds 1, 2, 3.** Bankrupt = $0.

```
score = average(max(0, final_funds_cents / 100) for each seed)
```

## Submitting to leaderboard

Open a PR on the model's HF repo adding `.eval_results/yc-bench.yaml`. See [`sample_eval_result.yaml`](sample_eval_result.yaml) in this repo for the format.

## License

Apache 2.0
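As a worked illustration of the scoring rule above, here is a minimal Python sketch. The function name `yc_bench_score`, the dict-of-seeds input shape, and the sample balances are assumptions made for this example; they are not taken from `run_yc_bench_job.py`, which may compute and report the score differently.

```python
def yc_bench_score(final_funds_cents_by_seed):
    """Average final funds in USD across seeds.

    Assumption: each value is a final balance in cents, and a negative
    balance (bankruptcy) is clamped to $0, as the scoring formula states.
    """
    if not final_funds_cents_by_seed:
        raise ValueError("need at least one seed result")
    usd_per_seed = [
        max(0.0, cents / 100) for cents in final_funds_cents_by_seed.values()
    ]
    return sum(usd_per_seed) / len(usd_per_seed)


# Hypothetical balances for seeds 1-3 (illustrative numbers, not real runs).
example = {1: 12_500_000, 2: -300_000, 3: 4_000_000}
print(yc_bench_score(example))  # 55000.0
```

In this example seed 2 ends bankrupt, so it contributes $0 while still counting toward the average: (125000 + 0 + 40000) / 3 = 55000.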


Provider:

collinear-ai
