
collinear-ai/yc-bench

Source: Hugging Face | Updated 2026-03-23 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/collinear-ai/yc-bench
Resource description:
---
pretty_name: "YC-Bench"
language:
- en
license: apache-2.0
size_categories:
- n<1K
task_categories:
- text-generation
tags:
- benchmark
- agents
- long-horizon
- simulation
- evaluation
citation: |
  @misc{collinear-ai2025ycbench,
    author = {{Collinear AI}},
    title = {{YC-Bench}: Your Company Bench — A Long-Horizon Coherence Benchmark for {LLM} Agents},
    year = {2025},
    howpublished = {\url{https://github.com/collinear-ai/yc-bench}}}
---

# YC-Bench

Long-horizon agent benchmark. The LLM plays CEO of an AI startup for 1 simulated year via CLI tool use against a deterministic discrete-event simulation. Tests: employee allocation, prestige specialization, cash flow, deadline risk, adversarial client detection, all sustained over hundreds of turns.

Source: [github.com/collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench)

## Evaluation

Download `run_yc_bench_job.py` from this repo, then:

```bash
hf jobs uv run run_yc_bench_job.py \
  --flavor cpu-basic --timeout 3h \
  --secrets OPENAI_API_KEY \
  -- openai/gpt-5.4
```

Or run locally: `uv run run_yc_bench_job.py openai/gpt-5.4`

Runs medium preset on seeds 1-3 and reports average final funds. Pass the appropriate `--secrets` flag for your provider (`ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, etc). Any [LiteLLM-compatible](https://docs.litellm.ai/docs/providers) model string works.

## Scoring

**Average final funds (USD) across seeds 1, 2, 3.** Bankrupt = $0.

```
score = average(max(0, final_funds_cents / 100) for each seed)
```

## Submitting to leaderboard

Open a PR on the model's HF repo adding `.eval_results/yc-bench.yaml`. See [`sample_eval_result.yaml`](sample_eval_result.yaml) in this repo for the format.

## License

Apache 2.0
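
As an illustration of the scoring rule stated under Scoring, the following is a minimal Python sketch. The function name `yc_bench_score` and the sample balances are hypothetical and are not taken from the benchmark's code; the only thing assumed from the card is the formula itself (convert cents to USD, floor each seed at $0, then average).

```python
def yc_bench_score(final_funds_cents_by_seed):
    """Average final funds in USD across seeds.

    Hypothetical helper mirroring the card's formula:
    score = average(max(0, final_funds_cents / 100) for each seed)
    """
    if not final_funds_cents_by_seed:
        raise ValueError("at least one seed result is required")
    # Convert cents to USD and treat a negative (bankrupt) balance as $0.
    usd_per_seed = [max(0.0, cents / 100) for cents in final_funds_cents_by_seed]
    return sum(usd_per_seed) / len(usd_per_seed)


# Made-up final balances (in cents) for seeds 1, 2, 3; the second run went bankrupt.
example_results = [12_500_000, -300_000, 4_000_000]
print(yc_bench_score(example_results))  # 55000.0
```

Under this reading, a bankrupt seed still counts toward the denominator, so a single bankruptcy pulls the average down rather than being dropped from it.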

Provider:
collinear-ai