
xln3/bamboo-papers

Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/xln3/bamboo-papers
Description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- benchmark
- paper-reproduction
- code-generation
size_categories:
- 1K<n<10K
---

# BAMBOO: Benchmark for Autonomous ML Build-and-Output Observation

A large-scale benchmark for evaluating AI agents' ability to reproduce ML research papers using the authors' original code.

## Dataset Summary

| Metric | Value |
|--------|-------|
| **Total papers** | 6,148 |
| **Papers with PDF** | 5,495 (89%) |
| **Papers with structured MD** | 3,983 (64%) |
| **Venues** | ICML, ICLR, NeurIPS, CVPR, ICCV, ACL, EMNLP, AAAI, ICRA |
| **Year** | 2025 |
| **Code coverage** | 100% (all papers have verified code_url + code_commit) |
| **Abstracts** | 100% |
| **Difficulty scores** | 100% |

## Files

- `bamboo_dataset.json`: Full paper metadata (6,148 entries)
- `paper_pdfs/`: Original paper PDFs (5,495 files, ~32GB)
- `paper_markdowns/`: Markdown extracted with the MinerU `hybrid-auto-engine` backend (3,983 files)

## PDF Extraction

PDFs are extracted using [MinerU](https://github.com/opendatalab/MinerU) v2.7.6 with the `hybrid-auto-engine` backend (highest quality VLM-based extraction). This preserves:

- Correct paragraph ordering
- Table structure as markdown
- Mathematical formulas
- Figure references

## Venue Breakdown (papers with MD)

| Venue | Papers |
|-------|--------|
| ICML | 1,109 |
| ICLR | 669 |
| ICCV | 501 |
| CVPR | 408 |
| NeurIPS | 359 |
| ACL | 327 |
| EMNLP | 294 |
| AAAI | 275 |
| ICRA | 41 |

## Usage

```python
from huggingface_hub import hf_hub_download
import json

# Download metadata
path = hf_hub_download("xln3/bamboo-papers", "bamboo_dataset.json", repo_type="dataset")
with open(path, encoding="utf-8") as f:
    papers = json.load(f)

# Filter papers with markdown
papers_with_md = [p for p in papers if p["has_md"]]
print(f"{len(papers_with_md)} papers with structured markdown")

# Download a specific paper's markdown
md_path = hf_hub_download("xln3/bamboo-papers", "paper_markdowns/bamboo-00001.md", repo_type="dataset")
```
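
As a rough sketch, the per-venue counts in the Venue Breakdown table could be recomputed from the loaded metadata along these lines. The `venue` key is an assumption (the card only shows the `has_md` field), and the sample records are made up for illustration, so adjust the key to match the actual JSON. The last lines only check that the published per-venue numbers add up to the 3,983 markdown files reported in the summary.

```python
from collections import Counter

# Per-venue counts copied from the "Venue Breakdown (papers with MD)" table.
PUBLISHED_MD_COUNTS = {
    "ICML": 1109, "ICLR": 669, "ICCV": 501, "CVPR": 408, "NeurIPS": 359,
    "ACL": 327, "EMNLP": 294, "AAAI": 275, "ICRA": 41,
}


def md_counts_by_venue(papers, venue_key="venue"):
    """Count entries whose has_md flag is truthy, grouped by venue.

    `venue_key` is hypothetical: the dataset card does not name the
    metadata field that stores the venue.
    """
    return Counter(p[venue_key] for p in papers if p.get("has_md"))


# Made-up records for illustration only (not real dataset entries).
sample = [
    {"venue": "ICML", "has_md": True},
    {"venue": "ICML", "has_md": False},
    {"venue": "ICRA", "has_md": True},
]
counts = md_counts_by_venue(sample)
print(dict(counts))

# Sanity check: the published table sums to the reported markdown file count.
total_md = sum(PUBLISHED_MD_COUNTS.values())
print(total_md)
```

With the real metadata, passing `papers` from the Usage snippet in place of `sample` should give counts comparable to the table, provided the venue field name matches.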
Provided by:
xln3