
Labradorlabs/bsca-sca-benchmark-v1

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Labradorlabs/bsca-sca-benchmark-v1
Description:
---
license: mit
task_categories:
- feature-extraction
language:
- code
tags:
- binary-analysis
- decompilation
- software-composition-analysis
- sca
- benchmark
- security
pretty_name: BSCA SCA Benchmark v1
size_categories:
- n<1K
---

# BSCA SCA Benchmark v1

Benchmark dataset for evaluating **Binary Software Composition Analysis (SCA)**: identifying which open-source libraries are present in a stripped binary.

## Dataset Description

10 real-world stripped ELF binaries with ground-truth component labels. Used to evaluate embedding-based SCA systems that match binary functions against open-source code.

> **Attribution:** The binaries and ground-truth labels in this dataset are derived and refined from the artifact of the paper:
> *"Are We There Yet? Filling the Gap Between Binary Similarity Analysis and Binary Software Composition Analysis"*
> Artifact: https://sites.google.com/view/bsa2bsca/home/artifact

### Fields

| Field | Type | Description |
|---|---|---|
| `binary_name` | string | Filename of the binary (e.g. `db_bench.bin`) |
| `binary` | bytes | Raw binary content (stripped ELF / shared object) |
| `components` | list[dict] | Ground-truth open-source components included in the binary |
| `components[].source` | string | Git repository URL of the component |
| `components[].tag` | list[string] | Version tags (empty list = version unknown) |

## Binaries

| Binary | Size | Components |
|---|---|---|
| controlblock.bin | 683 KB | 8 |
| db_bench.bin | 5.5 MB | 15 |
| dosbox_core_libretro.so | 7.1 MB | 19 |
| example.bin | 773 KB | 6 |
| hyriseSystemTest.bin | 4.0 MB | 10 |
| kvrocks.bin | 11.9 MB | 9 |
| pagespeed_automatic_test.bin | 29.1 MB | 34 |
| prometheus_test.bin | 2.3 MB | 5 |
| replay-sorcery.bin | 5.5 MB | 11 |
| turbobench.bin | 4.4 MB | 30 |

- **Architecture:** x86-64
- **Compiler:** GCC (various versions)
- **Optimization:** -O2
- **Strip:** Fully stripped (no symbol names)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Labradorlabs/bsca-sca-benchmark-v1", split="train")

for row in ds:
    print(row["binary_name"])
    print([c["source"] for c in row["components"]])
    # row["binary"] contains raw binary bytes
```

### Evaluation

```python
# Write binary to disk for analysis
import tempfile, os
from datasets import load_dataset

ds = load_dataset("Labradorlabs/bsca-sca-benchmark-v1", split="train")
row = ds[0]  # db_bench

with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(row["binary"])
    bin_path = f.name

# Run your SCA tool on bin_path and compare against row["components"]
```

## Evaluation Metric

Component-level **Precision / Recall / F1** on identifying included open-source libraries:

- **Predicted:** set of repository URLs your SCA tool identifies
- **Ground truth:** `components[].source` URLs

```
Precision = |predicted ∩ ground_truth| / |predicted|
Recall    = |predicted ∩ ground_truth| / |ground_truth|
F1        = 2 * P * R / (P + R)
```

## Related Models

- [`Labradorlabs/bsca-bge-micro-v2-contrastive-v1`](https://huggingface.co/Labradorlabs/bsca-bge-micro-v2-contrastive-v1)
- [`Labradorlabs/bsca-all-minilm-contrastive-v1`](https://huggingface.co/Labradorlabs/bsca-all-minilm-contrastive-v1)

## Citation

If you use this dataset, please also cite the original paper from which the binaries and ground truth were derived:

```
@inproceedings{bsa2bsca,
  title={Are We There Yet? Filling the Gap Between Binary Similarity Analysis and Binary Software Composition Analysis},
  url={https://sites.google.com/view/bsa2bsca/home/artifact}
}
```

```
@misc{bsca-benchmark-2026,
  title={BSCA SCA Benchmark v1},
  author={Labradorlabs},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Labradorlabs/bsca-sca-benchmark-v1}
}
```
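
The component-level Precision / Recall / F1 defined under Evaluation Metric can be computed for a single binary roughly as follows. This is a minimal sketch, not the benchmark's official scorer: the function name `component_prf` is hypothetical, the example URLs are placeholders rather than entries from the dataset, and returning 0.0 for an empty denominator is an assumption, since the dataset card does not define that case. URLs are compared as exact strings, so any normalization (trailing slashes, `.git` suffixes) would have to happen before the call.

```python
def component_prf(predicted, ground_truth):
    """Component-level precision, recall, and F1 over sets of repository URLs."""
    pred, truth = set(predicted), set(ground_truth)
    hits = len(pred & truth)  # |predicted ∩ ground_truth|

    # Assumption: an empty denominator yields 0.0 (not specified by the dataset card).
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder URLs for illustration only; they are not taken from the dataset.
truth = ["https://github.com/example/liba", "https://github.com/example/libb"]
predicted = ["https://github.com/example/liba", "https://github.com/example/libc"]

precision, recall, f1 = component_prf(predicted, truth)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

With a dataset row, the ground-truth argument would be `[c["source"] for c in row["components"]]`, and the predicted argument would be whatever set of repository URLs your SCA tool reports for that binary.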
Provider:
Labradorlabs