Labradorlabs/bsca-sca-benchmark-v1
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Labradorlabs/bsca-sca-benchmark-v1
Description:
---
license: mit
task_categories:
- feature-extraction
language:
- code
tags:
- binary-analysis
- decompilation
- software-composition-analysis
- sca
- benchmark
- security
pretty_name: BSCA SCA Benchmark v1
size_categories:
- n<1K
---
# BSCA SCA Benchmark v1
Benchmark dataset for evaluating **Binary Software Composition Analysis (SCA)** — identifying which open-source libraries are present in a stripped binary.
## Dataset Description
10 real-world stripped ELF binaries with ground-truth component labels. Used to evaluate embedding-based SCA systems that match binary functions against open-source code.
> **Attribution:** The binaries and ground-truth labels in this dataset are derived and refined from the artifact of the paper:
> *"Are We There Yet? Filling the Gap Between Binary Similarity Analysis and Binary Software Composition Analysis"*
> — Artifact: https://sites.google.com/view/bsa2bsca/home/artifact
### Fields
| Field | Type | Description |
|---|---|---|
| `binary_name` | string | Filename of the binary (e.g. `db_bench.bin`) |
| `binary` | bytes | Raw binary content (stripped ELF / shared object) |
| `components` | list[dict] | Ground-truth open-source components included in the binary |
| `components[].source` | string | Git repository URL of the component |
| `components[].tag` | list[string] | Version tags (empty list = version unknown) |
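
As a rough illustration of the schema above, one row could be modelled and sanity-checked as follows. The `Component` and `Row` type names and the `validate_row` helper are hypothetical and not part of the dataset, and the sample values are made up for the example.

```python
from typing import List, TypedDict


class Component(TypedDict):
    source: str      # Git repository URL of the component
    tag: List[str]   # version tags; an empty list means the version is unknown


class Row(TypedDict):
    binary_name: str
    binary: bytes
    components: List[Component]


def validate_row(row: Row) -> None:
    """Raise ValueError if a row does not match the documented field types."""
    if not isinstance(row["binary_name"], str):
        raise ValueError("binary_name must be a string")
    if not isinstance(row["binary"], (bytes, bytearray)):
        raise ValueError("binary must be bytes")
    for comp in row["components"]:
        if not isinstance(comp["source"], str):
            raise ValueError("components[].source must be a string")
        if not all(isinstance(t, str) for t in comp["tag"]):
            raise ValueError("components[].tag must be a list of strings")


# Made-up sample row for illustration only; not taken from the dataset.
sample: Row = {
    "binary_name": "example.bin",
    "binary": b"\x7fELF",
    "components": [{"source": "https://example.com/some/repo.git", "tag": []}],
}
validate_row(sample)
```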
## Binaries
| Binary | Size | Components |
|---|---|---|
| controlblock.bin | 683 KB | 8 |
| db_bench.bin | 5.5 MB | 15 |
| dosbox_core_libretro.so | 7.1 MB | 19 |
| example.bin | 773 KB | 6 |
| hyriseSystemTest.bin | 4.0 MB | 10 |
| kvrocks.bin | 11.9 MB | 9 |
| pagespeed_automatic_test.bin | 29.1 MB | 34 |
| prometheus_test.bin | 2.3 MB | 5 |
| replay-sorcery.bin | 5.5 MB | 11 |
| turbobench.bin | 4.4 MB | 30 |
- **Architecture:** x86-64
- **Compiler:** GCC (various versions)
- **Optimization:** -O2
- **Strip:** Fully stripped (no symbol names)
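
The architecture claim above can be spot-checked from the ELF header alone. The sketch below is one way to do that with the standard library; the `describe_elf` helper is hypothetical, and it reads only the identification bytes plus the `e_type` and `e_machine` fields, so it says nothing about the compiler, optimization level, or whether symbols were stripped.

```python
import struct

EM_X86_64 = 62   # e_machine value for x86-64 in the ELF specification
ELFCLASS64 = 2   # EI_CLASS value for 64-bit objects


def describe_elf(data: bytes) -> dict:
    """Parse the few ELF header fields needed to confirm the target."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    byte_order = "<" if ei_data == 1 else ">"   # EI_DATA: 1 = little-endian
    # e_type and e_machine are two 16-bit fields starting at offset 16
    e_type, e_machine = struct.unpack_from(byte_order + "HH", data, 16)
    return {
        "is_64bit": ei_class == ELFCLASS64,
        "is_x86_64": e_machine == EM_X86_64,
        "e_type": e_type,   # 2 = ET_EXEC, 3 = ET_DYN (shared object or PIE)
    }
```

With a loaded row, `describe_elf(row["binary"])` would be expected to report `is_x86_64` as true if the binary matches the description above.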
## Usage
```python
from datasets import load_dataset

# Load the benchmark (all binaries are in the "train" split)
ds = load_dataset("Labradorlabs/bsca-sca-benchmark-v1", split="train")

for row in ds:
    print(row["binary_name"])
    # Repository URLs of the ground-truth components
    print([c["source"] for c in row["components"]])
    # row["binary"] contains raw binary bytes
```
### Evaluation
```python
# Write binary to disk for analysis
import tempfile

from datasets import load_dataset

ds = load_dataset("Labradorlabs/bsca-sca-benchmark-v1", split="train")
row = ds[0]  # db_bench (confirm with row["binary_name"])

# delete=False keeps the file on disk after the with-block closes it
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(row["binary"])
    bin_path = f.name

# Run your SCA tool on bin_path and compare against row["components"]
```
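
To run a tool over every binary rather than a single row, the same write-to-disk step can be wrapped in a loop. This is a minimal sketch: `run_on_rows` is a hypothetical helper, and `sca_tool` stands for whatever callable wraps your own analyzer, assumed here to take a file path and return a list of repository URLs.

```python
import os
import tempfile
from typing import Callable, Dict, Iterable, List


def run_on_rows(
    rows: Iterable[dict],
    sca_tool: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Write each binary to a temporary directory and collect predictions."""
    predictions: Dict[str, List[str]] = {}
    with tempfile.TemporaryDirectory() as tmp:
        for row in rows:
            # basename() guards against path separators in the stored name
            name = os.path.basename(row["binary_name"])
            path = os.path.join(tmp, name)
            with open(path, "wb") as f:
                f.write(row["binary"])
            predictions[row["binary_name"]] = sca_tool(path)
    # The temporary directory and its files are removed on exit
    return predictions
```

Calling `run_on_rows(ds, my_tool)` would return a mapping from `binary_name` to the URLs predicted for that binary, where `my_tool` is your own wrapper.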
## Evaluation Metric
Component-level **Precision / Recall / F1** on identifying included open-source libraries:
- **Predicted:** set of repository URLs your SCA tool identifies
- **Ground truth:** `components[].source` URLs
```
Precision = |predicted ∩ ground_truth| / |predicted|
Recall    = |predicted ∩ ground_truth| / |ground_truth|
F1        = 2 * Precision * Recall / (Precision + Recall)
```
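
The formulas above translate directly into set operations. The sketch below rests on two assumptions the card does not state: URLs are normalized before comparison (lower-cased, with any trailing `/` and `.git` removed), and a metric with a zero denominator is reported as 0.0. Both `normalize_url` and `component_prf` are hypothetical helper names.

```python
from typing import Iterable, Tuple


def normalize_url(url: str) -> str:
    """Assumed normalization: lower-case, drop trailing '/' and '.git'."""
    url = url.strip().lower().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    return url


def component_prf(
    predicted: Iterable[str],
    ground_truth: Iterable[str],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of repository URLs."""
    pred = {normalize_url(u) for u in predicted}
    truth = {normalize_url(u) for u in ground_truth}
    hits = len(pred & truth)
    # Assumption: an undefined ratio (empty set) is scored as 0.0
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For one row, the ground-truth argument would be `[c["source"] for c in row["components"]]`. How per-binary scores are aggregated across the 10 binaries is not specified here, so any averaging scheme is a choice left to the evaluator.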
## Related Models
- [`Labradorlabs/bsca-bge-micro-v2-contrastive-v1`](https://huggingface.co/Labradorlabs/bsca-bge-micro-v2-contrastive-v1)
- [`Labradorlabs/bsca-all-minilm-contrastive-v1`](https://huggingface.co/Labradorlabs/bsca-all-minilm-contrastive-v1)
## Citation
If you use this dataset, please also cite the original paper from which the binaries and ground truth were derived:
```
@inproceedings{bsa2bsca,
title={Are We There Yet? Filling the Gap Between Binary Similarity Analysis and Binary Software Composition Analysis},
url={https://sites.google.com/view/bsa2bsca/home/artifact}
}
```
```
@misc{bsca-benchmark-2026,
title={BSCA SCA Benchmark v1},
author={Labradorlabs},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Labradorlabs/bsca-sca-benchmark-v1}
}
```
Provided by:
Labradorlabs



