Hauser7733/VenusX_Frag_BindI_MF50_MCQ4
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Hauser7733/VenusX_Frag_BindI_MF50_MCQ4
---
license: cc-by-4.0
task_categories:
- multiple-choice
- question-answering
language:
- en
tags:
- protein
- bioinformatics
- interpro
- mcq
- venusx
size_categories:
- 1K<n<10K
---
# VenusX Fragment MCQ — BindI (MF50)
4-choice multiple-choice question reformulation of the **VenusX Fragment-level
BindI** sub-task from the paper [VenusX: Unlocking Fine-Grained Functional Understanding of Proteins](https://arxiv.org/abs/2505.11812) (ICLR 2026).
This fork adapts the original N-way classification task into a 4-choice MCQ
so that **zero-shot LLMs** can be evaluated fairly. The original task required
the model to output one of 76 InterPro class labels—infeasible for a
CausalLM that cannot know which IPR IDs are in the benchmark's label subspace.
## Schema
Each sample is one multiple-choice question with exactly **one** correct answer.
| Field | Type | Description |
|-------|------|-------------|
| `uid` | str | Original VenusX sample UID |
| `seq_fragment` | str | The protein fragment amino acid sequence |
| `annotation` | str | `"BindI"` (sub-task name) |
| `interpro_label` | int | Original VenusX integer label (preserved for compatibility) |
| `correct_ipr` | str | The correct InterPro accession (e.g. `IPR000169`) |
| `correct_letter` | str | `"A"` / `"B"` / `"C"` / `"D"` — the letter whose option matches `correct_ipr` |
| `option_{a,b,c,d}_ipr` | str | InterPro accession for each option |
| `option_{a,b,c,d}_desc` | str | Human-readable description from InterPro `entry.list` |
| `distractor_source` | str | How the 3 distractors were picked: `hierarchy`, `mixed`, or `pool` (see below) |
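
As an illustration of how these fields fit together, the following minimal sketch renders one row as a plain-text prompt. The function name `format_prompt` and the prompt layout are assumptions made for this example, not part of the dataset or of SiEval; the row values are copied from the example given in this card, and only the fields the function needs are included.

```python
from typing import Mapping

LETTERS = ("a", "b", "c", "d")


def format_prompt(row: Mapping[str, str]) -> str:
    """Render one MCQ row as text (hypothetical layout, not the official prompt)."""
    lines = [f"Fragment: {row['seq_fragment']}"]
    for letter in LETTERS:
        ipr = row[f"option_{letter}_ipr"]
        desc = row[f"option_{letter}_desc"]
        lines.append(f"{letter.upper()}) {ipr} - {desc}")
    return "\n".join(lines)


# Values taken from the example in this card; other schema fields are omitted.
example_row = {
    "seq_fragment": "IHCIAGLGRTP",
    "option_a_ipr": "IPR033694",
    "option_a_desc": "Pyroglutamyl peptidase I, Cys active site",
    "option_b_ipr": "IPR023411",
    "option_b_desc": "Ribonuclease A, active site",
    "option_c_ipr": "IPR016130",
    "option_c_desc": "Protein-tyrosine phosphatase, active site",
    "option_d_ipr": "IPR000169",
    "option_d_desc": "Cysteine peptidase, cysteine active site",
    "correct_ipr": "IPR016130",
    "correct_letter": "C",
}

print(format_prompt(example_row))
```

Rows loaded with the Hugging Face `datasets` library behave like dictionaries, so the same function should work on them unchanged, provided the column names match the table above.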
## How the MCQ is constructed
For each sample, we start with one golden InterPro ID (the original label) and
pick 3 distractors via a **2-tier fallback**:
1. **Tier 1 — InterPro hierarchy siblings**: If the golden IPR is in the
InterPro hierarchy tree, take true siblings (same parent, not the golden
itself, not any ancestor/descendant of the golden). We use up to 3.
2. **Tier 2 — Same sub-task label pool (random)**: Fill the remaining slots
by random sampling from the full label pool of the same sub-task,
excluding the golden, all its ancestors/descendants, and any Tier 1 picks.
All 4 options (1 golden + 3 distractors) are then shuffled to randomize
letter positions (A/B/C/D), using a **deterministic per-sample seed** derived
from `{annotation}:{split}:{uid}:{golden}` so every dataset build is
bit-identical and reviewers can independently verify each MCQ.
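
The deterministic shuffle can be sketched as follows. This is an assumption-laden illustration: the card only states that the seed is derived from `{annotation}:{split}:{uid}:{golden}`, so hashing that string with SHA-256 is a choice made for this example, and `sample_seed` and `shuffle_options` are hypothetical names. The actual `build_mcq.py` may derive the seed differently, so this sketch is not expected to reproduce the dataset's letter assignments.

```python
import hashlib
import random
from typing import List, Sequence, Tuple


def sample_seed(annotation: str, split: str, uid: str, golden: str) -> int:
    """Derive a stable integer seed from the per-sample key (assumed scheme)."""
    key = f"{annotation}:{split}:{uid}:{golden}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shuffle_options(
    golden: str, distractors: Sequence[str], seed: int
) -> Tuple[List[str], str]:
    """Shuffle 1 golden + 3 distractors and return (options, correct_letter)."""
    options = [golden, *distractors]
    random.Random(seed).shuffle(options)  # private RNG, no global state
    correct_letter = "ABCD"[options.index(golden)]
    return options, correct_letter


# "test" and "uid-0" are placeholder values for illustration only.
seed = sample_seed("BindI", "test", "uid-0", "IPR016130")
options, letter = shuffle_options(
    "IPR016130", ["IPR033694", "IPR023411", "IPR000169"], seed
)
print(options, letter)
```

A cryptographic hash is used here instead of Python's built-in `hash()` because string hashing is salted per process by default, which would make the seed differ between runs.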
The `distractor_source` column records which strategy was used:
- `hierarchy` — all 3 distractors are InterPro siblings
- `mixed` — some are siblings, some are random pool samples
- `pool` — all 3 are random pool samples (most common for
`Active_site` / `Binding_site` / `Conserved_site` types, which have **no**
InterPro hierarchy per EBI convention)
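
The two-tier fallback and the resulting `distractor_source` label can be sketched as below. This is a simplified reading of the description above, not the code in `build_mcq.py`: the function name, its parameters, and the choice to sample randomly when more than three siblings exist are all assumptions, and the accessions in the usage lines are made-up placeholders.

```python
import random
from typing import Iterable, List, Set, Tuple


def pick_distractors(
    golden: str,
    siblings: Iterable[str],
    pool: Iterable[str],
    related: Set[str],
    rng: random.Random,
    k: int = 3,
) -> Tuple[List[str], str]:
    """Pick k distractors: hierarchy siblings first, then random pool samples.

    `related` holds the ancestors and descendants of `golden`, which are
    never allowed as distractors.
    """
    excluded = set(related) | {golden}

    # Tier 1: true siblings, at most k of them (random subset is an assumption).
    tier1 = [s for s in sorted(set(siblings)) if s not in excluded]
    if len(tier1) > k:
        tier1 = rng.sample(tier1, k)

    # Tier 2: fill the remaining slots from the sub-task label pool.
    remaining = k - len(tier1)
    candidates = sorted(set(pool) - excluded - set(tier1))
    tier2 = rng.sample(candidates, remaining) if remaining else []

    if not tier1:
        source = "pool"
    elif not tier2:
        source = "hierarchy"
    else:
        source = "mixed"
    return tier1 + tier2, source


# Placeholder accessions, for illustration only.
rng = random.Random(0)
picks, source = pick_distractors(
    "IPR_G", ["IPR_S1"], ["IPR_P1", "IPR_P2", "IPR_P3"], {"IPR_ANC"}, rng
)
print(picks, source)
```

Note that `rng.sample` raises `ValueError` if the pool has fewer eligible entries than open slots; a real build script would need to decide how to handle that case.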
## Why descriptions are included
The original free-text task expected the LLM to directly output an IPR ID like
`IPR019757`. This is unfair because the LLM has no way to know which specific
IPRs are in the benchmark's small label subspace. Our MCQ format exposes the
4 candidate options **with human-readable names** (e.g.
`IPR019757 — Peptidase S26A, signal peptidase I, lysine active site`) so that
the LLM can use its biological knowledge to match the fragment's features
against the candidate functional descriptions.
## Example
```
Fragment: IHCIAGLGRTP
A) IPR033694 — Pyroglutamyl peptidase I, Cys active site
B) IPR023411 — Ribonuclease A, active site
C) IPR016130 — Protein-tyrosine phosphatase, active site ← correct
D) IPR000169 — Cysteine peptidase, cysteine active site
[ANSWER]C[/ANSWER]
```
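
A model completion in the format shown above could be reduced to a letter with a small parser such as the one below. This is a sketch under assumptions: the helper name `extract_letter`, the tolerance for whitespace inside the tags, and the choice to take the last tagged answer when several appear are illustrative decisions, not documented behavior of this dataset or of SiEval.

```python
import re
from typing import Optional

# Matches [ANSWER]X[/ANSWER] where X is one of A, B, C, D.
ANSWER_RE = re.compile(r"\[ANSWER\]\s*([ABCD])\s*\[/ANSWER\]")


def extract_letter(completion: str) -> Optional[str]:
    """Return the last tagged answer letter, or None if no tag is found."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


print(extract_letter("The motif fits a phosphatase. [ANSWER]C[/ANSWER]"))
print(extract_letter("I am not sure."))
```

Returning `None` for unparseable completions lets the scorer count them as wrong rather than silently dropping them.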
## Build Reproducibility
This dataset is fully reproducible from the included build scripts and
reference files:
```
scripts/
parse_interpro.py # Parses InterPro flat files into a queryable cache
build_mcq.py # Builds MCQ samples with 2-tier distractor fallback
reference/
entry.list # InterPro entries dump (downloaded 2026-04-09)
ParentChildTreeFile.txt # InterPro hierarchy tree (downloaded 2026-04-09)
label_pool.json # Union label pool across all 5 sub-tasks
```
To rebuild:
```bash
# 1. Refresh InterPro flat files (optional — pinned versions included)
curl -O https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/entry.list
curl -O https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/ParentChildTreeFile.txt
# 2. Parse into cache
python scripts/parse_interpro.py
# 3. Build MCQ (loads original VenusX Fragment datasets from AI4Protein/*)
python scripts/build_mcq.py
```
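
For orientation, the sibling lookup that Tier 1 relies on could be derived from the hierarchy file roughly as follows. This sketch is not the contents of `parse_interpro.py`. It assumes that `ParentChildTreeFile.txt` lists one entry per line as `ACCESSION::name::`, with each nesting level marked by two leading hyphens; check this against the downloaded file before relying on it. The accessions in the sample text are made-up placeholders.

```python
from typing import Dict, List, Optional


def parse_tree(text: str) -> Dict[str, Optional[str]]:
    """Map each accession to its parent accession (None for root entries)."""
    parent_of: Dict[str, Optional[str]] = {}
    stack: List[str] = []  # stack[d] is the latest accession seen at depth d
    for raw in text.splitlines():
        if not raw.strip():
            continue
        stripped = raw.lstrip("-")
        depth = (len(raw) - len(stripped)) // 2  # two hyphens per level (assumed)
        accession = stripped.split("::", 1)[0]
        del stack[depth:]  # drop entries that are not ancestors of this line
        parent_of[accession] = stack[-1] if stack else None
        stack.append(accession)
    return parent_of


def siblings(accession: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return entries sharing the same parent; root entries have no siblings."""
    parent = parent_of.get(accession)
    if parent is None:
        return []
    return sorted(a for a, p in parent_of.items() if p == parent and a != accession)


# Made-up sample in the assumed format.
SAMPLE = """IPR900001::Hypothetical parent::
--IPR900002::Hypothetical child A::
--IPR900003::Hypothetical child B::
----IPR900004::Hypothetical grandchild::
IPR900005::Hypothetical standalone::
"""

tree = parse_tree(SAMPLE)
print(siblings("IPR900002", tree))
```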
## Known Limitations
1. **Distractors are automatically generated**, not human-reviewed. Unlike MMLU
or MedMCQA, whose distractors were written by human exam authors, ours
come from random sampling or hierarchy traversal. Some may be too easy
(e.g. a completely unrelated domain offered against a specific active site),
which would inflate accuracy.
2. **InterPro hierarchy coverage is low** for `Active_site`, `Binding_site`,
`Conserved_site`, and `Repeat` entry types — EBI does not arrange these
into hierarchies. As a result, Tier 1 (sibling-based) distractors only
apply to a minority of samples (see `distractor_source` column for each
sample's strategy).
3. **Random baseline is 25%** (not 1/N of the original label space).
Accuracy numbers from this MCQ benchmark should be interpreted against
this 25% baseline, not the paper's full-label-space accuracy.
4. **`interpro_label` field is preserved for traceability but not used** in
MCQ scoring. MCQ scoring compares `pred_letter` to `correct_letter`.
5. **Not comparable to VenusX paper Table 4** numbers. The paper reports
ESM2 probe accuracy on the full label space; we report LLM accuracy on
a 4-way MCQ. Same metric name (ACC), different semantics.
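
The scoring rule in items 3 and 4 amounts to a letter comparison reported next to the 25% chance level. A minimal sketch follows, assuming predictions have already been reduced to a letter or `None`; the function name `mcq_accuracy` and the constant `CHANCE_LEVEL` are illustrative, not part of the dataset.

```python
from typing import Optional, Sequence

CHANCE_LEVEL = 1 / 4  # four options, exactly one correct


def mcq_accuracy(
    pred_letters: Sequence[Optional[str]], correct_letters: Sequence[str]
) -> float:
    """Fraction of samples where pred_letter equals correct_letter.

    A prediction of None (no parseable answer) counts as incorrect.
    """
    if len(pred_letters) != len(correct_letters):
        raise ValueError("predictions and references differ in length")
    if not correct_letters:
        raise ValueError("no samples to score")
    hits = sum(p == c for p, c in zip(pred_letters, correct_letters))
    return hits / len(correct_letters)


acc = mcq_accuracy(["A", "B", None, "D"], ["A", "C", "C", "D"])
print(f"accuracy={acc:.2f} (chance={CHANCE_LEVEL:.2f})")
```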
## Citation
If you use this MCQ-reformulated dataset, please cite the original VenusX paper and link to this fork:
```bibtex
@inproceedings{venusx2026,
title={VenusX: Unlocking Fine-Grained Functional Understanding of Proteins},
author={Tan, Yang and others},
booktitle={ICLR},
year={2026},
url={https://arxiv.org/abs/2505.11812}
}
```
## References
The MCQ reformulation methodology draws from the following literature:
- Hendrycks et al., *Measuring Massive Multitask Language Understanding* (MMLU), ICLR 2021
- Pal et al., *MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering*, CHIL 2022. https://arxiv.org/abs/2203.14371
- El-Sanyoury et al., *Automatic distractor generation in multiple-choice questions: a systematic literature review*, PeerJ Computer Science 2024. https://pmc.ncbi.nlm.nih.gov/articles/PMC11623049/
- Susanti, Iida & Tokunaga, *Automatic Generation of English Vocabulary Tests* (WordNet-based distractor), CSEDU 2015
- Gene Ontology sibling negatives: Frontiers in Genetics 2020, BMC Bioinformatics 2009
## License
Derived from VenusX (AI4Protein). InterPro data is licensed CC-BY-4.0 from EMBL-EBI.
This fork is released under CC-BY-4.0.
## Contact
Built by [hauser7733](https://huggingface.co/hauser7733) as part of the
[SiEval](https://github.com/scitix/sieval) evaluation framework. Questions or
issues: open an issue at the SiEval repo.
Provider: Hauser7733



