Hauser7733/VenusX_Frag_BindI_MF50_MCQ4

Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Hauser7733/VenusX_Frag_BindI_MF50_MCQ4
Description:
---
license: cc-by-4.0
task_categories:
- multiple-choice
- question-answering
language:
- en
tags:
- protein
- bioinformatics
- interpro
- mcq
- venusx
size_categories:
- 1K<n<10K
---

# VenusX Fragment MCQ — BindI (MF50)

4-choice multiple-choice question reformulation of the **VenusX Fragment-level BindI** sub-task from the paper [VenusX: Unlocking Fine-Grained Functional Understanding of Proteins](https://arxiv.org/abs/2505.11812) (ICLR 2026).

This fork adapts the original N-way classification task into a 4-choice MCQ so that **zero-shot LLMs** can be evaluated fairly. The original task required the model to output one of 76 InterPro class labels, which is infeasible for a CausalLM that cannot know which IPR IDs are in the benchmark's label subspace.

## Schema

Each sample is one multiple-choice question with exactly **one** correct answer.

| Field | Type | Description |
|-------|------|-------------|
| `uid` | str | Original VenusX sample UID |
| `seq_fragment` | str | The protein fragment amino acid sequence |
| `annotation` | str | `"BindI"` (sub-task name) |
| `interpro_label` | int | Original VenusX integer label (preserved for compatibility) |
| `correct_ipr` | str | The correct InterPro accession (e.g. `IPR000169`) |
| `correct_letter` | str | `"A"` / `"B"` / `"C"` / `"D"`: the letter whose option matches `correct_ipr` |
| `option_{a,b,c,d}_ipr` | str | InterPro accession for each option |
| `option_{a,b,c,d}_desc` | str | Human-readable description from InterPro `entry.list` |
| `distractor_source` | str | How the 3 distractors were picked: `hierarchy`, `mixed`, or `pool` (see below) |

## How the MCQ is constructed

For each sample, we start with one golden InterPro ID (the original label) and pick 3 distractors via a **2-tier fallback**:

1. **Tier 1 (InterPro hierarchy siblings)**: If the golden IPR is in the InterPro hierarchy tree, take true siblings (same parent, not the golden itself, not any ancestor/descendant of the golden). We use up to 3.
2. **Tier 2 (same sub-task label pool, random)**: Fill the remaining slots by random sampling from the full label pool of the same sub-task, excluding the golden, all its ancestors/descendants, and any Tier 1 picks.

All 4 options (1 golden + 3 distractors) are then shuffled to randomize letter positions (A/B/C/D), using a **deterministic per-sample seed** derived from `{annotation}:{split}:{uid}:{golden}` so every dataset build is bit-identical and reviewers can independently verify each MCQ.

The `distractor_source` column records which strategy was used:

- `hierarchy`: all 3 distractors are InterPro siblings
- `mixed`: some are siblings, some are random pool samples
- `pool`: all 3 are random pool samples (most common for `Active_site` / `Binding_site` / `Conserved_site` types, which have **no** InterPro hierarchy per EBI convention)

## Why descriptions are included

The original free-text task expected the LLM to directly output an IPR ID like `IPR019757`. This is unfair because the LLM has no way to know which specific IPRs are in the benchmark's small label subspace. Our MCQ format exposes the 4 candidate options **with human-readable names** (e.g. `IPR019757 — Peptidase S26A, signal peptidase I, lysine active site`) so that the LLM can use its biological knowledge to match the fragment's features against the candidate functional descriptions.
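
The 2-tier fallback and the deterministic shuffle described under "How the MCQ is constructed" can be sketched roughly as follows. This is an illustrative sketch, not the code in `scripts/build_mcq.py`: the function names, the use of SHA-256 to turn the seed string into an integer, and the rule of taking the first 3 siblings in sorted order are all assumptions made for the example.

```python
import hashlib
import random


def pick_distractors(golden, siblings, pool, excluded, rng, k=3):
    """Hypothetical helper: Tier 1 siblings first, then Tier 2 random fill."""
    # Tier 1: true siblings, never the golden or its ancestors/descendants.
    # Taking the first k in sorted order is an assumption of this sketch.
    tier1 = [s for s in sorted(siblings) if s != golden and s not in excluded][:k]
    # Tier 2: random fill from the sub-task label pool, minus everything banned.
    banned = {golden, *excluded, *tier1}
    candidates = sorted(p for p in pool if p not in banned)
    tier2 = rng.sample(candidates, k - len(tier1))  # ValueError if pool too small
    if len(tier1) == k:
        source = "hierarchy"
    elif tier1:
        source = "mixed"
    else:
        source = "pool"
    return tier1 + tier2, source


def build_mcq(uid, split, annotation, golden, siblings, pool, excluded=frozenset()):
    """Hypothetical builder: returns shuffled options plus the correct letter."""
    # Seed string format is from the dataset card; hashing it with SHA-256
    # to obtain an integer seed is an assumption of this sketch.
    key = f"{annotation}:{split}:{uid}:{golden}"
    seed = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    distractors, source = pick_distractors(golden, siblings, pool, excluded, rng)
    options = [golden] + distractors
    rng.shuffle(options)
    return {
        "options": options,
        "correct_letter": "ABCD"[options.index(golden)],
        "distractor_source": source,
    }
```

Because the random generator is seeded only from per-sample fields, calling `build_mcq` twice with the same arguments yields the same option order, which is the property that makes each MCQ independently verifiable.
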
## Example

```
Fragment: IHCIAGLGRTP

A) IPR033694 — Pyroglutamyl peptidase I, Cys active site
B) IPR023411 — Ribonuclease A, active site
C) IPR016130 — Protein-tyrosine phosphatase, active site   ← correct
D) IPR000169 — Cysteine peptidase, cysteine active site

[ANSWER]C[/ANSWER]
```

## Build Reproducibility

This dataset is fully reproducible from the included build scripts and reference files:

```
scripts/
  parse_interpro.py          # Parses InterPro flat files into a queryable cache
  build_mcq.py               # Builds MCQ samples with 2-tier distractor fallback
reference/
  entry.list                 # InterPro entries dump (downloaded 2026-04-09)
  ParentChildTreeFile.txt    # InterPro hierarchy tree (downloaded 2026-04-09)
  label_pool.json            # Union label pool across all 5 sub-tasks
```

To rebuild:

```bash
# 1. Refresh InterPro flat files (optional — pinned versions included)
curl -O https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/entry.list
curl -O https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/ParentChildTreeFile.txt

# 2. Parse into cache
python scripts/parse_interpro.py

# 3. Build MCQ (loads original VenusX Fragment datasets from AI4Protein/*)
python scripts/build_mcq.py
```

## Known Limitations

1. **Distractors are automatically generated**, not peer-reviewed. Unlike MMLU or MedMCQA whose distractors are human-written by exam boards, our distractors come from random sampling / hierarchy traversal. Some may be too easy (e.g. a completely unrelated domain for a specific active site), inflating accuracy.
2. **InterPro hierarchy coverage is low** for `Active_site`, `Binding_site`, `Conserved_site`, and `Repeat` entry types; EBI does not arrange these into hierarchies. As a result, Tier 1 (sibling-based) distractors only apply to a minority of samples (see `distractor_source` column for each sample's strategy).
3. **Random baseline is 25%** (not 1/N of the original label space). Accuracy numbers from this MCQ benchmark should be interpreted against this 25% baseline, not the paper's full-label-space accuracy.
4. **`interpro_label` field is preserved for traceability but not used** in MCQ scoring. MCQ scoring compares `pred_letter` to `correct_letter`.
5. **Not comparable to VenusX paper Table 4** numbers. The paper reports ESM2 probe accuracy on the full label space; we report LLM accuracy on a 4-way MCQ. Same metric name (ACC), different semantics.

## Citation

If you use this MCQ-reformulated dataset, please cite both the original VenusX paper and this fork:

```bibtex
@inproceedings{venusx2026,
  title={VenusX: Unlocking Fine-Grained Functional Understanding of Proteins},
  author={Tan, Yang and others},
  booktitle={ICLR},
  year={2026},
  url={https://arxiv.org/abs/2505.11812}
}
```

## References

The MCQ reformulation methodology draws from the following literature:

- Hendrycks et al., *Measuring Massive Multitask Language Understanding* (MMLU), ICLR 2021
- Pal et al., *MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering*, CHIL 2022. https://arxiv.org/abs/2203.14371
- El-Sanyoury et al., *Automatic distractor generation in multiple-choice questions: a systematic literature review*, PeerJ Computer Science 2024. https://pmc.ncbi.nlm.nih.gov/articles/PMC11623049/
- Susanti, Iida & Tokunaga, *Automatic Generation of English Vocabulary Tests* (WordNet-based distractor), CSEDU 2015
- Gene Ontology sibling negatives: Frontiers in Genetics 2020, BMC Bioinformatics 2009

## License

Derived from VenusX (AI4Protein). InterPro data is licensed CC-BY-4.0 from EMBL-EBI. This fork is released under CC-BY-4.0.

## Contact

Built by [hauser7733](https://huggingface.co/hauser7733) as part of the [SiEval](https://github.com/scitix/sieval) evaluation framework. Questions or issues: open an issue at the SiEval repo.
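
As a rough illustration of the scoring rule noted under Known Limitations (comparing `pred_letter` to `correct_letter`), the sketch below extracts a letter from a model output in the `[ANSWER]X[/ANSWER]` format shown in the Example and computes accuracy. The function names are hypothetical, and two choices are assumptions of this sketch rather than documented SiEval behavior: the last tagged answer wins, and an output with no parseable tag counts as incorrect.

```python
import re

# Matches the [ANSWER]X[/ANSWER] tag from the Example section.
ANSWER_RE = re.compile(r"\[ANSWER\]\s*([ABCD])\s*\[/ANSWER\]")


def extract_letter(text):
    """Return the last tagged letter in the model output, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


def accuracy(outputs, correct_letters):
    """Fraction of outputs whose extracted pred_letter equals correct_letter."""
    if len(outputs) != len(correct_letters):
        raise ValueError("outputs and correct_letters must have the same length")
    if not correct_letters:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(
        extract_letter(out) == gold for out, gold in zip(outputs, correct_letters)
    )
    return hits / len(correct_letters)
```

A score from `accuracy` is meaningful only relative to the 25% random baseline of a 4-way MCQ.
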
Provider:
Hauser7733