THGLab/LinkLlama-cap50-train
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/THGLab/LinkLlama-cap50-train
Resource description:
---
license: mit
---
# LinkLlama cap-50 training JSONL (`chembl36_balanced_cap50.jsonl`)
## Dataset summary
This file is the **supervised fine-tuning (SFT) corpus** used to train the **LinkLlama cap-50** model. Each line is one JSON object in an **instruction-style** layout, compatible with common trainers such as Axolotl that accept Alpaca-style fields.
**Provenance**
- Parent structures were drawn from **ChEMBL** (ChEMBL36 pipeline described in the LinkLlama paper).
- Molecules were **fragmented** into fragment–linker–fragment triplets; molecular and linker **properties** and **reasonability** heuristics were computed.
- A **cap-50** balancing scheme was applied so that no single linker SMILES appears more than 50 times in the final training set, which reduces memorization of frequent linkers.
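The capping step above can be illustrated with a minimal sketch. It is not the LinkLlama pipeline's implementation: the `linker` field name, the random subsampling of over-represented linkers, and the `cap_by_key` helper are all assumptions made for illustration. The paper and repository describe how the actual balancing selects which examples to keep.

```python
import random
from collections import Counter, defaultdict


def cap_by_key(records, key, cap=50, seed=0):
    """Keep at most `cap` records per distinct value of `key(record)`.

    Hypothetical helper: groups above the cap are randomly subsampled,
    which is an assumption and may differ from the real pipeline.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept = []
    for group in groups.values():
        if len(group) > cap:
            group = rng.sample(group, cap)
        kept.extend(group)
    return kept


# Toy data: one linker appears 120 times, another only 3 times.
# The "linker" field is a placeholder, not a documented column of the JSONL.
records = [{"linker": "C(=O)N", "id": i} for i in range(120)]
records += [{"linker": "CCO", "id": i} for i in range(3)]

capped = cap_by_key(records, key=lambda r: r["linker"], cap=50)
per_linker = Counter(r["linker"] for r in capped)
```

In this toy case the frequent linker is cut to 50 occurrences while the rare one is kept in full, which is the effect the cap is meant to have on the training distribution.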
**Scale (approximate)**
- Approximately **1.6M** training lines after balancing (the exact count may vary slightly with pipeline version).
## File format
- **Format:** JSON Lines (`.jsonl`), UTF-8, one JSON object per line.
- **Fields:** Follow the LinkLlama / Axolotl Alpaca-style convention used in the public training configs (`instruction`, `input`, `output`, etc.). See the LinkLlama GitHub `linkllama/llm/sft_corpus.py` and paper for the exact prompt and response structure.
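A quick way to confirm the layout described above is to stream the file and tally which keys appear. The sketch below uses only the Python standard library. The helper names `iter_jsonl` and `summarize_fields` are illustrative, and the local path is an assumption about where you saved the download.

```python
import json
from collections import Counter


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a UTF-8 JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc


def summarize_fields(path, limit=1000):
    """Return (records_read, Counter of field names) for the first `limit` records."""
    counts = Counter()
    n = 0
    for n, rec in enumerate(iter_jsonl(path), start=1):
        counts.update(rec.keys())
        if n >= limit:
            break
    return n, counts


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was downloaded.
    n, counts = summarize_fields("chembl36_balanced_cap50.jsonl")
    print(f"read {n} records")
    for field, count in counts.most_common():
        print(f"{field}: present in {count} records")
```

Streaming line by line avoids loading the full corpus into memory, and the per-field counts make it easy to spot records that are missing `instruction`, `input`, or `output` before handing the file to a trainer.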
## Intended use
- Reproducing or extending **LinkLlama** fine-tuning.
- Research on **linker-focused** generative models and chemical NLP.
**Not intended for:** building general-purpose chat models or unrelated NLP benchmarks without additional curation.
## Hugging Face Hub note
For dataset repositories, paste the **Dataset summary**, **File format**, and **Citation** sections into the Hub `README.md` as well. Keep this `data.md` next to the `.jsonl` in the uploaded bundle so downloaders see documentation alongside the file.
## Limitations
- Reflects **ChEMBL-like** drug-like chemistry; coverage of exotic scaffolds is not guaranteed.
- Reasonability labels are **rule-based** heuristics (PAINS, REOS-like checks, ring patterns, etc.), not experimental validation.
## Citation
If you use this dataset, cite the LinkLlama preprint:
**bioRxiv:** https://www.biorxiv.org/content/10.64898/2026.04.15.718690v1
```bibtex
@article{sun_linkllama_2026,
title = {{LinkLlama}: {Enabling} {Large} {Language} {Model} for {Chemically} {Reasonable} {Linker} {Design}},
author = {Sun, Kunyang and Wang, Yingze Eric and Purnomo, Justin Clement and Cavanagh, Joseph M. and Alteri, Giovanni Battista and Head-Gordon, Teresa},
year = {2026},
doi = {10.64898/2026.04.15.718690},
url = {https://www.biorxiv.org/content/10.64898/2026.04.15.718690v1},
journal = {bioRxiv},
}
```
## License
ChEMBL content is subject to the **ChEMBL data license** (see EMBL-EBI ChEMBL terms for the release you used). This derived JSONL is provided for research reproducibility; ensure your use complies with ChEMBL and your institutional policies.
Provider:
THGLab



