THGLab/LinkLlama-cap50-train

Name: THGLab/LinkLlama-cap50-train
Creator: THGLab
Published: 2026-04-18 04:19:36
License: No description available

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/THGLab/LinkLlama-cap50-train


Resource description:

---
license: mit
---

# LinkLlama cap-50 training JSONL (`chembl36_balanced_cap50.jsonl`)

## Dataset summary

This file is the **supervised fine-tuning (SFT) corpus** used to train the **LinkLlama cap-50** model. Each line is one JSON object in an **instruction-style** layout (compatible with common trainers such as Axolotl / Alpaca-style fields).

**Provenance**

- Parent structures were drawn from **ChEMBL** (ChEMBL36 pipeline described in the LinkLlama paper).
- Molecules were **fragmented** into fragment–linker–fragment triplets; molecular and linker **properties** and **reasonability** heuristics were computed.
- A **cap-50** balancing scheme was applied so that no single linker SMILES appears more than 50 times in the final training set (reduces memorization of frequent linkers).

**Scale (approximate)**

- On the order of **~1.6M** training lines after balancing (exact count may vary slightly with pipeline version).

## File format

- **Format:** JSON Lines (`.jsonl`), UTF-8, one JSON object per line.
- **Fields:** Follow the LinkLlama / Axolotl Alpaca-style convention used in the public training configs (`instruction`, `input`, `output`, etc.). See the LinkLlama GitHub `linkllama/llm/sft_corpus.py` and paper for the exact prompt and response structure.

## Intended use

- Reproducing or extending **LinkLlama** fine-tuning.
- Research on **linker-focused** generative models and chemical NLP.

**Not intended for:** building general-purpose chat models or unrelated NLP benchmarks without additional curation.

## Hugging Face Hub note

For dataset repositories, paste the **Dataset summary**, **File format**, and **Citation** sections into the Hub `README.md` as well. Keep this `data.md` next to the `.jsonl` in the uploaded bundle so downloaders see documentation alongside the file.

## Limitations

- Reflects **ChEMBL-like** drug-like chemistry; coverage of exotic scaffolds is not guaranteed.
- Reasonability labels are **rule-based** heuristics (PAINS, REOS-like checks, ring patterns, etc.), not experimental validation.

## Citation

If you use this dataset, cite the LinkLlama preprint:

**bioRxiv:** https://www.biorxiv.org/content/10.64898/2026.04.15.718690v1

```bibtex
@article{sun_linkllama_2026,
  title = {{LinkLlama}: {Enabling} {Large} {Language} {Model} for {Chemically} {Reasonable} {Linker} {Design}},
  author = {Sun, Kunyang and Wang, Yingze Eric and Purnomo, Justin Clement and Cavanagh, Joseph M. and Alteri, Giovanni Battista and Head-Gordon, Teresa},
  year = {2026},
  doi = {10.64898/2026.04.15.718690},
  url = {https://www.biorxiv.org/content/10.64898/2026.04.15.718690v1},
  journal = {bioRxiv},
}
```

## License

ChEMBL content is subject to the **ChEMBL data license** (see EMBL-EBI ChEMBL terms for the release you used). This derived JSONL is provided for research reproducibility; ensure your use complies with ChEMBL and your institutional policies.
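
## Example: reading the JSONL and applying a per-linker cap

The JSON Lines layout and the cap-50 balancing idea described in this card can be illustrated with a minimal Python sketch. This is not the LinkLlama pipeline: the `linker` field used as the grouping key is hypothetical (the card only names `instruction`, `input`, and `output`), the toy records are invented, and sampling at random within each group is an assumption about how a cap might be enforced. The small `cap=3` stands in for the real cap of 50.

```python
import io
import json
import random
from collections import defaultdict


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def cap_by_key(records, key_fn, cap=50, seed=0):
    """Keep at most `cap` records per key (random sample within each group)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    rng = random.Random(seed)
    kept = []
    for recs in groups.values():
        if len(recs) > cap:
            recs = rng.sample(recs, cap)  # assumption: uniform sampling
        kept.extend(recs)
    return kept


# Toy in-memory file; "linker" is a hypothetical field used only for grouping.
toy_lines = [
    json.dumps({
        "instruction": "Design a linker for the two fragments.",
        "input": f"fragment pair {i}",
        "output": "CCO" if i < 5 else "CCN",
        "linker": "CCO" if i < 5 else "CCN",
    })
    for i in range(7)
]
records = list(read_jsonl(io.StringIO("\n".join(toy_lines))))
balanced = cap_by_key(records, key_fn=lambda r: r["linker"], cap=3)
print(len(records), len(balanced))
```

For the real file, the same `read_jsonl` generator can be passed a handle from `open("chembl36_balanced_cap50.jsonl", encoding="utf-8")`, which streams the roughly 1.6M lines without loading them all into memory at once.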

Provided by:

THGLab
