
InChIKey-Deduplicated ClassyFire/ChemOnt Label Collection

Source: Zenodo. Updated 2026-05-15. Indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20108564
Description:
An InChIKey-deduplicated aggregation of ClassyFire / ChemOnt chemical class labels for 73,105,281 unique compounds, assembled from PubChem and ZINC20-aligned sources contributed by collaborating laboratories. This release supersedes v1 with two changes: the canonical five-tier hierarchical path is now correct for every row, and the non-hierarchical ClassyFire labels (intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features) are exposed in a new chemont_other_json column.

These labels cannot be fully verified against the live ClassyFire model. The dataset is an aggregation of ClassyFire results contributed by several laboratories across different time periods, not a fresh classification run, and the exact source of every individual label is not always recoverable. Re-classifying every compound through the official endpoint at classyfire.wishartlab.com, to confirm that the labels match what ClassyFire would emit today, is not practically feasible. The service accepts one InChIKey per request and is rate-limited to about 12 requests per minute, so a single re-classification pass over the ≈73 M compounds in this release would take more than a decade (73,105,281 requests at 12 per minute is roughly 6.1 million minutes, or about 11.6 years). Historical version differences or occasional classification errors may therefore be present. However, the same InChIKey is occasionally reported by several of the contributing sources, and their independent ClassyFire classifications agree on the labels we retain. Cross-source agreement therefore acts as a confidence signal in place of direct re-verification: broad label corruption would have produced systematic disagreement between sources, which is not what we observe.

Identity is InChIKey-only. The dataset is keyed by the 27-character standard InChIKey. It is not tautomer-normalized and does not attempt cross-tautomer identity resolution. If you need that, normalize the SMILES yourself before joining.

Coverage is not complete.
704 of the 4,824 ChemOnt classes (14.6 %) do not appear on any molecule in this release, mostly in exotic inorganic and lanthanide or actinide chemistry that is genuinely under-represented in PubChem and ZINC20. See the Coverage breakdown section.

What changed since v1

Corrected hierarchical paths. The previous release inadvertently encoded 3,388,432 rows (4.64 %) in the ClassyFire v2.1 schema layout, in which the meta-root Chemical entities (numeric ID 9999999) occupies the kingdom slot and every other tier is shifted down by one, pushing the rightful subclass out of the five-slot path. The aggregator has been corrected to detect the v2.1 layout at label-mapping time and shift it back. All affected rows now carry the canonical [kingdom, superclass, class, subclass, direct_parent] layout, and 9999999 no longer appears in any tree slot.

Out-of-tree labels are now exposed. v1 carried only the five-slot hierarchical path. v2 adds a chemont_other_json column that records the non-hierarchical labels ClassyFire reports per molecule, namely intermediate_nodes, alternative_parents, geometric_descriptor, substituents, and mapped_features. Each value resolves to numeric ChemOnt IDs through the bundled dictionary. Including these labels raises class coverage from 3,798 / 4,824 (78.7 %) as a strict hierarchical-path label to 4,120 / 4,824 (85.4 %) anywhere in the classification.

Headline figures

| Metric | Value |
| --- | --- |
| Unique InChIKeys (rows) | 73,105,281 |
| Rows with PubChem CID | 68,693,298 |
| Rows with ZINC20 ID | 30,187,323 |
| Rows carrying both PubChem CID and ZINC20 ID | 25,775,344 |
| Rows with at least one unresolved tree slot | 119,795 |
| ChemOnt classes seen at least once | 4,120 / 4,824 (85.4 %) |
| ChemOnt classes never assigned to any molecule | 704 / 4,824 (14.6 %) |

Coverage breakdown

Distribution of the 4,824 ChemOnt classes by the number of molecules that carry the class as a label anywhere in the classification (five-slot hierarchical path plus the chemont_other_json fields).

| Bucket | Classes | Share of 4,824 |
| --- | --- | --- |
| 0 rows (class never assigned) | 704 | 14.59 % |
| Exactly 1 row | 74 | 1.53 % |
| ≥ 1 row (covered) | 4,120 | 85.41 % |
| ≥ 10 rows | 3,819 | 79.17 % |
| ≥ 100 rows | 3,258 | 67.54 % |
| ≥ 1,000 rows | 2,422 | 50.21 % |
| ≥ 10,000 rows | 1,489 | 30.87 % |
| ≥ 100,000 rows | 684 | 14.18 % |
| ≥ 1,000,000 rows | 183 | 3.79 % |

Where the 704 never-seen classes cluster

Most of the unrepresented classes belong to exotic inorganic chemistry or to lanthanide and actinide oxoanionic chemistry, which is not present at scale in PubChem or ZINC20.

| Children at zero coverage | Parent class |
| --- | --- |
| 44 | Actinide oxoanionic compounds |
| 32 | Metalloid oxoanionic compounds |
| 29 | Lanthanide oxoanionic compounds |
| 24 | Organic acids and derivatives |
| 22 | Post-transition metal oxoanionic compounds |
| 14 | Miscellaneous inorganic compounds |
| 14 | Alkaline earth metal oxoanionic compounds |
| 13 | Triterpenoids |
| 12 | Organic oxoanionic compounds |
| 11 | Nucleosides, nucleotides, and analogues |

Call for contributions

If your group has run ClassyFire on compounds that fall into any of the 704 unrepresented classes or into rare classes with ten or fewer examples, please contribute them to the next release. The list of zero-coverage parent groups above and the Coverage breakdown section indicate which classes are most in need of additional evidence. The most useful contribution is a TSV of (InChIKey, SMILES, ClassyFire JSON response) tuples covering one or more of those classes.
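
For groups preparing a contribution, the following is a minimal sketch of how one row of the requested (InChIKey, SMILES, ClassyFire JSON response) TSV could be assembled. The function name `format_row`, the column order, and the compact one-line JSON serialization are assumptions made for illustration, not a format mandated by this release. The sample response is a placeholder rather than real ClassyFire output, and aspirin is used only to show the row shape, not because it belongs to a rare class.

```python
import json
import re

# Standard InChIKey shape (27 characters): 14 letters, a hyphen,
# 8 letters plus "SA" (standard flag and version), a hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{8}SA-[A-Z]")


def format_row(inchikey, smiles, response):
    """Return one tab-separated line: InChIKey, SMILES, compact JSON."""
    if not INCHIKEY_RE.fullmatch(inchikey):
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    # json.dumps escapes tabs and newlines inside strings, so the payload
    # always stays on one line and never contains a raw tab.
    payload = json.dumps(response, separators=(",", ":"))
    fields = (inchikey, smiles, payload)
    if any("\t" in field or "\n" in field for field in fields):
        raise ValueError("fields must not contain raw tabs or newlines")
    return "\t".join(fields)


# Placeholder payload; a real row would carry the full ClassyFire response.
row = format_row(
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "CC(=O)OC1=CC=CC=C1C(=O)O",
    {"placeholder": "ClassyFire JSON response goes here"},
)
print(row)
```

Keeping the JSON on a single line means each compound occupies exactly one TSV row, which makes later deduplication by InChIKey straightforward.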

Files in this release

| File | Description |
| --- | --- |
| classyfire_dedup_inchikey_smiles.enriched.tsv.zst | The labelled dataset, zstd-compressed TSV with one row per InChIKey |
| classyfire_dedup_inchikey_smiles.enriched.head20.tsv | First 20 data rows as a plain TSV, a preview of schema and content without downloading the full file |
| chemont_dictionary.tsv | ChemOnt numeric-ID, name, and parent map needed to decode the JSON columns |

Dataset schema

| Column | Description |
| --- | --- |
| inchikey | 27-character standard InChIKey |
| cid | PubChem CID (empty when the compound is not in PubChem) |
| zinc_id | ZINC20 identifier (empty when the compound is not in ZINC20) |
| smiles | Representative SMILES |
| chemont_tree_json | Five-element JSON array of numeric ChemOnt IDs for [kingdom, superclass, class, subclass, direct_parent]. null indicates an unresolved slot |
| chemont_other_json | JSON object with optional keys intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features. Each value holds sorted, unique numeric ChemOnt IDs |
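
As an illustration of how the schema might be consumed, the sketch below decodes one chemont_tree_json value into class names and checks the two properties stated in this description: the array has five slots, and the meta-root 9999999 does not appear in any of them. The dictionary column names (`chemont_id`, `name`, `parent_id`) are assumptions, since the header of chemont_dictionary.tsv is not given here, and the sample dictionary rows are invented placeholders, not real ChemOnt entries.

```python
import csv
import io
import json

# Meta-root "Chemical entities"; it should not occur in any tree slot.
META_ROOT_ID = 9999999
TREE_LEVELS = ("kingdom", "superclass", "class", "subclass", "direct_parent")


def load_names(handle, id_col="chemont_id", name_col="name"):
    """Build {numeric ChemOnt ID: name}. The column names are assumptions."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {int(row[id_col]): row[name_col] for row in reader}


def decode_tree(tree_json, names):
    """Decode one chemont_tree_json value into {level: name or None}."""
    slots = json.loads(tree_json)
    if len(slots) != len(TREE_LEVELS):
        raise ValueError(f"expected {len(TREE_LEVELS)} slots, got {len(slots)}")
    if META_ROOT_ID in slots:
        raise ValueError("meta-root 9999999 found in a tree slot")
    # A JSON null (Python None) marks an unresolved slot and is kept as None.
    return {
        level: None if cid is None else names.get(cid, f"unknown:{cid}")
        for level, cid in zip(TREE_LEVELS, slots)
    }


# Invented placeholder rows standing in for chemont_dictionary.tsv.
SAMPLE_DICTIONARY = (
    "chemont_id\tname\tparent_id\n"
    "1\tExample kingdom\t\n"
    "2\tExample superclass\t1\n"
    "3\tExample class\t2\n"
    "4\tExample direct parent\t3\n"
)

names = load_names(io.StringIO(SAMPLE_DICTIONARY))
decoded = decode_tree("[1, 2, 3, null, 4]", names)
for level, name in decoded.items():
    print(f"{level}: {name}")
```

The same `decode_tree` function could be applied row by row to the head20 preview file. The full dataset would first need decompressing, for example with `zstd -d`.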

Provider: Zenodo
Created: 2026-05-10