InChIKey-Deduplicated ClassyFire/ChemOnt Label Collection
Source: Zenodo. Updated 2026-05-15. Indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20108564
Description:
An InChIKey-deduplicated aggregation of ClassyFire / ChemOnt chemical class labels for 73,105,281 unique compounds, assembled from PubChem and ZINC20-aligned sources contributed by collaborating laboratories. This release supersedes v1 with two changes: the canonical five-tier hierarchical path is now correct for every row, and the non-hierarchical ClassyFire labels (intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features) are exposed in a new chemont_other_json column.
These labels cannot be fully verified against the live ClassyFire model. The dataset is an aggregation of ClassyFire results contributed by several laboratories across different time periods, not a fresh classification run, and the exact source of every individual label is not always recoverable. Re-classifying every compound through the official endpoint at classyfire.wishartlab.com to confirm that the labels match what ClassyFire would emit today is not practically feasible: the service accepts one InChIKey per request and is rate-limited to about 12 requests per minute, so a single re-classification pass over the ≈73 M compounds in this release would take more than a decade of uninterrupted querying (73,105,281 / 12 ≈ 6.1 million minutes, or about 11.6 years). Historical version differences or occasional classification errors may therefore be present. However, where the same InChIKey is reported by several of the contributing sources, their independent ClassyFire classifications agree on the labels we retain. Cross-source agreement therefore acts as a confidence signal in place of direct re-verification, because broad label corruption would have produced systematic disagreement between sources, which is not what we observe.
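The scale of that estimate can be checked with a few lines of arithmetic. The sketch below assumes uninterrupted querying at exactly 12 requests per minute with one InChIKey per request. The constant and function names are hypothetical, and the idealized conditions make the result a lower bound that retries, downtime, or stricter throttling would only lengthen.

```python
# Back-of-envelope duration of one full re-classification pass.
N_COMPOUNDS = 73_105_281     # unique InChIKeys in this release
REQUESTS_PER_MINUTE = 12     # approximate rate limit quoted in the text (assumed exact here)
MINUTES_PER_YEAR = 60 * 24 * 365.25


def years_for_full_pass(n_compounds=N_COMPOUNDS, rate=REQUESTS_PER_MINUTE):
    """Years of continuous querying needed at one InChIKey per request."""
    minutes = n_compounds / rate
    return minutes / MINUTES_PER_YEAR


print(f"{years_for_full_pass():.1f} years")  # prints "11.6 years"
```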
Identity is InChIKey-only. The dataset is keyed by the 27-character standard InChIKey. It is not tautomer-normalized and does not attempt cross-tautomer identity resolution. If you need that, normalize the SMILES yourself before joining.
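Before joining another table on this key, it can help to reject strings that are not well-formed standard InChIKeys. The following is a minimal sketch, assuming the usual standard layout: a 14-letter block, a 10-letter block ending in the flag S and the version letter A, and one protonation letter. The function name is hypothetical, and the check is purely syntactic, so it does nothing about tautomers.

```python
import re

# 14 letters, hyphen, 8 letters + "S" (standard) + "A" (version), hyphen, 1 letter.
STANDARD_INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{8}SA-[A-Z]")


def is_standard_inchikey(key: str) -> bool:
    """Return True if `key` has the 27-character standard InChIKey shape."""
    return len(key) == 27 and STANDARD_INCHIKEY.fullmatch(key) is not None


# Ethanol's standard InChIKey passes; a lowercase or truncated key does not.
print(is_standard_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
print(is_standard_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA"))
```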
Coverage is not complete. 704 of the 4,824 ChemOnt classes (14.6 %) do not appear on any molecule in this release, mostly in exotic inorganic and lanthanide or actinide chemistry that is genuinely under-represented in PubChem and ZINC20. See the Coverage breakdown section.
What changed since v1
Corrected hierarchical paths. The previous release inadvertently encoded 3,388,432 rows (4.64 %) in the ClassyFire v2.1 schema layout, in which the meta-root Chemical entities (numeric ID 9999999) occupies the kingdom slot and every other tier is shifted down by one, pushing the rightful subclass out of the five-slot path. The aggregator has been corrected to detect the v2.1 layout at label-mapping time and shift it back. All affected rows now carry the canonical [kingdom, superclass, class, subclass, direct_parent] layout, and 9999999 no longer appears in any tree slot.
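Readers who want to confirm the corrected layout on their own copy could run a check along these lines on each chemont_tree_json value. This is a reader-side sanity check under the five-slot layout described above, not the aggregator's actual correction code. The function name is hypothetical, and the IDs in the usage line are placeholders.

```python
import json

META_ROOT_ID = 9999999  # "Chemical entities" meta-root; must not occupy a tree slot
TREE_SLOTS = ("kingdom", "superclass", "class", "subclass", "direct_parent")


def check_tree(tree_json: str) -> dict:
    """Parse one chemont_tree_json value and verify the canonical layout."""
    tree = json.loads(tree_json)
    if not isinstance(tree, list) or len(tree) != len(TREE_SLOTS):
        raise ValueError(f"expected a {len(TREE_SLOTS)}-element array, got {tree!r}")
    if META_ROOT_ID in tree:
        raise ValueError("meta-root 9999999 found in a tree slot (v2.1 layout)")
    # JSON null becomes None and marks an unresolved slot.
    return dict(zip(TREE_SLOTS, tree))


# Placeholder IDs, not real ChemOnt assignments.
print(check_tree("[1, 2, 3, null, 5]"))
```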
Out-of-tree labels are now exposed. v1 carried only the five-slot hierarchical path. v2 adds a chemont_other_json column that records the non-hierarchical labels ClassyFire reports per molecule, namely intermediate_nodes, alternative_parents, geometric_descriptor, substituents, and mapped_features. Each value holds numeric ChemOnt IDs that can be decoded through the bundled dictionary. Including these labels raises class coverage from 3,798 / 4,824 (78.7 %), counting only labels on the strict hierarchical path, to 4,120 / 4,824 (85.4 %), counting labels anywhere in the classification.
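The two coverage figures correspond to two ways of collecting class IDs from each row. The sketch below illustrates that distinction; it is not the script used to produce the published numbers. It assumes every value inside chemont_other_json is a list of integer IDs and that an empty cell means no out-of-tree labels. The function name is hypothetical.

```python
import json


def class_coverage(rows, n_classes=4824):
    """Return (strict, anywhere) coverage fractions.

    rows: iterable of (chemont_tree_json, chemont_other_json) string pairs.
    """
    strict, anywhere = set(), set()
    for tree_json, other_json in rows:
        # Tree slots may be null (unresolved); skip those.
        tree_ids = {i for i in json.loads(tree_json) if i is not None}
        other = json.loads(other_json) if other_json else {}
        other_ids = {i for ids in other.values() for i in ids}
        strict |= tree_ids
        anywhere |= tree_ids | other_ids
    return len(strict) / n_classes, len(anywhere) / n_classes


# Toy rows with placeholder IDs and a ten-class ontology.
toy = [("[1, 2, 3, null, 4]", '{"substituents": [5, 6]}'),
       ("[1, 2, 3, 7, 8]", "")]
print(class_coverage(toy, n_classes=10))
```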
Headline figures
| Metric | Value |
| --- | --- |
| Unique InChIKeys (rows) | 73,105,281 |
| Rows with PubChem CID | 68,693,298 |
| Rows with ZINC20 ID | 30,187,323 |
| Rows carrying both PubChem CID and ZINC20 ID | 25,775,344 |
| Rows with at least one unresolved tree slot | 119,795 |
| ChemOnt classes seen at least once | 4,120 / 4,824 (85.4 %) |
| ChemOnt classes never assigned to any molecule | 704 / 4,824 (14.6 %) |
Coverage breakdown
Distribution of the 4,824 ChemOnt classes by the number of molecules that carry the class as a label anywhere in the classification (five-slot hierarchical path plus the chemont_other_json fields).
| Bucket | Classes | Share of 4,824 |
| --- | --- | --- |
| 0 rows (class never assigned) | 704 | 14.59 % |
| Exactly 1 row | 74 | 1.53 % |
| ≥ 1 row (covered) | 4,120 | 85.41 % |
| ≥ 10 rows | 3,819 | 79.17 % |
| ≥ 100 rows | 3,258 | 67.54 % |
| ≥ 1,000 rows | 2,422 | 50.21 % |
| ≥ 10,000 rows | 1,489 | 30.87 % |
| ≥ 100,000 rows | 684 | 14.18 % |
| ≥ 1,000,000 rows | 183 | 3.79 % |
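A breakdown of this shape can be derived from a per-class row count. The sketch below assumes such a count is already available as a Counter mapping each ChemOnt ID to the number of rows carrying it; the function name and threshold tuple are hypothetical. Classes that never appear are inferred from the ontology size rather than from the counter.

```python
from collections import Counter

THRESHOLDS = (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000)


def coverage_buckets(rows_per_class: Counter, n_classes: int = 4824) -> dict:
    """Map each bucket label to (number of classes, percent of n_classes)."""
    seen = [n for n in rows_per_class.values() if n > 0]
    table = {
        "0 rows": n_classes - len(seen),
        "exactly 1 row": sum(n == 1 for n in seen),
    }
    for t in THRESHOLDS:
        # Cumulative buckets: a class with 250 rows counts toward >= 1, >= 10, >= 100.
        table[f">= {t:,} rows"] = sum(n >= t for n in seen)
    return {k: (v, round(100 * v / n_classes, 2)) for k, v in table.items()}


# Toy counts for a five-class ontology with placeholder IDs.
print(coverage_buckets(Counter({1: 1, 2: 15, 3: 250}), n_classes=5))
```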
Where the 704 never-seen classes cluster
Most of the unrepresented classes belong to exotic inorganic chemistry or to lanthanide and actinide oxoanionic chemistry, which is not present at scale in PubChem or ZINC20.
| Children at zero coverage | Parent class |
| --- | --- |
| 44 | Actinide oxoanionic compounds |
| 32 | Metalloid oxoanionic compounds |
| 29 | Lanthanide oxoanionic compounds |
| 24 | Organic acids and derivatives |
| 22 | Post-transition metal oxoanionic compounds |
| 14 | Miscellaneous inorganic compounds |
| 14 | Alkaline earth metal oxoanionic compounds |
| 13 | Triterpenoids |
| 12 | Organic oxoanionic compounds |
| 11 | Nucleosides, nucleotides, and analogues |
Call for contributions
If your group has run ClassyFire on compounds that fall into any of the 704 unrepresented classes or into rare classes with ten or fewer examples, please contribute them to the next release. The list of zero-coverage parent groups above and the Coverage breakdown section indicate which classes are most in need of additional evidence.
The most useful contribution is a TSV of (InChIKey, SMILES, ClassyFire JSON response) tuples covering one or more of those classes.
Files in this release
| File | Description |
| --- | --- |
| classyfire_dedup_inchikey_smiles.enriched.tsv.zst | The labelled dataset, zstd-compressed TSV with one row per InChIKey |
| classyfire_dedup_inchikey_smiles.enriched.head20.tsv | First 20 data rows as a plain TSV, a preview of schema and content without downloading the full file |
| chemont_dictionary.tsv | ChemOnt numeric-ID, name, and parent map needed to decode the JSON columns |
Dataset schema
| Column | Description |
| --- | --- |
| inchikey | 27-character standard InChIKey |
| cid | PubChem CID (empty when the compound is not in PubChem) |
| zinc_id | ZINC20 identifier (empty when the compound is not in ZINC) |
| smiles | Representative SMILES |
| chemont_tree_json | Five-element JSON array of numeric ChemOnt IDs for [kingdom, superclass, class, subclass, direct_parent]. null indicates an unresolved slot |
| chemont_other_json | JSON object with optional keys intermediate_nodes, alternative_parents, geometric_descriptor, substituents, mapped_features. Each value holds sorted, unique numeric ChemOnt IDs |
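A row of this schema could be read along the following lines. The sketch assumes the TSV starts with a header line carrying exactly the column names listed above and that fields are written raw, without CSV-style quoting; neither detail is confirmed by this record. The function name is hypothetical, and the sample row uses placeholder ChemOnt IDs.

```python
import csv
import io
import json


def iter_rows(stream):
    """Yield one dict per data row from a text stream of the TSV."""
    # QUOTE_NONE keeps the double quotes inside the JSON columns untouched.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "inchikey": row["inchikey"],
            "cid": row["cid"] or None,          # empty cell: not in PubChem
            "zinc_id": row["zinc_id"] or None,  # empty cell: not in ZINC
            "smiles": row["smiles"],
            "tree": json.loads(row["chemont_tree_json"]),
            "other": json.loads(row["chemont_other_json"] or "{}"),
        }


# In-memory sample standing in for the real file; IDs are placeholders.
sample = io.StringIO(
    "inchikey\tcid\tzinc_id\tsmiles\tchemont_tree_json\tchemont_other_json\n"
    'LFQSCWFLJHTTHZ-UHFFFAOYSA-N\t702\t\tCCO\t[1, 2, 3, null, 5]\t{"substituents": [6, 7]}\n'
)
for record in iter_rows(sample):
    print(record["inchikey"], record["tree"], record["other"])
```

Because the full file is zstd-compressed, one option is to decompress it on the fly with the zstd command-line tool (`zstd -dc classyfire_dedup_inchikey_smiles.enriched.tsv.zst`) and pipe the output into a script that calls `iter_rows(sys.stdin)`, which avoids writing the uncompressed TSV to disk.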
Provider: Zenodo
Created: 2026-05-10



