Synergies between ChemAxon's chemicalize and other open resources to extract structures from patents, discern SAR, and find intersects or similarities in PubChem.

presentation · 9 years ago
by Christopher Southan (ChrisDS Consulting)

Evaluation of chemicalize indicates that it not only complements and benchmarks commercial patent databases but may also provide an alternative for academic groups interested in interrogating only small sections of the medicinal chemistry IP landscape. It is flexible when combined with other open tools and sources, both to identify novel drug discovery compounds and to link them to SAR. For example, patent office metadata queries and keyword searches in freepatentsonline (FPO) or EBI CiteXplore patent abstracts give useful drug target recall via gene names and simple Boolean conditions (e.g. DPPIV OR dipeptidyl peptidase AND inhibitor). Identified patent texts can then be chemicalized in seconds from any espacenet or FPO URL and quickly eyeballed for drug-like exemplars. In problematic cases, additional IUPAC conversions can be accomplished by re-inputting text from Notepad and iterating OCR fixes such as I>1. For recalcitrant cases the OPSIN tool will sometimes succeed or highlight the error point. Relevant exemplar listings, separated from common reagents, can be surfaced in a convenient form, e.g. to allow SDF download from chemicalize followed by uploading to PubChem. In this case, a DPPIV filing by Takeda (US7687625), 68 of the 99 structures had exact matches in PubChem. Aggregating the PubChem patent chemistry sources of Thomson Pharma, IBM and EPO gives just over 5 million CIDs, which intersect with 48 of the 68 from US7687625 by exact match. We can also intersect with ~0.7 million ChEMBL journal-curated compounds, in this case identifying CID 11153488 as an alogliptin analogue published by Takeda. Linking IC50 or Ki SAR results requires inspection of data tables in the PDFs and matching the more potent values to the examples converted by chemicalize. The individual SMILES can also be used to find PubChem exact matches or similarity clusters.
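
The intersection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CID values and source names below are made-up toy data, and `intersect_cids` is a hypothetical helper, not part of chemicalize or PubChem. In practice the CID lists would come from a PubChem upload of the SDF and from the source-specific CID sets.

```python
def intersect_cids(extracted, sources):
    """For each named source, return the sorted CIDs shared with the extracted set."""
    extracted = set(extracted)
    return {name: sorted(extracted & set(cids)) for name, cids in sources.items()}


# Hypothetical toy data: CIDs matched for structures extracted from one patent.
patent_cids = [101, 102, 103, 104, 105]

# Hypothetical toy data: CID sets for aggregated patent sources and for ChEMBL.
sources = {
    "patent_sources": [102, 103, 900],
    "chembl": [103, 105, 901],
}

overlap = intersect_cids(patent_cids, sources)

# CIDs from the patent that appear in none of the listed sources.
all_source_cids = set().union(*(set(cids) for cids in sources.values()))
unmatched = sorted(set(patent_cids) - all_source_cids)

print(overlap)
print(unmatched)
```

Keeping the per-source overlaps separate, rather than merging them, makes it easy to see which structures are covered only by patent sources and which also appear in the journal-curated set.
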
The self-archiving feature of chemicalize has useful circularity for similarity detection as more patents and papers get extracted by more users. In combination with PubChem 2D or 3D clustering this allows "walking" between patent series. Chemical similarity alerts can be set via MyNCBI and patent search alerts at FPO, and automated gene name-to-patent matches from EPO are also available via Twitter. Thus, newly published patents with good IUPAC names in any full-text source can be chemicalized, checked against PubChem in minutes and, on a good day, linked to SAR data. By definition, new structures should be PubChem-negative (i.e. novel). Some may eventually enter PubChem via commercial source feeds, but they can be archived locally using JChem for Excel for similarity checking against any source. Chemicalize also proved useful for extracting structures from abstracts and full-text articles. An example of the former was a recent Takeda DPPIV abstract (PMID: 21764322), from which the lead structure was chemicalized and matched to CID 10271081, the PDB structure and a Takeda patent via the ChemSpider-to-SureChem link. An open full-text PubMed Central review on DPPIV inhibitors (PMID: 21847463) provided context to this area, and chemicalize recognized structures for four approved gliptin anti-diabetic drugs.
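
As a rough illustration of similarity checking against a local archive, the sketch below ranks archived structures by Tanimoto coefficient. It is not how JChem for Excel or PubChem compute similarity: the fingerprints are assumed to be plain sets of on-bit indices produced by some external tool, and the entry names, bit values and `similar_entries` helper are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treat as no similarity
    return len(a & b) / len(union)


def similar_entries(query, archive, threshold=0.7):
    """Return (name, score) pairs at or above threshold, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in archive.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy archive of previously extracted structures.
archive = {
    "example_A": {1, 2, 3, 4},
    "example_B": {1, 2, 3, 9},
    "example_C": {7, 8},
}

query_fp = {1, 2, 3, 4}
matches = similar_entries(query_fp, archive, threshold=0.5)
print(matches)
```

The threshold is a free choice; a lower value surfaces more distant analogues at the cost of more hits to inspect by eye.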
