Finding answers from chemical space extremely fast
The complex nature of chemical graphs offers drug designers an immense source of variability for tackling optimization challenges along the project pathway towards candidates. The difficulty lies in exploring chemical space, whether through the chemical intuition of medicinal chemists or through enabling technologies such as cheminformatics tools. Real and virtual chemical spaces span a broad range of compound numbers and hold vast potential to be exploited. An especially valuable sub-group consists of compounds for which measured data exist and are stored, most commonly in relational databases. In our study both types were evaluated: a very large compound collection and a medium-sized collection with extensive assay data. As a read-out we used the cost associated with finding an answer to a chemical question, namely the search time. In the first use case, the aim was to suggest novel analogues of known drugs using the largest publicly available enumerated compound collection, GDB-13, which contains 977M unique entries. This collection was screened with an ultra-fast similarity search technique, using a subset of marketed drugs as queries; an elapsed search time of ~4 sec was measured consistently on a commercially available server (EC2, r3.8xlarge) using a standard 1k fingerprint. The top 100 most similar compounds were cross-filtered against the database of exemplified structures from patents (SureChEMBL DB) to retrieve novel moieties that are more likely to lie in freedom-to-operate space (Fig. 1). In the second part, search performance on the entire data set from ChEMBL DB was measured with three search types (duplicate, similarity and substructure) and with joined queries. These joined queries represent the complex questions asked of data warehouses in the pharmaceutical industry, where performance is a key indicator because of the massive load. The aim is to provide realistic speed statistics measured with a chemical cartridge extending Oracle and with the new-generation engine running on PostgreSQL.
A significant speed-up was measured with the new search engine, especially on combined queries, where a 100x speed-up was achieved and the median search time was in the range of ~100 milliseconds, falling below the recognition time limit.
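The similarity screen of the first use case (rank a collection against a query fingerprint and keep the top 100) can be illustrated with a minimal sketch. It assumes Tanimoto similarity on bit fingerprints stored as Python integers and a plain linear scan; the abstract does not name the similarity metric or the indexing used by the search engine, so this shows only the ranking idea, not the ultra-fast technique itself. The names `tanimoto`, `top_k_similar` and the toy identifiers are hypothetical.

```python
import heapq


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer fingerprint."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: shared bits divided by bits set in either."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def top_k_similar(query: int, library, k: int = 100):
    """Return the k best (similarity, compound_id) pairs, best first.

    `library` is any iterable of (compound_id, fingerprint) pairs.
    This is a brute-force scan, assumed here purely for illustration.
    """
    scored = ((tanimoto(query, fp), cid) for cid, fp in library)
    return heapq.nlargest(k, scored)


# Toy 8-bit fingerprints standing in for 1k-bit ones (hypothetical data).
library = [("mol_a", 0b00001111), ("mol_b", 0b00000011), ("mol_c", 0b11000000)]
hits = top_k_similar(0b00000111, library, k=2)
for score, cid in hits:
    print(f"{cid}: {score:.2f}")
```

On a real collection the fingerprints would come from a cheminformatics toolkit, and the scan would be replaced by whatever index the search engine provides.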
Figure 1. Example drug and its novel analogues identified from GDB-13. A Tversky dissimilarity >0 rules out a substructure match in SureChEMBL.
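One way to read the caption's criterion is sketched below. It assumes the asymmetric Tversky index with weights alpha = 1 and beta = 0 and a fingerprint in which every bit of a substructure is also set in its superstructure; neither assumption is stated in the abstract. Under these assumptions the index equals 1 (dissimilarity 0) exactly when all bits of the candidate are present in the patent compound, which is a necessary condition for a substructure match, so any dissimilarity above 0 excludes one. The function names and toy fingerprints are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer fingerprint."""
    return bin(x).count("1")


def tversky(query: int, target: int, alpha: float = 1.0, beta: float = 0.0) -> float:
    """Tversky index between two bit fingerprints.

    alpha weights bits found only in the query, beta bits found only in
    the target. alpha=1, beta=0 is an assumption made for this sketch.
    """
    common = popcount(query & target)
    only_query = popcount(query & ~target)
    only_target = popcount(target & ~query)
    denom = common + alpha * only_query + beta * only_target
    return common / denom if denom else 0.0


def is_novel(candidate: int, patent_fps) -> bool:
    """True if the candidate has Tversky dissimilarity > 0 to every patent
    fingerprint, i.e. it cannot be a substructure of any of them."""
    return all(1.0 - tversky(candidate, fp) > 0.0 for fp in patent_fps)


# Toy fingerprints (hypothetical): the first candidate's bits are all
# contained in the first patent fingerprint, the second candidate's are not.
patents = [0b1110, 0b0001]
print(is_novel(0b0110, patents))  # bits fully covered by 0b1110
print(is_novel(0b0111, patents))  # not fully covered by either
```

Because fingerprint containment is necessary but not sufficient for a substructure match, a dissimilarity of 0 only flags a possible match that would still need an atom-by-atom check.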