Clustering and diversity analysis for chemical libraries
JKlustor is a suite of clustering methods that performs similarity-based and structure-based clustering of compound libraries and focused sets, in both hierarchical and non-hierarchical fashion. The suite also carries out diversity calculations and library comparisons based on molecular fingerprints and other descriptors. It is well suited to combinatorial chemistry, virtual library design, and other areas where large numbers of compounds are analyzed.
Similarity-based clustering
Hierarchical method: Ward’s minimum variance method, accelerated with Murtagh’s reciprocal nearest neighbor algorithm, creates tight and well-separated clusters. It is recommended for smaller data sets, such as focused libraries with fewer than 100,000 structures.
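As an illustration of the minimum variance criterion, the following is a minimal Python sketch, not JKlustor's implementation. It assumes each structure has already been reduced to a numeric descriptor vector, and it uses a plain exhaustive search for the cheapest merge rather than the reciprocal nearest neighbor speed-up. The names `ward_cost` and `ward_cluster` are hypothetical.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge.

    Each cluster is a tuple (size, centroid, member_indices).
    """
    na, ca, _ = a
    nb, cb, _ = b
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * squared_distance


def ward_cluster(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain.

    Assumption: points are numeric descriptor vectors of equal length.
    Returns a list of sorted member-index lists.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [(1, tuple(p), [i]) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        # Exhaustive search for the pair whose merge adds the least variance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        na, ca, ma = clusters[i]
        nb, cb, mb = clusters[j]
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        merged = (na + nb, centroid, ma + mb)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for _, _, members in clusters]


if __name__ == "__main__":
    descriptors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
    print(ward_cluster(descriptors, 3))
```

The exhaustive search repeats a full pairwise scan at every merge, which is one reason a nearest-neighbor-based acceleration matters for anything beyond small sets.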
Non-hierarchical methods: Sphere Exclusion clustering is based on fingerprints and/or other numerical data. It can easily cope with millions of structures and is suitable for diverse subset selection. The K-means cluster analysis method aims to find the centers of natural clusters in the input data in a way that minimizes the variance within each cluster. Finally, the Jarvis-Patrick (Jarp) method uses a nearest neighbor approach and performs variable-length clustering of chemical databases containing hundreds of thousands of structures.
Structure-based clustering
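The sphere exclusion idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the JKlustor algorithm: fingerprints are represented as Python sets of on-bit indices, similarity is the Tanimoto coefficient, and the next center is simply the first compound still unassigned in input order. The names `tanimoto` and `sphere_exclusion` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def sphere_exclusion(fingerprints, threshold):
    """Return {center_index: [member_indices]}.

    Assumption: threshold is a Tanimoto similarity in [0, 1]. Every compound
    at least that similar to the current center falls inside its sphere and
    is excluded from later rounds.
    """
    remaining = list(range(len(fingerprints)))
    clusters = {}
    while remaining:
        center = remaining[0]
        members = [
            i for i in remaining
            if tanimoto(fingerprints[center], fingerprints[i]) >= threshold
        ]
        clusters[center] = members
        taken = set(members)
        remaining = [i for i in remaining if i not in taken]
    return clusters


if __name__ == "__main__":
    fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
    result = sphere_exclusion(fps, threshold=0.5)
    print(result)        # clusters keyed by center index
    print(list(result))  # the centers alone form a diverse subset
```

Each compound is compared only against the current center, so the work grows with the number of compounds times the number of clusters rather than with all pairs, which is consistent with the method's suitability for very large sets.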
Hierarchical methods: LibraryMCS identifies the largest substructure shared by several molecular structures. It presents the clusters hierarchically as dendrograms and also offers alternative tree and table views. MCS profiling helps scientists explore screening results to quickly identify novel scaffolds and new examples of active compound families. The hierarchical SAR table enables viewing of clusters and associated non-structural data. R-group decomposition can also be performed using the MCS as the core structure for each cluster.
Non-hierarchical method: JKlustor can cluster compounds by their pre-generated Bemis-Murcko framework structures, which provides a convenient and quick way to analyze large databases with millions of compounds.
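Once frameworks have been generated, grouping by framework amounts to a single pass over the data. The sketch below assumes each compound arrives as a pair of an identifier and a pre-generated framework written as a canonical string, so that identical frameworks compare equal. The function name `cluster_by_framework` and the compound identifiers are hypothetical, and this is not the JKlustor API.

```python
from collections import defaultdict


def cluster_by_framework(records):
    """Group compound IDs by their pre-generated framework string.

    Assumption: records is an iterable of (compound_id, framework) pairs and
    each framework is already canonical. Returns (framework, [ids]) pairs,
    largest cluster first, ties broken by framework string.
    """
    clusters = defaultdict(list)
    for compound_id, framework in records:
        clusters[framework].append(compound_id)
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    # Hypothetical IDs; frameworks written as SMILES-like strings.
    records = [
        ("cpd-001", "c1ccccc1"),
        ("cpd-002", "c1ccc(cc1)-c1ccccc1"),
        ("cpd-003", "c1ccccc1"),
        ("cpd-004", "c1ccncc1"),
        ("cpd-005", "c1ccccc1"),
    ]
    for framework, ids in cluster_by_framework(records):
        print(framework, ids)
```

Because each record is visited once and no pairwise comparison is needed, this kind of grouping scales linearly with the number of compounds.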
Extending capabilities with descriptors and command line tools
JKlustor can use ChemAxon’s proprietary chemical and pharmacophore fingerprint technology, as well as other user-defined descriptors such as BCUT. Predicted or measured physicochemical properties (such as pKa or logP) can also be used in clustering.
Command line tools in JKlustor:
- Diversity analysis - the Compr command line tool generates different types of similarity or dissimilarity comparisons within a dataset (see also Diversity Set Selection).
- GenerateMD - generates molecular descriptors for molecular structures.
- Jarp - clustering by a modified Jarvis-Patrick method.
- Ward - clustering by Ward's hierarchical clustering method using the reciprocal nearest neighbor (RNN) approach.
- LibMCS - Maximum Common Substructure based hierarchical clustering.
- CreateView - composes a new SDfile from an SDfile and a data table, which is useful for viewing the clustered results.
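To make the idea of dissimilarity comparisons within a dataset concrete, here is a small Python sketch. It does not reproduce Compr or its options. It assumes fingerprints are available as sets of on-bit indices and reports three common summary figures based on Tanimoto dissimilarity; the function names and the choice of statistics are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def diversity_summary(fingerprints):
    """Summarize pairwise Tanimoto dissimilarity (1 - similarity) in one set.

    Returns the mean and minimum over all pairs, plus the mean distance from
    each compound to its nearest neighbor in the same set.
    """
    n = len(fingerprints)
    if n < 2:
        raise ValueError("need at least two fingerprints")
    dissimilarity = {
        (i, j): 1.0 - tanimoto(fingerprints[i], fingerprints[j])
        for i, j in combinations(range(n), 2)
    }
    nearest = [
        min(dissimilarity[(min(i, j), max(i, j))] for j in range(n) if j != i)
        for i in range(n)
    ]
    return {
        "mean_pairwise": sum(dissimilarity.values()) / len(dissimilarity),
        "min_pairwise": min(dissimilarity.values()),
        "mean_nearest_neighbor": sum(nearest) / n,
    }


if __name__ == "__main__":
    fps = [{1, 2}, {1, 2}, {3, 4}]
    print(diversity_summary(fps))
```

A low mean nearest-neighbor dissimilarity suggests many close analogs in the set, while a high mean pairwise dissimilarity suggests broad coverage. Note that the all-pairs table grows quadratically with the number of compounds.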
JKlustor tools can be called from the command line or through the JChem API. JKlustor runs on many operating systems and can integrate with many database engines. Full Java and .NET integration is supported, as are connections to Oracle, MySQL, MS SQL Server, DB2, PostgreSQL, Access, and other databases. The LibMCS component comes with a standalone GUI that allows users to browse large sets of data. Furthermore, maximum common edge subgraph (MCES) and maximum common substructure (MCS) clustering methods are also available as ChemAxon components for both the KNIME and Pipeline Pilot workflow management systems.