Use of dual-filtering to create training sets leading to improved accuracy in quantitative structure-retention relationships modelling for hydrophilic interaction liquid chromatographic systems
Developing quantitative structure-retention relationships (QSRR) that are accurate enough to support high performance liquid chromatography (HPLC) method development remains a major challenge. To address it, this study presents a novel QSRR methodology for selecting the training set of compounds used in QSRR modelling (i.e. filtering the database to identify the most appropriate compounds for the training set). The selection is based on a dual-filtering strategy that combines Tanimoto similarity (TS) searching as the primary filter with retention time (tR) similarity clustering as the secondary filter, applied to a database of pharmaceutical compound retention times collected over a wide range of hydrophilic interaction liquid chromatography (HILIC) systems. To enable tR similarity filtering, correlation to a molecular descriptor is used as a measure of retention time. To model the retention time of a compound, a relationship between experimental chromatographic data and various molecular descriptors is calculated using genetic algorithm-partial least squares (GA-PLS) regression. The proposed dual-filtering-based QSRR model significantly improves retention time predictability compared with the diverse, global, and TS-based QSRR models, giving an average root mean square error in prediction (RMSEP) of 11.01% over five different HILIC stationary phases. The average CPU time for implementing the proposed approach is less than 10 min, which makes it well suited to rapid method development in HILIC. In addition, interpretation of the molecular descriptors selected by this approach provides some insight into the HILIC mechanism.
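The dual-filtering idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Compound` record, the `dual_filter` helper, the thresholds `ts_min` and `proxy_tol`, and the example values are all hypothetical. Fingerprints are represented simply as sets of on-bit indices, and the secondary tR similarity clustering is simplified here to a tolerance window on a single descriptor assumed to correlate with retention time.

```python
from typing import FrozenSet, List, NamedTuple


class Compound(NamedTuple):
    name: str
    fingerprint: FrozenSet[int]  # indices of the "on" bits of a binary fingerprint
    proxy: float                 # descriptor assumed to correlate with tR (hypothetical)


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dual_filter(target: Compound, database: List[Compound],
                ts_min: float = 0.5, proxy_tol: float = 1.0) -> List[Compound]:
    """Select a training set for `target` (thresholds are illustrative assumptions)."""
    # Primary filter: keep compounds structurally similar to the target.
    primary = [c for c in database
               if tanimoto(target.fingerprint, c.fingerprint) >= ts_min]
    # Secondary filter: keep compounds whose tR proxy is close to the target's.
    # Assumption: a simple window stands in for tR similarity clustering.
    return [c for c in primary if abs(c.proxy - target.proxy) <= proxy_tol]


# Made-up example data, for illustration only.
target = Compound("target", frozenset({1, 2, 3, 4}), 2.0)
database = [
    Compound("a", frozenset({1, 2, 3, 5}), 2.4),     # similar structure, similar proxy
    Compound("b", frozenset({1, 2, 3, 4, 6}), 5.0),  # similar structure, distant proxy
    Compound("c", frozenset({7, 8}), 2.1),           # dissimilar structure
]
training_set = dual_filter(target, database)
print([c.name for c in training_set])
```

In this toy example only compound `a` passes both filters; `b` is rejected by the secondary filter and `c` by the primary one. A regression model such as GA-PLS would then be fitted on the surviving compounds only.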