Cloudy with a touch of Cheminformatics
Remote hosting (a.k.a. cloud computing) has become a popular topic in recent years, with many vendors providing cloud-based services. The cloud enables the use of massive, on-demand distributed computing resources. While one computational paradigm simply uses these resources as multiple independent nodes hosting identical copies of the same program, an alternative approach uses them to support parallel computation. One of the technologies that enables the latter approach is Hadoop, an open-source implementation of the map/reduce framework. In this talk I will first provide an overview of the Hadoop ecosystem, covering HDFS, Pig and HBase, and then discuss how the JChem library is integrated with Hadoop to support distributed cheminformatics processing over arbitrarily large data sets. Examples will range from simple SMARTS matching and simple descriptors to bioisostere analysis. The talk will also discuss some of the more fundamental bottlenecks of map/reduce with respect to cheminformatics applications and consider possible "big-data" scenarios in cheminformatics that could fully utilize the map/reduce paradigm.
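To make the map/reduce idea concrete for the SMARTS matching case, here is a minimal single-process sketch in Python. It is not the Hadoop or JChem integration described in the talk: the names `toy_matches`, `map_phase`, `reduce_phase` and `run_job` are hypothetical, the input format (one SMILES string followed by a name per line) is an assumption, and the matcher is a plain substring test standing in for a real SMARTS engine. It only illustrates how a mapper can emit one (pattern, 1) pair per hit and a reducer can sum the hits per pattern.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def toy_matches(pattern: str, smiles: str) -> bool:
    """Hypothetical stand-in for a SMARTS matcher.

    This is a literal substring test on the SMILES text, NOT real
    substructure matching; a real job would call a cheminformatics toolkit.
    """
    return pattern in smiles


def map_phase(record: str, patterns: list[str]) -> Iterator[tuple[str, int]]:
    """Mapper: emit (pattern, 1) for every pattern found in one molecule."""
    fields = record.split()
    if not fields:  # skip blank input lines
        return
    smiles = fields[0]  # assumed layout: "<SMILES> <name>"
    for pattern in patterns:
        if toy_matches(pattern, smiles):
            yield pattern, 1


def reduce_phase(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reducer: total the hit counts collected for one pattern."""
    return key, sum(values)


def run_job(records: Iterable[str], patterns: list[str]) -> dict[str, int]:
    """Simulate the shuffle step in memory: group mapper output by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in map_phase(record, patterns):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    molecules = ["CCO ethanol", "CC(=O)O acetic_acid", "c1ccccc1 benzene"]
    print(run_job(molecules, ["C(=O)O", "CC", "N"]))
```

Because each molecule is processed independently in the map step, the records can be split across nodes without coordination; only the grouping by pattern requires moving data between them.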