ChemAxon European User Group Meeting (UGM), Budapest, May 19-21, 2014
In my September 2013 conclusion I said that ChemAxon was probably the market leader in mainstream chemical structure handling; we have now nearly reached the time when the word “probably” can be removed from that statement. Although there are companies that can compete with parts of the ChemAxon portfolio, and some of them perhaps perform better in a certain niche, there is no single company that offers the same breadth of solutions or equal value for money. ChemAxon is still small enough to turn around fast in response to changes in user demand (like the agile, small motor boat that one speaker mentioned), yet it is now large enough to command credibility and promise long-term sustainability and financial stability. I have previously mentioned the dangers of undertaking a project as ambitious as that of Plexus, but ChemAxon is responding to new competitive forces and knows that users now want not just a tool kit but off-the-shelf software, and solutions aimed at the benchtop end-user. Partnership has been a very successful play and the ChemAxon consultancy group has paid dividends. Cloud computing and hosting are big news now and ChemAxon has not failed to notice. I am pleased that I have already been invited to attend one of next year’s meetings: pleased not just because I find the meetings very useful and enjoyable, with lots of networking opportunities, but also because it is fascinating to watch the changes in this company over the years. Long may it continue its climb up the S-curve and long may I be the rapporteur!
The meeting was held at the Novotel Centrum in Budapest, as in 2012. About 70 users and partners gathered plus nearly 120 ChemAxon attendees. The ChemAxon team wore T-shirts indicating their business line roles, for example, “Markush”, or “Core” or “Plexus Suite”, to enable users to locate people with the right expertise to answer specific questions. This time there were no pre-meeting workshops but there was the usual one-to-one session for users to have personal meetings with ChemAxon staff, followed by the traditional garden party close to ChemAxon’s HQ in the Graphisoft Park. The next night the gala dinner was held in the “Ice Rink Hall” in City Park. In reality this was an outdoor buffet on a warm summer night, beside a lake with a picture postcard view of a floodlit castle. A few attendees even paddled boats on the lake. The next day, after the meeting ended, attendees still in town were treated to a two-hour informal walking tour in Pest, followed by food and drinks outdoors in a Budapest “ruin pub”. Photographs from these events are available on the ChemAxon web site event gallery.
Continuing a recent tradition, the UGM proper began not with a keynote address but with Alex Drijver, CEO of ChemAxon, interviewing Ian Berry of Evotec as follows.
Alex: We serve the life sciences and pharmaceuticals industries and our opening talk is usually from a senior customer in such an industry, but this year we have chosen someone from a CRO because so much R&D is now done outside the pharmaceutical industry. CROs are an important market for ChemAxon, so the keynote speaker this year is Ian Berry of Evotec.
Ian: Evotec is a CRO in chemistry, biological screening, assays, structural biology, DMPK, compound management and proteomics. The company employs 600 people. Its chemistry department, with 130 chemists, is reportedly the largest chemistry department in one place in the UK. The company has grown organically as well as through mergers and acquisitions. Unfortunately we closed our Indian operation last year for various reasons, including the fact that the majority of our customers wanted their work done in Europe or the US.
Alex: Must a CRO use the same informatics systems as its customer, for example, the same ELN? Is this a challenge?
Ian: It is not actually such a big deal. Some customers require us to use their systems and some use ours. We have an “Evotec process” that we like to use for as many customers as possible to provide cost saving and efficiency benefits. In the end, as a service provider, we deliver in whatever way the customer wants (but always hope to have some influence on their choices!). The 80/20 rule applies here: we are generally able to use optimal processes for 80% of our clients and have flexible import and export systems to handle the data from and to the clients. We use KNIME, for example, to link our way to theirs, to glue applications together and translate data between formats.
Alex: If you are into externalisation and collaboration, not just synthesis, what tools do you use?
Ian: This is a very interesting topic. What is externalisation? What is collaboration? Is it a shared view onto the data? Is it Marvin for meetings (shared structure drawing)? Is it the Schrödinger Live platform? Or more simply, WebEx or GoToMeeting? We often share desktops with clients through WebEx. In the end, Evotec essentially sells data. We can send SDfiles by (secure) email or post them on a file sharing platform (e.g., Project Place). In future we might provide a suite of Web Services to allow clients to consume or transfer their data securely onto their own servers from ours. Is this collaboration? In fee for service agreements we provide data, so there is little “collaboration”. In other cases, the customer has a project manager and scientists embedded in our team (“true collaboration”?). In a third kind of arrangement, the customers’ scientists work in an Evotec laboratory: they drive on our platform (in-sourcing, again, true collaboration). There are many ways to skin a cat.
Alex: Compound management as a service or ChemAxon registration: what is the way forward?
Ian: More and more will be done in the cloud. Collaborative Drug Discovery is using ChemAxon on the back-end as is Arxspan. This is a great idea (maybe in 10 years’ time) to reduce the estate. The problem is that if just one of our clients does not trust the cloud we would also need an internal solution. The biggest pharmas have their own systems, so Evotec can put stuff into those systems. Data exchange formats between companies are what matters.
Alex: You should join the Pistoia Alliance.
Ian: They are doing some great work in this area, but there is still a lot to do.
Alex: Should we mimic the way that scientists use social media, mobile etc.? In three to five years’ time what hardware and software will we be using?
Ian: Hardware is short lived in laboratories but we could use tablets for the ELN etc. There is a research lab in Cambridge that has interactive fume hoods and some systems use mobiles to send images and data directly to an ELN.
Alex: How about voice recognition?
Ian: Have you tried a strong Scottish accent on your voice recognition system recently? Speech recognition has potential but it is not there yet.
Alex: Thank you for being a ChemAxon customer and for agreeing to be interviewed.
Ian: It will be 10 years next year.
Five of the ChemAxon team opened the meeting proper with an overview of the evolution of ChemAxon products and some useful context. Tim Aitken began. ChemAxon was founded in 1998 and now has 140 staff, 75% of them in development. The company produces industry-leading chemical data management tools. It is known for trusted partnerships with customers, and is seen as an innovator, with new products and three releases every year.
After drawing comes loading, using JChem for Excel, JChem for Office, IJC, or consultancy services. All the main file formats are supported in all ChemAxon products. Journals, patents and reports can be made searchable by extracting structures from names or images in documents. Structures and metadata are handled in MarvinView, JChem for Excel, IJC and Plexus Discovery. An indexed document archive is made with Document to Database and it can be used with IJC and JChem for SharePoint. New features in Name to Structure are described later.
Petr Hamernik outlined ChemAxon’s technologies for storage (and correction) of data and indicated which talks later in the meeting would be relevant. The technologies concerned are IJC, Compound Registration, Structure Checker and Standardizer, and JChemBase and JChem Cartridge. Related to these are Plexus Suite and Biomolecule Registration Toolkit. ChemAxon welcomes feedback from users on the new challenges posed by big data, and parallel and cloud computing.
Recent developments in Structure Checker include Markush structure checking, and checkers for reacting centre bond mark, stereo inversion retention mark, and double bond stereo. Highlights in Compound Registration are an improved user interface (with Marvin JS as the default editor), easy deployment, and customisation. Integration can be done with Web Services (e.g., with an ELN) and with IJC. Microsoft SQL Server 2005 and newer versions are now supported in IJC. Petr gave a demonstration of pick lists for adding to the data in a spreadsheet in IJC, and adding new columns.
Miklós Szabó talked about analysis and refinement of data. The drug discovery workflow is iterative rather than linear and collaborations are commonplace nowadays. ChemAxon is a partner in Lilly’s Open Innovation Drug Discovery programme, OIDD, European Lead Factory, and AstraZeneca Open Innovation. AstraZeneca has very recently launched a molecule profiling service.
ChemAxon’s technology for detecting overlap of chemical libraries in IJC (which could be available later this year) is exceedingly fast even without special hardware. The water solubility plugin is integrated into all ChemAxon tools. The Compliance Checker for controlled substances allows you to check whether your compounds are controlled according to the relevant laws of the countries of interest.
Another opportunity for collaboration is using chemicalize.org. According to a recent study the newest patents can be searched successfully on the chemicalize.org page and database (Southan, C.; Strácz, A. Extracting and connecting chemical structures from text sources using chemicalize.org. J Cheminf. 2013, 5, 20). IJC and Plexus Suite have multiple uses in medicinal chemistry projects. Text search highlighting and visualisation of relational data from multiple tables are new features of IJC. ChemAxon is also developing a Spotfire plugin. JChem for Office is another tool for collaboration.
Metabolizer will have a new validation set and a new library in version 6.4, and there will be performance improvements. ChemAxon has published recent work on 3D library design (Kalászi, A.; Szisz, D.; Imre, G.; Polgár, T. Screen3D: A Novel Fully Flexible High-Throughput Shape-Similarity Search Method. J. Chem. Inf. Model. 2014, 54, 1036–1049).
Lastly, Iván Solt talked about reporting and sharing using Plexus Suite, JChem for SharePoint, JChem for Office, MarvinSketch, IJC and Naming, which have links to each other and to compound registration and other databases. New features in IJC for reporting from a chemical database include enhanced charting, a palette of widgets, and rich text and HTML support. JChem for Office allows import from IJC, and databases, or by ID, leading to SAR analysis in Microsoft Excel, and reporting in other Office products. Naming integration, rich formatting options with Marvin, and use of publication formats allow chemistry transfer between Office applications. Marvin and Naming are used with patent documents. IJC and Plexus Suite allow data sharing in large organisations. JChem for SharePoint enhances collaboration between pharma and CRO, and knowledge sharing.
András Strácz of ChemAxon gave us an introduction to Plexus Discovery. This solution carries the library design functionality of Plexus Suite: database management, chemical searching, library enumeration, physicochemical property calculation, virtual synthesis and similarity searching. Scaffold enumeration allows the chemist to see what can be made with a certain scaffold and selected ligands. It is easy to use with an innovative editor and interactive preview. The building blocks collection allows ligand sets to be used as R-groups. Structure Checker is incorporated in version 6.3. Reaction enumeration tells the chemist how selected reagents can be transformed. It is an elegant new way to access Reactor, with simple reagent and reaction selection, a building blocks collection, interactive preview, and a built-in reaction library. Property calculation, customisable filters, JChem Search and Marvin JS are incorporated in Plexus Discovery. Similarity search gives access to structural similarity (with ECFP fingerprints), pharmacophore similarity, and single-click close analogue searching and scaffold hopping. Clustering uses the same fingerprints. (Presentation in our Library)
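Fingerprint-based similarity of the kind mentioned above is commonly scored with the Tanimoto coefficient. The following is a minimal Python sketch of that general technique, not ChemAxon's implementation: it assumes each fingerprint has already been reduced to a set of on-bit indices, and the bit values and compound names are invented for illustration.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query, library, threshold=0.0):
    """Return (name, score) pairs with score >= threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical on-bit sets standing in for real ECFP fingerprints.
query_fp = {1, 4, 7, 9}
library_fps = {
    "cmpd_A": {1, 4, 7, 9},
    "cmpd_B": {1, 4, 8},
    "cmpd_C": {2, 3, 5},
}
hits = rank_by_similarity(query_fp, library_fps, threshold=0.3)
```

With a threshold of 0.3 the unrelated compound drops out and the identical one ranks first; a close analogue search would simply use a higher threshold.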
Tim Aitken of ChemAxon continued with details of the Plexus vision. ChemAxon provides a wide range of tools, and customers keep asking for more features. Delivering all these functions in one application leads to very flexible, extensible software but can be daunting to a new user. So ChemAxon is working to understand its users better, and deliver focused modules in one framework, the Plexus Suite. This is a single, intuitive, user interface, with easy access to data and analysis techniques, enabling scientists to access data from a variety of disparate sources and easily generate and share reports. It is built on IJC, which maintains user security, project privileges and forms. The first version focuses on medicinal chemistry and library design. The following features are available:
- Opening database forms defined in IJC
- Viewing form data as relational spreadsheet views
- Uploading and enumerating libraries
- Adding searched data and calculations
- Dynamic, intuitive querying on forms and spreadsheets
- Seeing linked items (in the same data tables)
- Saving and sharing updated hit lists and queries
- One-click linked charting
- Querying of multiple data sources from spreadsheet structures, showing forms containing related data, and adding multiple fields to the sheet
- Export to Microsoft Office.
Context-sensitive menus and an action bar increase intuitiveness and give a cleaner user interface, with scope for offering assay-specific functions to biologists, computational chemistry functions to computational chemists, synthesis to synthetic chemists, and so on, in future.
Tim presented a diagram of the concept. Currently, the functions shown in blue boxes in the diagram are achieved through the IJC Web client; those in red boxes are achieved in Plexus Discovery. In the new product an IJC thick client is used to define data sources and users; Plexus Suite data is stored in IJC and JChem, and common browsers are used to access the Plexus user interface. Version 1 should be released by the third quarter of 2014; early versions are promised from May onwards. (Presentation in our Library)
Roland Knispel of ChemAxon said that support for larger biological molecules is being added with the Hierarchical Editing Language for Macromolecules, HELM (Zhang, T.; Li, H.; Xi, H.; Stanton, R. V.; Rotstein, S. H. HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation. J. Chem. Inf. Model. 2012, 52, 2796–2806). This is a work in progress but it has been shown that a chemically defined full IgG antibody can be registered faster than you can blink. In one use case, macrocyclic peptides, there has been a proof of concept for registration of the sequence and chemical structure, unambiguous representation of the peptide sequence, and the ability to enumerate a library quickly with residue replacements chosen from a large set of peptide building blocks. A shippable preview version is available for a template-based biomolecule registration toolkit, compatible with the Open HELM standard for representing large molecules, with a Web services API for integration.
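The library enumeration with residue replacements described above can be illustrated with a short Python sketch. It is not the Biomolecule Registration Toolkit API: the function names, the parent sequence and the replacement sets are hypothetical, only single-letter natural amino acid codes are assumed, and the output is a linear HELM simple-polymer string with the trailing sections left empty (the connection needed for a macrocycle is not modelled).

```python
from itertools import product


def to_helm(residues, polymer_id="PEPTIDE1"):
    """Format a linear peptide as a HELM simple-polymer string.

    The four trailing '$' delimit further sections that are left empty here.
    """
    body = ".".join(residues)
    return polymer_id + "{" + body + "}$$$$"


def enumerate_peptides(parent, replacements):
    """Yield HELM strings for every combination of residue replacements.

    parent: list of residue codes.
    replacements: dict mapping a 0-based position to the residues allowed
    there; positions not listed keep the parent residue.
    """
    choices = [replacements.get(i, [res]) for i, res in enumerate(parent)]
    for combo in product(*choices):
        yield to_helm(combo)


# Hypothetical parent sequence and building blocks.
library = list(enumerate_peptides(["A", "G", "C"], {1: ["G", "L"], 2: ["C", "S"]}))
```

The library size is the product of the number of choices at each position, so large building-block sets grow the enumeration quickly; a generator keeps memory use flat until the results are materialised.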
Tim Dudgeon of ChemAxon outlined the new Compliance Checker product. In April 2013 the Pistoia Alliance agreed to a proposal by ChemAxon and Patcore to develop Compliance Checker. In March 2014 a dataset of over 400 chemical structures to challenge the system was approved. The dataset contained both controlled and non-controlled structures across a range of chemical classes and legislative schedules. A compliance checker system has been marketed in Japan as CRAIS Checker since 2006 and is currently used by more than 30 companies. It is powered by JChem’s Markush technology, can be integrated easily, and is supported worldwide by ChemAxon and Patcore. It allows you to check whether what you already hold, or plan to synthesise, buy or ship is legal in the relevant countries, even before new legislation becomes active. The Compliance Checker server has Web and Windows clients, and SOAP and command line interfaces.
János Papdeák of ChemAxon gave a case history. A customer wanted to use Marvin JS on a touch screen. The customer did not want the touch version to be different from the desktop version but eventually it had to differ. János pointed out that what the customer wants is not always what the customer needs, and the requirements can change during the development. Fortunately, in most cases no code is needed for a good minimum viable product (MVP) to validate new ideas. János also concluded that being wrong is as important as being right, because constructive discussions lead to an optimal and creative solution. ChemAxon invites feedback on the user interface. (Presentation in our Library)
Árpád Figyelmesi and Daniel Bonniot de Ruisselet of ChemAxon gave a presentation about this. In computer-assisted data extraction, English, Chinese and Japanese Name to Structure (N2S) speeds up the extraction process. Support for Japanese was added this year in Marvin 6.3. The Markush Editor helps to draw complex Markush structures. Structure Checker and Markush validation guarantee the high quality of extracted information. Markush representation, search and enumeration complete the package.
Árpád and Daniel measured overlap between English and Chinese patents using different data sources and tools. They presented these results:
The patents from the two OCR columns come from an (undisclosed) patent data provider. In the first of those columns ChemAxon's OCR error correction is disabled, in the second column it is enabled. In the third column are patents provided by SIPO (the Chinese patent office); these are fewer in number than the other dataset. For the first line of results, the translation to English is done by Google; in the second line ChemAxon technology is used. Given the structures found in the equivalent US patent of the same family, the percentages represent the ratio of structures that were found in the Chinese patent using the given method: the higher the percentage the better.
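The percentage described above is a recovery ratio: the share of structures in the reference (US) patent that a given method also finds in the Chinese family member. A minimal Python sketch follows. It assumes the structures have already been reduced to comparable canonical identifiers, which the presentation did not specify, and the identifiers shown are invented.

```python
def recovery_percentage(reference, extracted):
    """Percentage of reference structures also present in the extracted set."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    found = reference & set(extracted)
    return 100.0 * len(found) / len(reference)


# Hypothetical identifiers standing in for canonical structure keys.
us_patent = {"S1", "S2", "S3", "S4"}
cn_patent_method = {"S1", "S3", "S4", "X9"}
pct = recovery_percentage(us_patent, cn_patent_method)
```

Structures found only in the Chinese patent (here "X9") do not affect the figure, which measures recovery of the reference set rather than precision.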
The old Markush Viewer has been replaced by the newly developed Markush Editor. It now features a hierarchical representation of fragments’ relationships; visualisation of the nesting view, and preview; separate editing of the individual fragments; and an integrated structure checker. In version 6.3 it is available as a desktop application and as a component capable of integration.
An upcoming API in version 6.4 will allow document annotation, and display of annotated patents and documents in PDF and XML. ChemCurator is the compound and Markush editor component. Users will be able to drag and drop structures from the document, connect between the document and extracted data, and validate the Markush structure against the examples. ChemCurator, available as a desktop application in version 6.4, will support a multi-display environment. In future, general document curation will be usable for extracting specific structures from scientific journal articles and internal company reports and for extracting exemplified structures from patents. There will be a wizard to detect relevant structures automatically, and exclude fragments and chemical elements etc.
The global consulting team is led by Tim Dudgeon and the US team by Erin Bolstad. There are consultants, developers and application scientists in Argentina, the Czech Republic, Hungary and the United States. Erin gave some project examples chosen from the following (in roughly increasing order of complexity):
- Training of end-users, developers and administrators
- Form and workflow migration, plus training
- Custom forms and scripts
- Cartridge and database migration
- Spotfire integration
- Patent-mining workflow and tool development
- Basic SAR database and access
- European Lead Factory
- Novartis reactions database
- Large scale project management for global pharma.
Mini-Reg is a simple, but flexible registration system handling chemistry and biology data in IJC with customised forms and scripts. It is used in about 10 companies now. The patent-mining workflow is a tool base for extracting structures and chemical text from project-specific patents and putting them into a relational database, with a query interface across linguistic-based, calculated information and chemical structures. It streamlines the process of Markush structure extraction, and exploration of chemical space and potential enumerated white space. DuPont’s Doc2DB database is a system to trawl the DuPont ELN and stored documents, with a Web application for complex queries. It is based on ChemAxon’s Doc2DB technology.
Next, Erin showed some screenshots from the current Spotfire integration project. The Novartis reactions database is a reactions data warehouse with import from various legacy ISIS databases and a live feed from the CambridgeSoft ELN. ISIS reaction databases were replaced with a ChemAxon database and structures from the CambridgeSoft ELN are incorporated into this database on a daily basis, with the information being made available in an AJAX-style Web application providing query and filtering capabilities, with a Web services interface for fitting into SOA. Large scale project management for global pharma consisted of an IJC roll-out as a global reporting tool in GSK, training, customisation of IJC, and additional administrative software creation. It took over three years. In BMS, IJC customisation for global roll-out needs, training, migration of data, consulting on data-mart integration, and thin client development took over two years. (Presentation in our Library)
Tim Dudgeon gave more details about European Lead Factory. This is an alliance of participants (members of the European Federation of Pharmaceutical Industries and Associations (EFPIA), SMEs including ChemAxon, and academic departments), all collaborating on identifying novel leads. Some 300,000 compounds will come from EFPIA members’ internal collections and 200,000 from newly synthesised libraries. Out of 13 work packages, ChemAxon is involved in two: sourcing chemical library proposals, and review and selection of chemical library proposals.
Consortia members and “the crowd” submit proposals to a library selection committee. The committee assesses a library based on molecular properties, structural features, novelty, diversity potential, synthetic tractability, and innovative design. If approved, a library is validated, optimised, and then produced. To aid the committee, ChemAxon has produced a Web application, featuring a workflow engine, a job queuing system, a document store, report generation, automated e-mail notification, role management, and a dynamically generated user interface. The workflow engine has three workflows for different types of user: consortium members, non-citizens of the European Union (EU) or European Economic Area (EEA), and citizens of the EU or EEA. There are different levels of access for submitters, committee members and the committee chairman, and others. The user interface is dynamically generated to ensure that users see only what they are supposed to see and only do what they are supposed to do. Tim showed screenshots from registration, entering a library, initial property calculations, submitting the library, final property calculations, and library assessment. He concluded with this statement from Adam Nelson, Chair of the Library Selection Committee:
“The Library Selection Committee has been using the Web tool since the start of 2014 to support the selection of library proposals. Since then, well over 100 library proposals have been considered, around 80 of which have been approved for synthetic validation. The procedure for assessing and processing the proposals has been straightforward. The tool has allowed the proposals to be handled confidentially and to be assessed rigorously; it has saved huge time for both library proposal submitters and members of the Library Selection Committee.”
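The idea of a user interface generated from roles, so that users see and do only what they are permitted to, can be sketched in a few lines of Python. This is an illustration of the general pattern, not the European Lead Factory application: the role names follow the report, but the action names and the permission table are hypothetical.

```python
from enum import Enum


class Role(Enum):
    SUBMITTER = "submitter"
    COMMITTEE_MEMBER = "committee member"
    COMMITTEE_CHAIR = "committee chair"


# Hypothetical permission table: which actions each role may see and perform.
PERMISSIONS = {
    Role.SUBMITTER: {"enter_library", "submit_library", "view_own_proposals"},
    Role.COMMITTEE_MEMBER: {"view_all_proposals", "assess_library"},
    Role.COMMITTEE_CHAIR: {"view_all_proposals", "assess_library", "approve_library"},
}

# Fixed display order for menu entries.
ALL_ACTIONS = [
    "enter_library", "submit_library", "view_own_proposals",
    "view_all_proposals", "assess_library", "approve_library",
]


def visible_actions(role):
    """Build the menu for a role: only permitted actions, in display order."""
    allowed = PERMISSIONS[role]
    return [action for action in ALL_ACTIONS if action in allowed]


def perform(role, action):
    """Check the permission again before acting, not only when drawing the menu."""
    if action not in PERMISSIONS[role]:
        raise PermissionError(f"{role.value} may not {action}")
    return f"{action} done by {role.value}"


menu = visible_actions(Role.COMMITTEE_MEMBER)
```

Hiding an action in the menu and refusing it at execution time are kept as two separate checks, so a request crafted outside the generated interface is still rejected.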
Nóra Lapusnyik introduced the partner session with comments about “speedcubing discovery”. This year sees the 40th anniversary of the Rubik’s Cube, a Hungarian invention. Ernő Rubik took a month to solve the cube; the current record is less than 6 seconds. In an analogy to solving the cube, the biggest challenge in drug discovery is trying to analyse and finally make a decision using data from diverse data sources: you can get confused trying to connect the pieces. The solution is to manage your structures with care, and index them and store them with JChem Base, allowing connections to other applications. ChemAxon has about 40 advertised partners making connections. At this meeting there were presentations from:
- François Beillouin of Agilent (Presentation)
- Jan Holst Jensen of Biochemfusion (Presentation)
- Burkhard Schäfer of BSSN Software (Presentation)
- Steve Yemm of Core Informatics (Presentation)
- Aaron Hart of KNIME (Presentation)
- Mark Gonzales of LabWare (Presentation)
- Guy Singh of Linguamatics (Presentation)
- Matthew Segall of Optibrium (Presentation)
- Bernhard Schirm of Quattro Research (Presentation)
- Andreas Witte of Schrödinger (Presentation)
The first talk, from Edith Richter of Boehringer Ingelheim, had the theme of managing a relationship to get best results. Two partners, ChemAxon and Boehringer Ingelheim, have a thousand differences, but one goal: to achieve migration of the ISIS platform to ChemAxon tools. ChemAxon’s mission is to enable scientists to manage their chemical and related data via intuitive, powerful and cost effective informatics tools, developed together with customers and partners. Boehringer Ingelheim’s philosophy is to serve patients, attract and retain talent, and act with integrity, honesty, transparency, fairness and full regulatory compliance. ChemAxon is 16 years old and has 140 employees. Boehringer Ingelheim has 47,000 employees at multiple sites, and it has 1,400 IT staff globally.
The company made the decision in 2012 to migrate from ISIS to the ChemAxon tool set as the chemistry infrastructure platform. The first phase of the migration, in 2013, focused on cartridge technology and basics; the second, in 2014, focuses on desktop applications. The migration necessitates big changes. For a small company like ChemAxon making a change is akin to turning round an agile motor boat but for Boehringer Ingelheim, with its large number of users and their global distribution, implementing software changes is inevitably more complex. For example, the company has standards and policies such as the limited number of versions of Oracle and Linux that must be used for all systems. New things have to fit into the existing structure. ChemAxon does try to respond when told about these challenges.
Another challenge is horizontal and vertical communication. A big company has project lead, enterprise architect, business consultant, software developer, and database administrator functions, and many others. In a small company one person can do database administration, networking, licensing etc. A ChemAxon employee is more focused on a product (e.g., JChem Base or Calculators) than on a function. Support is now handled differently: the two companies are working together on the PC to get a solution. Roadmaps are relevant: planning what functionality appears in each release. Boehringer Ingelheim may not accept all releases. Dependencies must also be considered. Boehringer Ingelheim uses not just ChemAxon tools, but also IDBS software and Spotfire. If one of these ChemAxon partners uses JChem version 6.1, Boehringer Ingelheim cannot implement JChem version 6.3. Instant JChem, JChem for Excel and Marvin are horizontal tools across all databases: there are dependencies here too. The best approach is to talk and explain, listen and be aware, and stay enthusiastic and dedicated. (Presentation in our Library)
Novartis Institutes for Biomedical Research (NIBR) introduced IJC to replace ISIS in 2011. Anna Pelliccioli said that they had hundreds of little ISIS databases and a few big ones. The data had to be exported as SDfiles or RDfiles via IJC into local and shared databases under JChem Cartridge. It took two years to complete the migration. IJC was deployed via Java Web Start to make software updates and bug fixes easy. Ad hoc customisation for single, shared projects was done using Groovy scripting, but the code incurs a maintenance burden. NIBR-specific features for the whole user community were implemented with Java plug-in technology. Here support is easier as the code is stored in one place.
Preparative laboratory databases, analytical data and related structures, the Radioactive Compound Inventory, Building Block Archive, Chiral Separation Database and NIBR IP Priority Database (including ChemAxon’s Markush Search Plugin) are now shared, searchable databases. Scientists collaborating in teams use IJC to enter data into the databases or retrieve data from them. Groovy scripts were used to create a custom application for the Radioactive Compound Inventory.
Some large structure collections that do not fit into the data warehouse are nevertheless useful to a wide community. These include the Vendor Sample Database (more than 10 million structures), legacy combinatorial chemistry library collections, and Accelrys’ Available Chemicals Directory (ACD). In these cases a data administrator maintains the JChem databases, and scientists access them using IJC. The SciQuest enterprise resource management (ERM) and IJC share the same ACD data in Basel, and ChemAxon’s Test to Production metadata migrator transferred the IJC project from Basel to other ERM instances in Shanghai, Singapore, Cambridge (US) and Emeryville. Work is in progress on converting the Accelrys Metabolite database.
Assay data, batch registration and calculations required in-house Java plug-in development. Assay data are retrieved from the NIBR data warehouse using Web services and the Data Analysis and Reporting Tool (DART); forms are created and the data are annotated. IJC sits on local databases and shared databases, and links the scientists with DART and the Web Services. Scientists also access the batch registration service through IJC. They validate data against the registration service (which accesses the structure repository), highlight errors, fix the data in grid view, register from within IJC and get back identifiers. Calculating values for QSAR modelling, and other physicochemical properties, is done using the same framework as that currently used for Spotfire and KNIME access. The scientist enters a SMILES or compound number in IJC and the calculation is done via the Cheminformatics Framework (CiX), a platform where models are accessible to the whole NIBR community via Web Services. (Presentation in our Library)
The session ended with another talk from Edith Richter, this one more technical than strategic. As part of the migration from ISIS to ChemAxon, Boehringer Ingelheim replaced Cheshire with ChemAxon Standardizer and Structure Checker facilities on client and database levels. In an analogy with word processing, standardising is equivalent to translation, Structure Checking is equivalent to spell-checking, and Structure Checking with Fixing is equivalent to auto-correction.
In four steps, Boehringer Ingelheim reviewed and discussed its existing rules, decided about Standardizing and Structure Checking, implemented customised checkers with ChemAxon, and provided templates. Some of the features that Boehringer Ingelheim required were not initially available, so ChemAxon carried out some customisation. The final templates can be called by both JChem Cartridge and Marvin tools, enabling an identical structure checker, compliant with internal guidelines, across different tools.
Packages provide functions for single- and multi-step checks and fixings, plus functions for automated checks. Database tables store the default checker and standardising configuration; customised results messages for checks; and dedicated structures for tests. Basic checkers are implemented as abbreviated groups and pseudo atoms etc. Substructure checkers work with substructures defined as SMARTS. These can be set individually in a very flexible approach. Customised checkers were provided from consultancy projects with ChemAxon; Boehringer Ingelheim was happy with the service and found the checkers easy to implement both for the database cartridge and Marvin tools. (Presentation in our Library)
Aptuit has implemented compound registration and integrated discovery data management with the ChemAxon platform. Marco Brazzarola gave a presentation about this. Aptuit is a CRO with five sites, more than 700 employees, and more than 600 customers over the years. When the company started up, without its former pharmaceutical industry legacy tools, the initial priority was to set up simple tools and processes to enable business operation. From the second year on the company defined a roadmap for drug discovery information systems, focusing on building incremental layers of data integrity. Key systems were the integrated biological data system (IBDS) project for handling assay results, and the ChemAxon registry and inventory projects.
In 2010 the company changed from being a pharma R&D centre into an integrated contract discovery and development organisation. It has been collaborating with ChemAxon since 2011. A tactical solution to the registration system was put in place early in 2011, based on Oracle database tables and the IJC client as a user interface. A temporary inventory system was put in place based on Excel files plus a Visual Basic for Applications macro to retrieve data from the registry.
In 2012, after a project with another vendor failed, ChemAxon compound registration was selected. A first version of the IBDS system was built, together with one customer, based on Oracle tables, with IJC used to load biological data results starting from an Excel template: a centralised service accessing biological and chemical data together.
In 2013, after an assessment of other commercial products (all of them too expensive), Aptuit IT started to implement a bespoke inventory system based on the Microsoft .NET framework. A project progressed to customise ChemAxon’s compound registration based on Aptuit’s requirements, and to integrate it with the Aptuit inventory system. A final version of IBDS Loader was released to scientists in 2013; IJC version 5 was used to visualise and search biological and chemical data together.
In February 2014, Project And Client MANagement (PACMAN), a bespoke solution based on Microsoft .NET to manage clients and projects, was released to administrators, together with Attribute Based Access Management System (ABAMS), a single bespoke solution to manage user access and data segregation centrally for all the other systems. The inventory system was released to two trained CMS people and 50 trained chemists. ChemAxon Compound Registration version 6.1, customised for Aptuit and integrated with the Aptuit inventory, was released to 50 trained chemists. IJC version 6 was released to scientists for visualising and searching chemical and biological data from the registry, inventory and IBDS systems through IJC shared projects. A new version of the IBDS system, integrated with PACMAN, ABAMS, the registry and inventory, was released to 25 scientists. In 2014-2016 Aptuit plans a complete roll-out of the IBDS system, an ELN, integration of plate management within the existing systems, assay request tracking, and client real time data. Marco showed some screenshots from the current discovery systems.
In a rapidly changing environment ChemAxon gave Aptuit flexibility, skills complementary to those of the internal IT staff, the ability to work by modules, allowing incremental commitment, easy integration with existing bespoke systems and those being built, a fast response to changes in the process and requirements, and dedicated experts. On the other hand, ChemAxon lacked visibility on a wider strategy and it would have been nice if a single, dedicated person had been available. (Presentation in our Library)
Brian Kreiser of Niels Clauson-Kaas spoke next. Niels Clauson-Kaas is an employee-owned small company (28 staff) with 58 years’ experience in synthetic organic chemistry, process development, manufacturing of active pharmaceutical ingredients and related services. The founder of the company had made a reaction “database” on index cards. The reactions were used to make a ChemBase database under DOS from 1990 on. More than 8,000 reactions were stored in the database, which was maintained until recently using a DOS emulator. By 2013, problems were evident. The DOS platform was obsolete and few people in Niels Clauson-Kaas were comfortable using it, so important chemical know-how was not fully exploited, as the project managers did not check the reaction database. CambridgeSoft (as it was then) was unable to convert the database into a contemporary system so Niels Clauson-Kaas turned to ChemAxon. ChemAxon offered to test the ChemBase data free of charge, and a successful pilot project was performed on a small part of the database. The full ChemBase database was then converted to Instant JChem. Project managers are now actively searching the database for information and are able to add new reactions to the database. Access to historical reaction data is now preserved for the future.
Péter Cseh of ChemAxon outlined the technical details of the project. ChemAxon even considered reverse engineering the binary data but agreed to exploit all other possibilities first. The first idea was image recognition and partial import with manual fixing afterwards. Then ChemAxon looked at the text format; the current importer could import it, but lost some information. Both processes would suffer from the human error factor, and checking the results would cost almost as much as the actual import. Both conversions would take three months to complete, even with outsourcing.
Eventually ChemAxon managed to find a partial description of the obsolete 1999 MDL Chemical Personal Software Series (CPSS) format used in ChemBase (not a standard .rxn format). Armed with the documentation, they were able to convert everything into a relational database automatically, using a customised converter. There were a few problems at proof of concept stage. The CPSS files contained intermediates in two-step reactions which were not full multi-step reactions so ChemAxon created a relational database which contained the reactants and the products in one table and intermediates in another. They also found several invalid entries in the database, but Niels Clauson-Kaas helped to fix those issues before the full conversion exercise. The conversion was done in days after Niels Clauson-Kaas sent ChemAxon a “frozen” version of the data. The overall project started in December 2013 and ended at the beginning of February, on budget and on time. The converter is reusable if anyone else should need to convert a ChemBase database. (Presentation in our Library)
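The two-table layout described above can be illustrated with a minimal sketch. The talk did not show the actual schema, so the table names, column names and the use of SQLite and plain structure strings here are all assumptions made for illustration only.

```python
import sqlite3

def build_reaction_db():
    """Create an in-memory database with the two tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE reaction (
            reaction_id INTEGER PRIMARY KEY,
            reactants   TEXT NOT NULL,   -- reactant structures as text
            products    TEXT NOT NULL    -- product structures as text
        );
        CREATE TABLE intermediate (
            intermediate_id INTEGER PRIMARY KEY,
            reaction_id     INTEGER NOT NULL REFERENCES reaction(reaction_id),
            step_order      INTEGER NOT NULL,
            structure       TEXT NOT NULL
        );
    """)
    return conn

def add_reaction(conn, reactants, products, intermediates=()):
    """Store one reaction; intermediates go into their own table."""
    cur = conn.execute(
        "INSERT INTO reaction (reactants, products) VALUES (?, ?)",
        (reactants, products),
    )
    reaction_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO intermediate (reaction_id, step_order, structure) "
        "VALUES (?, ?, ?)",
        [(reaction_id, order, s) for order, s in enumerate(intermediates, start=1)],
    )
    return reaction_id

def intermediates_for(conn, reaction_id):
    """Return the intermediates of a reaction in step order."""
    rows = conn.execute(
        "SELECT structure FROM intermediate WHERE reaction_id = ? ORDER BY step_order",
        (reaction_id,),
    )
    return [row[0] for row in rows]
```

Keeping intermediates in a separate table means that a one-step reaction simply has no rows there, while a two-step entry keeps its intermediate without being promoted to a full multi-step reaction.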
Richard Bolton of GSK outlined how his company’s relationship with ChemAxon had evolved over the years. GSK began a major chemistry tools simplification effort in 2008: Chemistry Research IT simplification programme (CRISP). In 2013 this began again as “IT Fitness” across Discovery. The ChemAxon chemistry engine was chosen to replace Daylight, Accord and the MDL Relational Chemical Gateway (RCG) back-end systems, and IJC was chosen to replace ISIS. In 2009 GSK began to work with ChemAxon tools to replace a complex collection of legacy tools in a project planned to take approximately 24 months. ISIS was replaced with IJC; Web Services that powered chemistry methods had ChemAxon methods added, and computational scientists began to use the ChemAxon toolset in place of the Daylight toolkit. The success of all this work led to a GSK-ChemAxon collaboration on small molecule registration in March 2011.
GSK has given many related talks at ChemAxon UGMs:
- Migration of Oracle cartridges and remediation of Web Services. SOA-at-GSK
- Implementation of ChemAxon in an SOA environment
- SAR analysis in Excel using Helium and JChem
- Deploying Instant JChem on an enterprise scale
- Delivering Instant JChem to the masses
- A novel approach to pharmaceutical registration
- Going live with registration as a service
- Registration as a service
IJC as a replacement for ISIS Base was rolled out in the middle of 2011. Bespoke tools created the test-production environments. Unfortunately slow performance in the United States necessitated Citrix. By that time GSK, Novartis and BMS all wanted ChemAxon enterprise scaling. GSK evaluated the browser based Plexus Discovery in 2013 but the GSK database architecture made performance slow. In 2014 an upgrade to JChem version 6.2 enabled GSK to use ChemAxon Services to replace the test-production publication mechanism but the company is still using Citrix for US users. (Databases are mostly in the United Kingdom.) In 2015 it is proposed to upgrade to Plexus Discovery resourced by ChemAxon Services.
Registration was to be simplified by conversion to a service but GSK’s legal experts raised concerns about hosting registration outside the firewall. In March 2011 a collaboration with ChemAxon was begun to create a new registration service, managed by ChemAxon but inside the GSK firewall, to comply with GSK business rules using standard formats. This went into production in March 2012. Registration is currently approximately 50% automated, that is, application of business rules is automatic. The system checks that the structure is drawn with enough information for all the stereocentres and relationships to be defined completely, and it checks that the structure complies with business rules. A “registration as a service” collaboration with ChemAxon began in the middle of 2013 to deliver feedback capability as part of the “product”. The target is 70% auto-registration with potential for 80%. For compounds that do not register automatically the scientist will be able to do editing and “fixing” using new compound checker tools. If a structure is too complex it can be sent to a registrar to complete. The roll-out of this solution is imminent.
Chemistry has been added to the GSK Enterprise Search engine using ChemAxon technology. GSK’s Socrates Search project won the 2013 Bio-IT World Best Practices Award for knowledge management. Socrates Search uses NextMove Software’s LeadMine and HazELNut products, in conjunction with HP Autonomy and ChemAxon’s JChem Cartridge. GSK data are crawled (ELNs, Documentum and Lotus) and the chemistry is extracted into an Oracle database with a ChemAxon powered index. JChem libraries were used for format conversions and image rendering. Socrates Search also integrates with some external sources (Reaxys, PharmaPendium, NCBI, ClinicalTrials.gov) through Web Services.
In future GSK plans to find a supported version for its chemistry representation in Spotfire: ChemAxon is developing a Spotfire plugin using a GSK tool as the model. GSK’s computational scientists are evaluating in-memory databases. The company is working with ChemAxon to ensure that the GSK roadmap aligns with delivery plans for Plexus, and is in partnership with Schrödinger to integrate the Plexus library design tools into LiveDesign. Support services for ChemAxon tools also need consideration: it should be possible to maintain component currency without starting up a project. ChemAxon first became a key vendor for GSK in 2009; it became a collaborator in 2011, and a service provider in 2014. Five years on, the JChem Platform is now the default GSK “chemistry engine”.
Roger Sayle of NextMove Software has been using ChemAxon tools in connection with an ISO standard. In the world of cheminformatics standards, a recent trend has been the promotion of authority-prescribed “de jure” standards (such as InChI and HELM), over the more traditional “de facto” standards (such as molfiles or SMILES strings). Voluntary “de facto” standards are selected by the community but obligatory “de jure” standards are imposed on an industry, often by bureaucrats and lawyers rather than experts. A recent example of this is ISO international standard 11238, entitled “Health informatics - Identification of medicinal products - Data elements and structures for the unique identification and exchange of regulated information on substances” or “IDMP”. This standard covers file formats for exchanging chemical structures between government agencies including the US Food and Drug Administration (FDA). An EU regulation also requires adoption of all five IDMP standards by 1st July 2016.
The framework requires use of non-semantic, random, fixed length unique identifiers that include an internal integrity check. The identifiers will be very similar to the FDA’s UNII (UNique Ingredient Identifier). The fixed width, non-semantic requirement rules out the use of SMILES, InChI, and V2000 molfile. The random requirement rules out CAS Registry Numbers, PubChem CIDs and ChEMBL IDs. The use of InChIKeys or similar hashes of connection tables and text may be possible.
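To make the three requirements concrete (random, fixed length, internal integrity check), here is a minimal sketch of such an identifier. It is not the UNII algorithm and not anything prescribed by ISO 11238: the 36-character alphabet, the length of ten and the simple sum-based check character are all assumptions chosen for illustration.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_uppercase  # 36 symbols (assumed)
BODY_LENGTH = 9                                    # assumed; plus one check character

def check_char(body):
    """Check character: sum of the symbol values modulo the alphabet size."""
    total = sum(ALPHABET.index(ch) for ch in body)
    return ALPHABET[total % len(ALPHABET)]

def new_identifier():
    """Random, non-semantic, fixed-length identifier ending in a check character."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(BODY_LENGTH))
    return body + check_char(body)

def is_valid(identifier):
    """True if the length, alphabet and check character are all consistent."""
    if len(identifier) != BODY_LENGTH + 1:
        return False
    if any(ch not in ALPHABET for ch in identifier):
        return False
    return check_char(identifier[:-1]) == identifier[-1]
```

A plain sum like this catches any single mistyped character but not two swapped characters; a production scheme would probably use a stronger check. The identifier carries no structural information, which is exactly why SMILES or InChI strings cannot play this role.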
ISO charges for access to official standards documents but late drafts of ISO 11238 are freely available on the Internet. Unfortunately, many of the technical examples were removed from the final standard and are due to appear in the upcoming “Implementation Guide(s)”. The standard requires at least one substance name or company code to be associated with each substance. One way to guarantee the existence of a suitable substance name is to use IUPAC naming software (such as ChemAxon’s) during submission to the unique coding authority. As an aside, Roger commented that the coverage of CXN’s structure-to-name software is state of the art.
Unfortunately the document that Roger used has been typeset by editors who have inadvertently changed whitespace without appreciating the impact this has on chemistry tools. In Annex A there are SMILES strings and InChIs with spurious spaces; the “InChI=” prefix is stripped; and the fact that molfiles use fixed width columns and blank lines is overlooked. These unintentional typographical errors may perhaps be the result of poor fonts (with the exception of “InChI=”) but the content of the original Annex B from the draft indicates that these issues were more widespread and may arise from ignorance of cheminformatics file formats.
All is not lost. NextMove’s CaffeineFix technology can perform “spelling correction” on SMILES strings, InChI and molfiles. As with IUPAC-like systematic names, these can each be specified by a formal grammar. The regular expression describing a molfile is compiled into a “finite state machine” (FSM) with 1333 states. The only allowed “corrections” are the deletion of new lines and the insertion of spaces or new lines, but only where permitted in the grammar/FSM. Depth-first recursion is used to identify a minimal set of edits to correct the input. This can be implemented in the ChemAxon toolkit.
A similar spelling correction variant that allows uppercase characters to be mapped to lowercase, and the prefix “InChI=” to appear at the start of a string can also be used to fix many InChIs, for example “1S/C17H21CLN4O/C1-22-12-3-2-4-13…” to “InChI=1S/C17H21ClN4O/c1-22-12-3-2-4-13…”. Roger took 1.35 million standard InChIs from ChEMBL, converted lowercase to uppercase, fixed the InChIs and checked whether the original InChI could be regenerated. The roundtrip was 99.5% successful. He discussed four case-insensitive ambiguities: 6,562 examples of BRh/BrH, 12 BRf/BrF, 4 CS/Cs, and 18 BI/Bi. All four cases can be fixed by parameterising the model. The first two are practically unambiguous. The other two require taking advantage of the fact that in drug molecules the Cs and Bi are typically counterions.
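The four ambiguities arise because an uppercased run of letters can sometimes be split into element symbols in more than one way. The sketch below is a toy illustration of that point, not NextMove’s CaffeineFix: it uses a depth-first recursion over a deliberately small, assumed set of element symbols and returns every possible case restoration.

```python
# A small, assumed subset of element symbols; a real tool would use the full table.
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "I",
            "Cl", "Br", "Cs", "Bi", "Rh", "Rf"}

def restorations(upper):
    """Return every way to read an uppercased letter run as element symbols."""
    if not upper:
        return [""]
    results = []
    for size in (1, 2):              # element symbols have one or two letters
        if len(upper) < size:
            continue
        chunk = upper[:size]
        symbol = chunk[0] + chunk[1:].lower()
        if symbol in ELEMENTS:
            # Depth-first: fix this symbol, then restore the remainder.
            for rest in restorations(upper[size:]):
                results.append(symbol + rest)
    return results
```

With this element set, `restorations("CL")` has a single reading, whereas `restorations("BRH")` has two (B plus Rh, or Br plus H). Choosing between such readings is where the parameterised model described by Roger comes in.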
The Java source code for recovering molfiles and InChIs from corrupted versions seen in the ISO 11238 draft has now been contributed to the ChemAxon forum, allowing Marvin and JChem to read the examples given in that document. Whether this functionality will be required to support the pending Implementation Guide requirements remains to be seen. Attention to detail is important in standards writing. Roger suggests that ISO 11238 IDs may become as popular as CAS Registry Numbers.
First, Mark Davies of the European Bioinformatics Institute (EBI) talked about open patent data. The EBI, part of the European Molecular Biology Laboratory (as EMBL-EBI) offers a large number of resources in genes and genomes; gene, protein and metabolite expression; protein sequences, families and motifs; molecular structures; chemical biology; systems; reactions, interactions and pathways; and literature and ontologies. In December 2013 EMBL-EBI acquired SureChem, a chemistry patent mining product from Digital Science, rebranded it SureChEMBL, and will provide it as an open resource to the community. Links are included in UniChem, a non-redundant database of pointers between chemical structures and EMBL-EBI chemistry resources. UniChem can create structure-based hyperlinks “on the fly”, by use of REST Web Services. At the time of writing, UniChem has 66 million unique chemical structures from 24 data sources.
SureChEMBL, accessible via a Web interface and API, will be searchable by free text keywords and Lucene fields, patent IDs and bibliographic information, patent authority and date, and chemical structure. Users will be able to retrieve chemistry (with additional filters), patent family information, and annotated full patent text. The RESTful API is well documented and KNIME nodes will be made available.
The database content (from WO, US, EP and JP patents) includes exemplified structures from the patent title, description, abstract and claims; structures from text from 1976 onwards, and structures from images from 2007 onwards. The US Patent and Trademark Office (USPTO) has provided “Complex Work Units” (CWUs) since 2001. CWU file types (molfiles and CDX files) are processed as part of the pipeline. After entity extraction, five different name-to-structure programs are used, and one image-to-structure program, Chemical Literature Data Extraction, CLiDE. ChemAxon’s name-to-structure and database building software products are used. The system is currently built and optimised to run on Amazon Web Services.
In future new entity types (proteins, diseases, cell lines, bioassay data and perhaps even Markush structures) might be added. Optical Structure Recognition Application, OSRA and GGA’s Imago compound image extraction could be added. Funding has been approved to include SureChEMBL data in Open PHACTS and perform Resource Description Framework (RDF) conversion, target indexing and API development. These are just a few of EBI’s ambitious future plans. (Presentation in our Library)
Alfonso Pozzan of Aptuit also talked about Document to Structure applications. Eighteen months ago Aptuit started a project to extract information from patents using ChemAxon’s document-to-structure (D2S) software. It was possible to search more than 20 million structures from patents using what was then the SureChem Open application (now SureChEMBL). In one case study Alfonso used 23 recent patents concerning Orexin1 and Orexin2 antagonists. Four of these patents were in Japanese and were first translated into English using the Google patent translation facility. Structure extraction was carried out using ChemAxon’s molconvert (version 6.2.1) and OSRA (version 2.0.0).
Using very simple cut-offs like molecular weight less than 600, six or more heavy atoms, and a “calculable” ClogP value, the initial 30,000 non-unique, raw “structures” were quickly reduced to about 13,000 non-unique structures. Amongst those structures it was possible to identify those derived from D2S or OSRA conversion, and to have an initial idea of the type of structures and fragments extracted from a patent. These initial considerations led Aptuit to replace the initial cut-off with the following one:
- Molecular weight greater than zero. (This is an effective way to remove “artefacts” with molecular weight of zero, typically arising from terms such as “halogen” and “alkyl” within the patent.)
- Remove molecules containing atoms other than C, N, O, P, S, F, Cl, Br, and I.
- Remove molecules that do not contain any carbon atom.
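The three filters listed above can be sketched as follows. This is an illustration only, not Aptuit’s workflow: it assumes each extracted “molecule” is already available as a record with a precomputed molecular weight and a list of its heavy-atom element symbols (hydrogens not listed), and the field names are hypothetical.

```python
ALLOWED_ELEMENTS = {"C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}

def passes_filters(record):
    """Apply the three filters to one record (hypothetical field names)."""
    elements = set(record["elements"])      # heavy-atom symbols only (assumed)
    return (
        record["mol_weight"] > 0            # drops zero-weight artefacts
        and elements <= ALLOWED_ELEMENTS    # no atoms outside the allowed list
        and "C" in elements                 # at least one carbon atom
    )

def apply_filters(records):
    """Keep only the records that pass all three filters."""
    return [record for record in records if passes_filters(record)]
```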
The combination of these three filters removed about 18,000 molecules from the initial set of 30,000. Further insights about the population of “molecules” extracted from the patents were made by computing three simple descriptors for each molecule. The first one is the number of intra-patent duplicates; over-duplicated molecules within the same patent may be of no interest as they might represent internal standards or common reagents. The second descriptor is a simple heavy atom count. The third descriptor is the count of unique linear fragments (1 to 7 atoms in length) of each molecule, computed from the 2D connectivity table using code originally developed in-house to generate hashed fingerprints based on the Daylight hashed fingerprints concept. This descriptor could be considered as a very simple index of molecular size and complexity. It also has the advantage of identifying highly symmetrical molecules, which tend to have a relatively low complexity in terms of number of fragments compared to their molecular weight. A final set of cut-offs was identified after visual inspection, by keeping only molecules with 22 or more heavy atoms, a number of unique fragments (1-7 atoms in length) ≥ 130 and no more than three intra-patent duplicates.
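As an illustration of the third descriptor, the sketch below counts unique linear fragments of 1 to 7 atoms in a molecule given as a list of heavy-atom symbols and a list of bonds. Aptuit’s in-house code was not shown, so this is an assumed simplification: bond orders are ignored and a fragment is identified only by its sequence of atom symbols, read in either direction. The final cut-offs quoted above are included as a small helper.

```python
def count_linear_fragments(atoms, bonds, max_atoms=7):
    """Count unique linear paths of 1..max_atoms atoms (bond orders ignored)."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    fragments = set()

    def extend(path):
        symbols = tuple(atoms[i] for i in path)
        # A path and its reverse are the same fragment.
        fragments.add(min(symbols, symbols[::-1]))
        if len(path) == max_atoms:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:          # linear paths never revisit an atom
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return len(fragments)

def keep_molecule(heavy_atoms, fragment_count, intra_patent_duplicates):
    """Final cut-offs quoted in the talk."""
    return (heavy_atoms >= 22
            and fragment_count >= 130
            and intra_patent_duplicates <= 3)
```

In this simplified scheme a benzene ring (six carbons in a cycle) yields only six unique fragments, one per path length, which shows why highly symmetrical molecules score low on this descriptor.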
All the initial 23 patents were represented at the end of the analysis but there was wide variability in the number of structures extracted from each patent. A more detailed analysis was carried out at single patent level. For example, from patent WO2004026866A1, considering two structures differing only by a nitrogen atom in one ring, only one structure was recognised. This problem seemed to arise more from an OCR issue than from a name-to-structure fault (sometimes the letter “l” is misinterpreted as the number “1” and vice versa).
Alfonso has converted his 2D patent structures into 3D ones and has built a system where medicinal chemists can carry out pharmacophore screening and shape similarity. Flipper, fixPka, Omega2 and ROCS from OpenEye were used to prepare and generate 3D conformations, and perform shape similarities; the Molecular Operating Environment, MOE was used for the pharmacophore analysis and screening. Alfonso showed an example of a selectivity plot made after semi-automatic structure and IC50 extraction from WO2013050938A1. D2S was needed but not sufficient: a combination of OCR and Perl scripting and parsing was necessary.
Alfonso concluded that D2S and OSRA combined with an appropriate post-processing workflow can effectively increase knowledge on a particular target of interest. With effective post-processing, large numbers of patents can be mined. A high level of automation can be achieved. Nevertheless, Aptuit’s process is still far from being perfect; extracting biological data and associating them with structures can still take some time, due to patent variability which makes full automation in the post-processing more challenging. (Presentation in our Library)
Serge Parel of Exquiron entitled his talk “Farewell, Pipeline Pilot”. At the end of 2013 Exquiron was using the ChemAxon JChem Chemistry cartridge on Oracle, Accelrys Pipeline Pilot as a workflow platform, Accelrys Pipeline Pilot Chemistry Library as the cheminformatics platform, and Exquiron’s own biological data warehouse (“NaviGator”) on Microsoft SQL Server. The crunch came when the financial terms for renewing the Pipeline Pilot licence proved to be unsuitable for a small enterprise with only 10 users. The Pipeline Pilot Chemistry Library and ChemAxon’s Pipeline Pilot components had duplicate functionalities.
The solution was to replace Pipeline Pilot with KNIME and the Pipeline Pilot Chemistry library with ChemAxon (Infocom) nodes. Migration of lots of short protocols was done in-house. Dose-response data reporting was implemented by ChemAxon consultants with support from KNIME. The exhaustive hit expansion toolbox has two parts; ChemAxon consultants built the hit expansion functionality and the data fusion and analysis functions were written in-house.
After a hit finding campaign, chemists want to see structures and biological results; biologists want to see biological results and dose-response curves. The workflow for the reporting application involves first making an SDfile with data from NaviGator, from the JChem cartridge (or an SDfile) and additional text input. Report elements (structures, results tables, and dose response curves) are generated from the SDfile and combined to make a PDF file for reporting. The data crunching stage of making the SDfile had to be implemented differently in KNIME but KNIME helped with a workaround. The KNIME solution for generating the dose response curves is also different, but is more elegant than the Pipeline Pilot one, and a results table and dose response curve is finally displayed satisfactorily alongside each structure.
Serge’s second example was a hit expansion workflow. Andreas Bergner and Serge have demonstrated that template virtualisation by bioisosteric enumeration and other rule-based methods, in combination with standard similarity search techniques, represents a powerful approach for hit expansion following high-throughput screening campaigns (Bergner, A; Parel, S. P. Hit Expansion Approaches Using Multiple Similarity Methods and Virtualized Query Structures. J. Chem. Inf. Model. 2013, 53, 1057–1066). A workflow was built that incorporates Turbosim similarity search and data fusion (Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Enhancing the effectiveness of similarity-based virtual screening using nearest-neighbor information. J. Med. Chem. 2005, 48, 7049-7054; Willett, P. Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today 2006, 11, 1046-1053).
The functionalities of the workflow were all ported to KNIME using the ChemAxon/Infocom nodes, with the exception of the pharmacophore graphs. The execution speed of the workflow had to be improved since the current Turbosim implementation was very slow due to the large number of similarity searches performed. FCFP, ECFC and MACCS keys are used as fingerprints in the similarity searches. After calculation, these fingerprints are stored into a KNIME table file, and the structures are stored separately into a local JChem database. In the Pipeline Pilot method all structures and fingerprints were carried through the workflow; in the KNIME version you load only what is needed when you need it. The KNIME similarity search implementation is orders of magnitude faster.
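For readers unfamiliar with the idea behind turbo similarity searching, the following minimal sketch shows the principle on toy data. It is not Exquiron’s KNIME workflow: fingerprints are represented as plain sets of “on” bits, the function names are hypothetical, and maximum-score fusion is assumed as the data fusion rule.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank(query, database):
    """Conventional similarity search: names sorted by decreasing similarity."""
    scores = {name: tanimoto(query, fp) for name, fp in database.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))

def turbo_search(query, database, n_neighbours=2):
    """Re-search with the query's nearest neighbours and fuse by maximum score."""
    neighbours = rank(query, database)[:n_neighbours]
    references = [query] + [database[name] for name in neighbours]
    fused = {
        name: max(tanimoto(ref, fp) for ref in references)
        for name, fp in database.items()
    }
    # Neighbours score 1.0 against themselves, so they stay at the top.
    return sorted(fused, key=lambda name: (-fused[name], name))
```

Every extra reference structure means another pass over the database, which is why the number of similarity searches grows quickly and why fast fingerprint lookup matters.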
Ring generalisation was also carried out: each atom type is set to “any” if it is in a ring and each bond type is set to “any” if it is in a ring. This procedure could not be performed in Pipeline Pilot with PilotScript and the Chemistry Collection but it works perfectly well in KNIME by calling the ChemAxon API in a Java Snippet.
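The ring generalisation rule can be illustrated on a simple graph representation. This sketch does not use the ChemAxon API or a KNIME Java Snippet; it assumes a molecule given as a list of atom symbols and a dictionary of bonds, finds ring bonds as those whose removal leaves their two atoms still connected, and writes “any atom” and “any bond” using the SMARTS symbols `*` and `~`.

```python
def connected(start, goal, bonds):
    """True if goal can be reached from start using the given bonds."""
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def ring_bonds(bonds):
    """A bond is in a ring if its ends stay connected when it is removed."""
    bonds = list(bonds)
    return {
        bond for bond in bonds
        if connected(bond[0], bond[1], [b for b in bonds if b != bond])
    }

def generalise_rings(atoms, bonds):
    """Set ring atoms to '*' and ring bonds to '~'; leave the rest unchanged."""
    ring = ring_bonds(bonds)
    ring_atoms = {i for bond in ring for i in bond}
    new_atoms = ["*" if i in ring_atoms else s for i, s in enumerate(atoms)]
    new_bonds = {b: ("~" if b in ring else order) for b, order in bonds.items()}
    return new_atoms, new_bonds
```

For a three-membered ring carrying one substituent, only the substituent atom and its bond to the ring keep their original types.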
Bioisosteric transformations were previously applied in Pipeline Pilot to generate new virtual template structures but there is no bioisosteric transformation node available in KNIME. Bioisosteric transformations were therefore implemented as a KNIME metanode using a loop iterating through each reaction file and applying it to the incoming template with a ChemAxon/Infocom UniReactor node. This works well, but requires a strong atom-to-atom mapping in the .rxn file (which was not the case under Pipeline Pilot). A few problems remain which may be due to the way UniReactor handles reactions with two attachment points.
There are a few issues still to be addressed in the new systems. RAM requirements are high, at least for the desktop version of KNIME, when dealing with large SD files; and loops can sometimes be quite slow. There is currently no KNIME implementation for the DiscNgine pharmacophore graphs, but it might be possible to implement Screen3D instead; this is under evaluation. (Presentation in our Library)
KNIME has been used a great deal at Lilly since 2010, and includes open source contributions by Mike Bodkin’s computational chemistry group in the United Kingdom. (Mike is now at Evotec.) James Lumley, who spoke, is in the IT consulting group at Lilly which helps business areas with the final 20% of workflow implementation, and has built an infrastructure to develop and deploy KNIME workflows globally. There are many reasons to like KNIME, including fast integration with existing legacy systems and data, helping to drive science and IT collaboration and knowledge capture on the server, a strong precedent for workflow software in pharma, and easy creation of custom extensions (including a security model) through the Java/Eclipse platform.
Half of more than one hundred nodes used internally in Lilly rely on a service-oriented architecture (SOA) to provide tools and data to KNIME. Many nodes depend upon ChemAxon including chemical structure conversions, sketcher, molecule difference checking, and rendering. Many nodes link legacy systems. It was necessary to retain the “trusted” status of internal data access tools such as the system for integrated data access, Mobius, and to retain the power of in-house predictive modelling code such as support vector machine (SVM) models in UNIX code. Interfaces with new systems (such as the structure verification tools in analytical technologies) would also be needed. Usability was aided by adding a dashboard for service layer monitoring. SOA allows data security issues to be moved to the Web Service layer, reducing the load on “office” CPUs, but the Services need constant monitoring and a lot of work was needed to add the Microsoft NT LAN Manager (NTLM) authentication protocol and related security support to Web Service nodes.
Another area for workflow improvement was structure conversion. Converter nodes are among the top 20 most commonly used nodes in an analysis of more than 2000 workflows on the Lilly KNIME server and some workflows contain around 50% converter nodes. New users are confused by multiple molecule types and conversions (SMILES, CDK, and RDKit etc.). Lilly wanted to remove the need for users to add chemical converter nodes manually and wanted to ensure that nodes that use different chemical formats work together better.
KNIME.com introduced an “Adaptor Cell” in version 2.9. This is a container with several representations of the same entity; a node can add additional representations that can be re-used by downstream nodes. This avoids multiple conversions and the original representation is still present. It is vendor specific, with no “pseudo standards” such as SDfile. Lilly is developing an extension point for handling molecule type conversions that depends on the Marvin library. The extension point moves conversions into the node configuration, while the workflow still documents explicit type conversions and still retains support for converter nodes if and when appropriate. Converters could be “chained” if direct conversion is not available. This Lilly solution will be released as open source code. (Presentation in our Library)
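The adaptor idea (one container, several cached representations, converters chained when no direct conversion exists) can be sketched in a few lines. This is neither the KNIME Adaptor Cell API nor Lilly’s extension point; the class, its method and the converter registry are hypothetical, and the converters in the usage example are trivial stand-ins rather than real chemistry conversions.

```python
from collections import deque

class AdaptorCell:
    """Holds several representations of one entity, keyed by format name."""

    def __init__(self, fmt, value):
        self.representations = {fmt: value}

    def get(self, target, converters):
        """Return the target representation, chaining converters if needed.

        converters maps (source_format, target_format) to a function.
        """
        if target in self.representations:
            return self.representations[target]
        # Breadth-first search for the shortest chain of conversions.
        queue = deque((fmt, [fmt]) for fmt in self.representations)
        seen = set(self.representations)
        route = None
        while queue:
            fmt, path = queue.popleft()
            if fmt == target:
                route = path
                break
            for src, dst in converters:
                if src == fmt and dst not in seen:
                    seen.add(dst)
                    queue.append((dst, path + [dst]))
        if route is None:
            raise ValueError(f"no conversion route to {target!r}")
        # Apply each step and cache the result for downstream re-use.
        for src, dst in zip(route, route[1:]):
            self.representations[dst] = converters[(src, dst)](self.representations[src])
        return self.representations[target]
```

For example, with `converters = {("text", "tokens"): str.split, ("tokens", "count"): len}`, calling `AdaptorCell("text", "C C O").get("count", converters)` chains the two steps, and the cell afterwards still holds the original text alongside the intermediate and final representations.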
Brock Luty of Dart NeuroScience (DNS) described a ChemAxon/KNIME-based tool for designing chemical libraries. At DNS about 20 chemists are involved in the design and creation of chemical libraries. They needed a chemical library design tool to select reactants, enumerate products, and calculate properties prior to analysing and filtering the products. Since limited IT support was available, no one had time to spare, and the chemists were already overloaded with software, the IT approach was to standardise calculations and reactions, to simplify the system by wrapping processes and minimising import/export operations, and to enhance capabilities and speed by doing calculations remotely. The idea was for increased ease of use to result in decreased support, leaving everyone happy and productive.
Dart NeuroScience had already licensed JChem, Spotfire, and OpenEye’s Rapid Overlay of Chemical Structures (ROCS); the “glue” for the system was a customised set of KNIME nodes built using both Infocom’s JChem Extensions and the JChem API. The solutions use a service-oriented architecture, with domain Create, Read, Update, Delete (CRUD) graphical user interfaces written for specific entities using the model-view-controller (MVC) software architectural pattern. Traditional stateless computational services, such as property calculation or enumeration, can be based on scripts using command-line applications or written on KNIME. All the heavy lifting is moved to the servers, where parallelisation is automated, and KNIME is used as a service orchestration layer.
The “diversity elements” node allows chemists to select reactants by class, filter them by substructure, and import a list. The “deduplicate” node removes functionally equivalent reactants (e.g., salts). There is a node for selecting a reaction type from a set of 40 curated types: a reaction browser displays the reaction title (e.g., reductive amination), the classes of reactants expected as input (e.g., aldehydes and amines in the case of reductive amination) and an example reaction. Reaction images are clickable and linked to wiki pages with more detail on, for example, selectivity. Chemists enumerate potential libraries with ChemAxon’s Reactor, using customised nodes that can contain multi-step workflows.
Standardised KNIME nodes call back-end services on a high-performance computing grid to enable computationally intensive calculations (e.g., logP and ROCS scores), with result sets pushed back to the user on reconnection. Poses can be viewed in OpenEye’s Visual Interface to Drug design Applications (VIDA) or exported to Spotfire. ChemAxon chemical fingerprints are used in a “cluster” node that allows different clustering methods.
Exporting products to Spotfire is a two-step process: the “export for Spotfire” node, and the “publish for Spotfire” node that launches the application. Selections are made in Spotfire, and new nodes with the selected products and reactants are returned to KNIME. In another node, the stereochemical codes needed for registration are assigned based on structure. The final node publishes a library design plan containing separate SDfiles for the products and reactants, along with a .csv file listing how many times each reactant is used. The zipped file is parsed on import into a chemist’s Agilent ELN. (Presentation in our Library)
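The reactant usage file mentioned above is easy to picture with a short Python sketch. The product and reactant identifiers, the column names, and the function `reactant_usage_csv` are hypothetical; the talk did not describe the actual file layout.

```python
import csv
import io
from collections import Counter

# Hypothetical enumerated products: (product id, ids of the reactants used).
products = [
    ("P1", ["A1", "B1"]),
    ("P2", ["A1", "B2"]),
    ("P3", ["A2", "B1"]),
]


def reactant_usage_csv(products):
    """Return CSV text listing how many times each reactant is used."""
    counts = Counter(r for _, reactants in products for r in reactants)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["reactant_id", "times_used"])  # assumed column names
    for reactant_id, n in sorted(counts.items()):
        writer.writerow([reactant_id, n])
    return buffer.getvalue()


usage_csv = reactant_usage_csv(products)
```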
Ádám Andor Kelemen of the Hungarian Academy of Sciences entitled his talk “Physicochemical property based scoring scheme for design of an aminergic-GPCR targeted fragment library. Fragment GPCR Score”. Several fragment-based drug discovery techniques have been validated for use with GPCRs in recent years. These approaches and their use in hit identification have been reviewed (Andrews, S. P.; Brown, G. A.; Christopher, J. A. Structure-Based and Fragment-Based GPCR Drug Discovery. ChemMedChem 2014, 9(2), 256–275).
The goal of Ádám’s work was the design of a physicochemical property based scoring method for fragment-based drug discovery. The method was developed for sorting commercially available and virtual libraries, in order to compile an aminergic-GPCR targeted fragment library. Ádám and his colleagues examined the physicochemical characteristics of GPCR-promiscuous fragments, and converted the essential parameters into “desirability functions”.
Rules for the scoring scheme were derived through data mining of ChEMBL using IJC and KNIME, and through examination of the physicochemical descriptors used in the Rule of Three and the Rule of Five using JChem for Excel. The 947,914 entries in ChEMBL GPCR-SARfari were filtered (by desired activity, salt stripping, number of heavy atoms, etc.) to 10,477. Of these, 2,370 actives had a size-independent ligand efficiency ≥ 1.95 on at least four GPCRs, and of these 2,183 were active only on aminergic-GPCRs. The inactive training set was 5,000 compounds extracted from ChEMBL (less the actives) by random sampling. Ádám displayed frequency distributions for various calculated properties and pointed out some trends. Actives, for example, have lower polar surface area (PSA) and fewer rotatable bonds. His aminergic-GPCR-active fragments were a set of small, rigid molecules containing few heteroatoms, most of which are basic nitrogens. Each desirability function maps the value of a property onto a score in the range of 0 to 1. The Fragment Aminergic GPCR Score (FrAGS), ranging from 0 to 6, is the sum of the desirability functions.
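A score of this kind can be sketched in a few lines of Python. Everything specific here is an assumption: the report does not say which six properties were used or what shape the desirability functions had, so the property names, thresholds, and the piecewise-linear “lower is better” form below are hypothetical placeholders.

```python
def falling_desirability(value, full, zero):
    """Return 1.0 at or below `full`, 0.0 at or above `zero`, linear in between."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


# Hypothetical profile: property name -> (value scoring 1.0, value scoring 0.0).
PROFILE = {
    "polar_surface_area": (40.0, 80.0),
    "rotatable_bonds": (2.0, 6.0),
    "heavy_atoms": (16.0, 22.0),
    "hetero_atoms": (2.0, 5.0),
    "h_bond_donors": (1.0, 3.0),
    "ring_count": (2.0, 4.0),
}


def fragment_score(props):
    """Sum six desirability values, giving a score between 0 and 6."""
    return sum(
        falling_desirability(props[name], full, zero)
        for name, (full, zero) in PROFILE.items()
    )


# A fragment that meets every "full score" threshold, and one with higher PSA.
ideal = {name: full for name, (full, _) in PROFILE.items()}
polar = dict(ideal, polar_surface_area=60.0)  # PSA halfway between 40 and 80

ideal_score = fragment_score(ideal)
polar_score = fragment_score(polar)
```

With these placeholder thresholds the ideal fragment scores 6.0, and raising PSA to the midpoint of its ramp costs half a point.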
Validation was carried out on ChEMBL, PubChem, and two datasets from Richter Gedeon (a fragment screening set and a high-throughput screening set). Ádám showed enrichment factors and receiver operating characteristic (ROC) curves for the various datasets. The enrichment factor rose to about 3 or more at scores above 5 and then fluctuated. In future Ádám hopes to compile an in-house fragment library, use FrAGS to design a GPCR-targeted fragment library, and supplement the in-house library with commercially available fragments and by synthesis. (Presentation in our Library)
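For readers unfamiliar with enrichment factors, the following Python sketch uses one common definition: the hit rate among compounds at or above a score threshold, divided by the hit rate in the whole dataset. The talk's exact definition was not stated, and the scores and activity labels below are invented for illustration.

```python
def enrichment_factor(scores, labels, threshold):
    """Hit rate among compounds scoring >= threshold, relative to the overall hit rate."""
    selected = [active for score, active in zip(scores, labels) if score >= threshold]
    if not selected:
        raise ValueError("no compounds at or above the threshold")
    if not any(labels):
        raise ValueError("dataset contains no actives")
    hit_rate_selected = sum(selected) / len(selected)
    hit_rate_overall = sum(labels) / len(labels)
    return hit_rate_selected / hit_rate_overall


# Invented example: eight compounds, three of them active.
scores = [5.5, 5.2, 4.0, 3.0, 2.0, 1.0, 5.1, 0.5]
labels = [True, True, False, False, True, False, False, False]

ef_at_5 = enrichment_factor(scores, labels, threshold=5.0)
ef_at_0 = enrichment_factor(scores, labels, threshold=0.0)
```

Here two of the three compounds scoring at least 5 are active, against three of eight overall, so the enrichment factor at that threshold is (2/3)/(3/8) = 16/9, or about 1.8. Selecting everything (threshold 0) gives exactly 1, that is, no enrichment.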
Stefan Höck and Rainer Riedl of Zurich University of Applied Sciences described CyBy2, a rich client application for the storage and retrieval of chemical structures and biological activity in an underlying database that facilitates structure-based data management and visualisation of SAR. The CyBy2 server uses JChem, the Derby database, and Akka middleware. The client uses Marvin Beans for drawing and displaying molecules. Its modular design (based on the NetBeans platform) makes it easily expandable. It allows grouping of compounds in projects, linking of additional data files with compounds, linking of biological assay data with compounds, and fine-grained control over user access rights. Details have been published (Höck, S.; Riedl, R. CyBy2: A Structure-based Data Management Tool for Chemical and Biological Data. Chimia 2012, 66(3), 132–134).
CyBy2 was written in a purely functional style in a modern programming language (Scala). Stefan and Rainer have shown how typical calculations required in chemical applications can be written with a drastically reduced amount of code, thanks to the increase in abstraction and type-safety gained by applying typical concepts from functional programming (Höck, S.; Riedl, R. Chemf: A purely functional chemistry toolkit. J. Cheminf. 2012, 4, 38). They have also shown how the resulting referentially transparent functions and immutable data types can be used at no risk in all parts of client code, including multi-threaded algorithms. They have compared their code with existing toolkits, and shown that it is superior to a typical object-oriented toolkit in terms of type-safety, flexibility and thread safety. They experienced a drastic increase in productivity when programming in a purely functional way. Typically, code that compiled ran as expected and passed all unit tests at the first attempt. (Presentation in our Library)
Céline Labbé of the Institut National de la Santé et de la Recherche Médicale (INSERM) talked about iPPI-DB, a Web application to query a database of protein-protein interaction inhibitors. A paradigm shift is needed in our way of designing chemical libraries aimed at PPI targets. Learning from known successful examples of PPI modulation with small non-peptide inhibitors should help in the discovery of new chemical entities in a more systematic manner, and allow the scientific community to derive general trends such as sets of appropriate physicochemical properties and privileged chemotypes. So, Céline and her colleagues have created iPPI-DB, a database of the structures, physicochemical properties, and pharmacological data (biochemical and/or cellular binding data) of 1,650 small non-peptide inhibitors across 13 families of PPI targets (Labbé, C. M.; Laconde, G.; Kuenemann, M. A.; Villoutreix, B. O.; Sperandio, O. iPPI-DB: a manually curated and interactive database of small non-peptide inhibitors of protein–protein interactions. Drug Discovery Today 2013, 18(19-20), 958–968). The data were extracted from the literature and manually curated by experts. As an online service to the PPI community, the team developed a Web application that can be accessed by anyone from the website of the INSERM technological platform CDithem.
The uniqueness of iPPI-DB resides in the combination of its manual curation and the nature of its querying and visualising tools. The first way to query the database is by choosing pharmacological criteria such as the PPI target, the threshold for the activity of the compound and/or thresholds for some molecular descriptors (molecular weight, proportion of sp3 carbon, etc.). All compounds fulfilling the query criteria are displayed as a list with all annotated properties.
Recently, the team added a capability for users to sketch their own molecule in an embedded Marvin Sketch applet and to submit this molecule as the query for iPPI-DB using a similarity search based on ECFP4 or FCFP4 fingerprints (powered by JChem v6.1). The results of such a query are displayed in the same way as those of pharmacological queries for the five most similar iPPI-DB compounds, based on the type of fingerprints chosen by the user. These results are preceded by a reminder of the input molecule structure using the Marvin View applet, its physicochemical profile (a radar chart showing nine molecular descriptors calculated with JChem), and the molecule’s compliance with Lipinski’s, Veber’s and Pfizer’s 3/75 rules. It is hoped that iPPI-DB and its Web application will assist chemists, biologists and clinicians to design the next generation of PPI modulators more rationally. (Presentation in our Library)
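The “five most similar compounds” query can be sketched as follows. This is not the iPPI-DB or JChem implementation: the report does not name the similarity metric, so the Tanimoto coefficient (the usual choice for such fingerprints) is an assumption, and the fingerprints are represented simply as hypothetical sets of “on” bit positions rather than real ECFP4 or FCFP4 output.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def most_similar(query, database, k=5):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, tanimoto(query, bits)) for name, bits in database.items()]
    # Sort by descending similarity, breaking ties by name for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
database = {
    "c1": frozenset({1, 2, 3, 4}),
    "c2": frozenset({1, 2, 3}),
    "c3": frozenset({1, 2}),
    "c4": frozenset({5, 6}),
    "c5": frozenset({1, 2, 3, 4, 5}),
    "c6": frozenset({4, 7}),
}
query = frozenset({1, 2, 3, 4})

top_five = most_similar(query, database)
```

With these toy fingerprints the ranking is c1 (1.0), c5 (0.8), c2 (0.75), c3 (0.5), and c6 (0.2); c4 shares no bits with the query and is dropped.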
Alex Drijver concluded the meeting. He said that a customer had asked him repeatedly to supply ChemAxon’s roadmap strategy. Alex hoped that the product presentations at the meeting had provided a partial answer. There are answers to the roadmap question at three levels. The first level is “ownership”: is ChemAxon a long-term partner and does it have financial stability? ChemAxon remains, and is proud to be, a privately held company. The strategy of the owners is to build a long-term sustainable business; this seems to be coming along nicely, since the company is 16 years old already. Further evidence is that the company could have been sold at any time in the last few years (offers have been received) but was not. ChemAxon is financially independent, and has not taken on investment or loans, but churns the surplus cash back into the company. The company is fully in charge of its own destiny.
The second level is “market”. The market for synthesis instruments is a dying one and companies such as Biotage and CEM have had to change tack. ChemAxon has branched out a little into agrochemicals, flavours and fragrances, and petrochemicals, but not into other markets in general. It is moving a little into biologics (with registration) but is mainly still focused on early stage pharma research. The company is getting more involved in business model personas, e.g., “I’m a medicinal chemist and I need to…”. There is plenty of growth in the market still.
The third level is the product level: what will the product be like in 3-5 years’ time? In the past, if you asked for a feature, ChemAxon used to work out the cost of doing it and then do it. Doing this you end up with a messy code base. Now ChemAxon asks you why you want to do it. What is the problem? Plexus is a good example of starting from zero. Having just a back-end is not a sustainable business. ChemAxon spoke to 200 people and asked them what their problems were. From that came the Plexus roadmap. Plexus is a large part of ChemAxon’s future. This is delving into end user space. It is now a good solution for the enterprise, with a robust back-end on which to build. Now ChemAxon is ready to build the interface.
Alex ended by thanking everyone: guest speakers from 8 or 9 countries, the ChemAxon team, the marketing and administrative teams, and especially the users. ChemAxon appreciates the time investment that users put into this meeting. The company needs and loves feedback from users.
In my September 2013 conclusion I said that ChemAxon was probably the market leader in mainstream chemical structure handling; we have now nearly reached the time when the word “probably” can be removed from that statement. Although there are companies that can compete with parts of the ChemAxon portfolio, and some of them perhaps perform better in a certain niche, there is no single company that offers the same breadth of solutions or equal value for money. ChemAxon is still small enough to turn around fast in response to changes in user demand (like the agile, small motor boat that one speaker mentioned), yet it is now large enough to command credibility and promise long-term sustainability and financial stability. I have previously mentioned the dangers of undertaking a project as ambitious as that of Plexus, but ChemAxon is responding to new competitive forces and knows that users now want not just a tool kit but off-the-shelf software, and solutions aimed at the benchtop end-user. Partnership has been a very successful ploy and the ChemAxon consultancy group has paid dividends. Cloud computing and hosting are big news now and ChemAxon has not failed to notice. I am pleased that I have already been invited to attend one of next year’s meetings: pleased not just because I find the meetings very useful and enjoyable, with lots of networking opportunities, but also because it is fascinating to watch the changes in this company over the years. Long may it continue its climb up the S-curve and long may I be the rapporteur!