Managing chemical structure data in a biologically driven research consortium - can it be FAIR?
The Library of Integrated Network-based Cellular Signatures (LINCS) program is a national research consortium funded by the NIH to generate an extensive reference library of cell-based perturbation-response signatures, along with novel data analytics tools to improve our understanding of human diseases at the systems level. In contrast to other large-scale data generation efforts, the LINCS project employs a wide range of assay technologies that catalog diverse cellular responses. In our role as the Data Coordination and Integration Center, we have developed infrastructure and processes to handle the entire data pipeline, from receipt of data and metadata through registration, annotation, and mapping to external reference resources, to publication via the LINCS Data Portal (LDP), with the goal of making these resources Findable, Accessible, Interoperable and Reusable (FAIR). Chemical structure search and management in LDP are powered by the ChemAxon JChem PostgreSQL Cartridge. Additional ChemAxon tools are used for curating, standardizing, and processing chemical structure data as well as for computing various descriptors. Although chemical compound registration based on chemical structure identity is well established, challenges arise when submitted chemical structures are ambiguous, missing, or incorrect. Validating chemical structures and detecting and correcting mistakes can be hampered by insufficient data and by errors in established reference resources, which are often propagated and sometimes originate at the supplier. In a loosely organized research consortium, further challenges include differing operational informatics capabilities and QC practices among the participating groups, and the tendency to identify chemical structures by common names. This presentation will describe our chemical structure processing pipelines and infrastructure, which attempt to address these challenges and result in high-quality, richly annotated, curated Small Molecule Landing Pages in the LINCS Data Portal.
Although this work is still ongoing, it provides an initial estimate of the potential error rate in the chemical structures originally reported with various biological profiling data. Care must be taken to correct such errors and to avoid their further propagation in the research community.