Chemical Data Extraction

Extract chemical information from documents

A document of chemical importance, such as a chemical patent or a journal article, can run to many pages. Locating a particular chemical structure within it can take a scientist a long time, especially when the structure appears only as text. ChemAxon's core technology for chemical data extraction, 'Document to Structure', is designed to solve this problem. It retrieves chemical structures from the most popular file formats. Each structure is returned with its location in the document and with relevant information about its original form (IUPAC name, image file, SMILES, etc.). The extracted data can be exported to a database or passed to a number of other ChemAxon software products for further analysis. This text mining capability can save scientists hours when reading, analyzing, and understanding a chemical document.
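To make the idea of "a structure returned with its location" concrete, here is a minimal Python sketch. It is not ChemAxon's implementation or API: the `Hit` record, the `extract_hits` function, and the tiny `NAME_TO_SMILES` table are all hypothetical, and a fixed dictionary lookup stands in for real name-to-structure conversion, which parses chemical nomenclature instead of matching a word list.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical lookup table, for illustration only. A real converter
# interprets nomenclature rather than relying on a fixed dictionary.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "ethanol": "CCO",
}


@dataclass(frozen=True)
class Hit:
    text: str    # the name exactly as it appears in the document
    smiles: str  # the structure the name maps to
    start: int   # character offset of the first character of the name
    end: int     # character offset just past the last character


def extract_hits(document: str) -> list[Hit]:
    """Return every known chemical name in the text with its position."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, NAME_TO_SMILES)) + r")\b",
        re.IGNORECASE,
    )
    return [
        Hit(m.group(0), NAME_TO_SMILES[m.group(0).lower()], m.start(), m.end())
        for m in pattern.finditer(document)
    ]
```

Because each hit carries `start` and `end` offsets, a viewer could highlight the original name in place, and `document[hit.start:hit.end]` always recovers the text the structure came from.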


The extraction of chemical data relies on the underlying chemical name and structure conversion technology. This makes the Document to Structure tool capable of recognizing and converting various types of chemical information.

Extracting and converting chemical information to chemical structure

Document to Structure applies Optical Structure Recognition (OSR) technology to convert images to chemical structures. CLiDE, OSRA, and Imago are currently supported; using OSR technology might require a separate license. Structure images are distinguished from non-structure images (e.g. an IC50 plot) to reduce noise in the results. Once a structure is converted, its location is returned, making it easy to find the chemical structure in the document. These features support text mining, patent analysis, and internal document management.


Document to Structure supports a wide range of document formats:

  • TXT, HTML and XML files
  • MS Office documents (DOC, DOCX, PPT, PPTX, XLS, XLSX)
  • Chemical structure objects embedded in MS Office documents (drawn with Marvin, ChemDraw, Symyx Draw, etc.)
  • OpenOffice ODT files
  • PDF files, including non-searchable PDFs in image format
  • Image formats (PNG, JPEG, TIFF, BMP, etc.)
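As a rough illustration of how a batch script might sort incoming files by these formats before extraction, the Python sketch below maps file extensions to processing routes. The grouping and the route names (`text`, `office`, `pdf`, `image`) are assumptions made for this example; they do not describe how Document to Structure handles formats internally.

```python
from pathlib import Path

# Assumed grouping derived from the format list above; route names are hypothetical.
ROUTES = {
    "text": {".txt", ".html", ".xml"},
    "office": {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".odt"},
    "pdf": {".pdf"},
    "image": {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"},
}


def route_for(filename: str) -> str:
    """Pick a processing route from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    for route, suffixes in ROUTES.items():
        if suffix in suffixes:
            return route
    return "unsupported"
```

In a setup like this, files routed to `image` (and possibly image-only PDFs) would be the ones that need an OSR step, while the other routes carry text or embedded structure objects.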

Read more about Document to Structure



As a core technology, Document to Structure enables chemical data extraction in various other ChemAxon software products. The following applications give you access to Document to Structure capabilities:

  • Desktop solutions, such as ChemCurator for deeper analysis of a long chemical patent or other document; Instant JChem for collecting, managing, and further analyzing chemical data, even in collaboration with peers; JChem for Office for extracting chemical data into MS Excel and using it in a research workflow; or Marvin for simple visualization of the extracted chemical structures
  • Web-based applications, such as ChemLocator for finding chemistry in files stored on your machine or in the cloud (for example Google Drive or Dropbox), or Plexus Suite for gathering all chemical data in an online interface for further analysis
  • Workflow tools, such as KNIME and Pipeline Pilot

These applications provide simple or complex functionality for directly opening and searching the chemical information in documents. Many of them offer additional features to support your data extraction workflows, and many are also available as hosted services.

Document to Structure is also available as a command-line tool for batch processing, and as a Java or .NET API for custom-developed systems. Read more about command line usage and APIs