Introduction

Machine learning for information extraction, data analysis and predictions

In healthcare, patient data contains recorded signals for instance, electrocardiogram (ECG), images, and videos. Healthcare providers have barely managed to convert such healthcare data into EHRs. Efforts are underway to digitize patient-histories from pre-EHR era notes and supplement the standardization process by turning static images into machine-readable text. For example, optical character recognition (OCR) software is one such approach that can recognize handwriting as well as computer fonts and push digitization. Such unstructured and structured healthcare datasets have untapped wealth of information that can be harnessed using advanced AI programs to draw critical actionable insights in the context of patient care. In fact, AI has emerged as the method of choice for big data applications in medicine. This smart system has quickly found its niche in decision making process for the diagnosis of diseases. Healthcare professionals analyze such data for targeted abnormalities using appropriate ML approaches. ML can filter out structured information from such raw data.


Extracting information from EHR datasets

Emerging ML or AI based strategies are helping to refine healthcare industry's information processing capabilities. For example, natural language processing (NLP) is a rapidly developing area of machine learning that can identify key syntactic structures in free text, help in speech recognition and extract the meaning behind a narrative. NLP tools can help generate new documents, like a clinical visit summary, or to dictate clinical notes. The unique content and complexity of clinical documentation can be challenging for many NLP developers. Nonetheless, we should be able to extract relevant information from healthcare data using such approaches as NLP.

AI has also been used to provide predictive capabilities to healthcare big data. For example, ML algorithms can convert the diagnostic system of medical images into automated decision-making. Though it is apparent that healthcare professionals may not be replaced by machines in the near future, yet AI can definitely assist physicians to make better clinical decisions or even replace human judgment in certain functional areas of healthcare.


Image analytics

Some of the most widely used imaging techniques in healthcare include computed tomography (CT), magnetic resonance imaging (MRI), X-ray, molecular imaging, ultrasound, photo-acoustic imaging, functional MRI (fMRI), positron emission tomography (PET), electroencephalography (EEG), and mammograms. These techniques capture high definition medical images (patient data) of large sizes. Healthcare professionals like radiologists, doctors and others do an excellent job in analyzing medical data in the form of these files for targeted abnormalities. However, it is also important to acknowledge the lack of specialized professionals for many diseases. In order to compensate for this dearth of professionals, efficient systems like Picture Archiving and Communication System (PACS) have been developed for storing and convenient access to medical image and reports data. PACSs are popular for delivering images to local workstations, accomplished by protocols such as digital image communication in medicine (DICOM). However, data exchange with a PACS relies on using structured data to retrieve medical images. This by nature misses out on the unstructured information contained in some of the biomedical images. Moreover, it is possible to miss an additional information about a patient's health status that is present in these images or similar data. A professional focused on diagnosing an unrelated condition might not observe it, especially when the condition is still emerging. To help in such situations, image analytics is making an impact on healthcare by actively extracting disease biomarkers from biomedical images. This approach uses ML and pattern recognition techniques to draw insights from massive volumes of clinical image data to transform the diagnosis, treatment and monitoring of patients. It focuses on enhancing the diagnostic capability of medical imaging for clinical decision-making.

A number of software tools have been developed based on functionalities such as generic, registration, segmentation, visualization, reconstruction, simulation and diffusion to perform medical image analysis in order to dig out the hidden information. For example, Visualization Toolkit is a freely available software which allows powerful processing and analysis of 3D images from medical tests, while SPM can process and analyze 5 different types of brain images (e.g. MRI, fMRI, PET, CT-Scan and EEG). Other software like GIMIAS, Elastix, and MITK support all types of images. Various other widely used tools and their features in this domain are listed in Table 1. Such bioinformatics-based big data analysis may extract greater insights and value from imaging data to boost and support precision medicine projects, clinical decision support tools, and other modes of healthcare. For example, we can also use it to monitor new targeted-treatments for cancer.

Table 1 Bioinformatics tools for medical image processing and analysis

Tools/softwares VTK ITK DTI-TK ITK-Snap FSL SPM NiftyReg NiftySeg NifttSim NiftRec ANTS GIMIAS elastix MIA MITK Camino OsiriX MRIcron IMOD
Input image support
 MRI   x x   x x x x x x x x x     x x x x
 Ultrasound     x       x x x x   x x            
 X-ray     x       x x x     x x            
 fMRI         x x         x x x     x   x  
 PET     x     x           x x            
 CT-Scan           x       x   x x            
 EEG     x     x           x x            
 Mammogram                       x x     x      
 Graphical user interface x x x x x x x x x x x x x x x x x x x
Functions
 Generic x x     x x   x     x x x     x   x  
 Registration   x x     x x x     x x x   x x x   x
 Segmentation   x       x   x x x   x x x x x x    
 Visualization x     x     x   x x   x x   x x x x x
 Reconstruction x   x x x x     x x x x x x         x
 Simulation x   x x x       x x   x x x          
 Diffusion x   x     x       x   x x   x x      


Big data from omics

The big data from "omics" studies is a new kind of challenge for the bioinformaticians. Robust algorithms are required to analyze such complex data from biological systems. The ultimate goal is to convert this huge data into an informative knowledge base. The application of bioinformatics approaches to transform the biomedical and genomics data into predictive and preventive health is known as translational bioinformatics. It is at the forefront of data-driven healthcare. Various kinds of quantitative data in healthcare, for example from laboratory measurements, medication data and genomic profiles, can be combined and used to identify new meta-data that can help precision therapies. This is why emerging new technologies are required to help in analyzing this digital wealth. In fact, highly ambitious multimillion-dollar projects like "Big Data Research and Development Initiative" have been launched that aim to enhance the quality of big data tools and techniques for a better organization, efficient access and smart analysis of big data. There are many advantages anticipated from the processing of 'omics' data from large-scale Human Genome Project and other population sequencing projects. In the population sequencing projects like 1000 genomes, the researchers will have access to a marvelous amount of raw data. Similarly, Human Genome Project based Encyclopedia of DNA Elements (ENCODE) project aimed to determine all functional elements in the human genome using bioinformatics approaches. Here, we list some of the widely used bioinformatics-based tools for big data analytics on omics data.

  1. SparkSeq is an efficient and cloud-ready platform based on Apache Spark framework and Hadoop library that is used for analyses of genomic data for interactive genomic data analysis with nucleotide precision

  2. SAMQA identifies errors and ensures the quality of large-scale genomic data. This tool was originally built for the National Institutes of Health Cancer Genome Atlas project to identify and report errors including sequence alignment/map [SAM] format error and empty reads.

  3. ART can simulate profiles of read errors and read lengths for data obtained using high throughput sequencing platforms including SOLiD and Illumina platforms.

  4. DistMap is another toolkit used for distributed short-read mapping based on Hadoop cluster that aims to cover a wider range of sequencing applications. For instance, one of its applications namely the BWA mapper can perform 500 million read pairs in about 6 h, approximately 13 times faster than a conventional single-node mapper.

  5. SeqWare is a query engine based on Apache HBase database system that enables access for large-scale whole-genome datasets by integrating genome browsers and tools.

  6. CloudBurst is a parallel computing model utilized in genome mapping experiments to improve the scalability of reading large sequencing data.

  7. Hydra uses the Hadoop-distributed computing framework for processing large peptide and spectra databases for proteomics datasets. This specific tool is capable of performing 27 billion peptide scorings in less than 60 min on a Hadoop cluster.

  8. BlueSNP is an R package based on Hadoop platform used for genome-wide association studies (GWAS) analysis, primarily aiming on the statistical readouts to obtain significant associations between genotype–phenotype datasets. The efficiency of this tool is estimated to analyze 1000 phenotypes on 106 SNPs in 104 individuals in a duration of half-an-hour.

  9. Myrna the cloud-based pipeline, provides information on the expression level differences of genes, including read alignments, data normalization, and statistical modeling.

The past few years have witnessed a tremendous increase in disease specific datasets from omics platforms. For example, the ArrayExpress Archive of Functional Genomics data repository contains information from approximately 30,000 experiments and more than one million functional assays. The growing amount of data demands for better and efficient bioinformatics driven packages to analyze and interpret the information obtained. This has also led to the birth of specific tools to analyze such massive amounts of data. Below, we mention some of the most popular commercial platforms for big data analytics.