Original software publication | Volume 16, 100493, May 2023

Hygieia: AI/ML pipeline integrating healthcare and genomics data to investigate genes associated with targeted disorders and predict disease

  • William DeGroat
    Footnotes
    1 Equally contributing first authors.
    Affiliations
    Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson Street, New Brunswick, NJ, USA
  • Vignesh Venkat
    Footnotes
    1 Equally contributing first authors.
    Affiliations
    Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson Street, New Brunswick, NJ, USA
  • Widnie Pierre-Louis
    Affiliations
    Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson Street, New Brunswick, NJ, USA
  • Habiba Abdelhalim
    Affiliations
    Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson Street, New Brunswick, NJ, USA
  • Zeeshan Ahmed
    Correspondence
    Correspondence to: Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson Street, New Brunswick, 08901, NJ, USA.
    Affiliations
    Institute for Health, Health Care Policy and Aging Research, Rutgers University, 112 Paterson Street, New Brunswick, NJ, USA

    Department of Medicine, Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, 125 Paterson Street, New Brunswick, NJ, USA
Open Access | Published: March 14, 2023 | DOI: https://doi.org/10.1016/j.simpa.2023.100493

      Highlights

      • We present Hygieia, an AI/ML pipeline to predict disease using genomics and healthcare data.
      • We have tested Hygieia and evaluated its performance using a variety of experimental datasets.
      • Portable AI/ML pipelines like Hygieia are much needed today to support clinical diagnostics and decision-making processes.

      Abstract

      Owing to advancements in sequencing technologies, genomics data are growing at an unmatched pace and scale, fostering translational research. Over ten million genomics datasets were produced and publicly shared in 2022. Genome-wide association studies (GWAS) have remarkably assisted in understanding the genetic basis of human disease by uncovering millions of loci associated with various complex phenotypes. However, GWAS are unable to predict disease or to detect all the heritability explained by single nucleotide polymorphisms (SNPs), and can only target specific variants. The appropriate use of artificial intelligence (AI) and machine learning (ML) techniques can accelerate our ability to leverage and extend the information contained within the original data, and to model patient-specific genomics data against publicly available annotation repositories to understand how coding and non-coding genomic variations are connected to disease mechanisms. The grand challenge is the assimilation of genetics into precision medicine that translates across different ancestries, diverse diseases, and other distinct populations through the implementation of effective AI/ML methods. We present Hygieia, the first AI/ML-ready pipeline integrating genomics and clinical data to investigate genes associated with targeted disorders and predict disease with high accuracy. Hygieia can handle a broad range of dataset sizes with heterogeneous levels of granularity, and offers a supervised approach to analyzing integrated gene expression and multivariate clinical data. It includes a Random Forest-based model for regression analysis and makes predictions without hyperparameter tuning. We trained and tested our model across various disorders using diverse datasets. Hygieia is an open-source, simple-to-use pipeline that does not require a strong computational background to execute.

      Code metadata
      Table 1
      Current Code Version: Hygieia v1.0.2
      Permanent Link to Repository: https://github.com/SoftwareImpacts/SIMPAC-2023-36
      Reproducible Capsule: https://codeocean.com/capsule/7964745/tree/v1
      Legal License: GNU General Public License (GPL)
      Code Versioning System: Git
      Software Code Languages: Python 3.10.9
      Compilation Requirements, Dependencies: pandas, scikit-learn, matplotlib, seaborn
      Support email for questions: [email protected]

      1. Introduction

      Precision and genomic medicine is driven by the paradigm shift of empowering clinicians to predict the most appropriate course of action for patients with complex diseases and to improve routine medical and public health practice [Ahmed, 2022; Ahmed, 2020]. However, the demands of the healthcare field are becoming increasingly complex, as there is a delicate balance between timely medical interventions and concise, yet effective, patient care plans. The available methods go beyond hospitalization, as the scope of preventative health measures includes disease prevention, early detection, and delayed progression [Ahmed et al., 2020]. The goal for acute patients is to provide a substantial turnaround in patient conditions.
      The appropriate use of artificial intelligence (AI) and machine learning (ML) can accelerate our ability to leverage and extend the information contained within the original data, and to model patient-specific genomics data against publicly available annotation repositories to understand how coding and non-coding genomic variations are connected to disease mechanisms [Buch et al., 2018; Vadapalli et al., 2022]. The exponential growth of the ML and AI fields has made it possible to apply their computational power to clinical and translational data and so revolutionize how we understand, diagnose, and treat heritable disorders [Ahmed et al., 2020].
      Next-generation sequencing (NGS) has developed rapidly since its inception in the early 2000s and has allowed us to produce and access large amounts of biological data for in-depth computational analysis and variant discovery [Ahmed, 2022; Buch et al., 2018]. The convergence of genomics data and rapid developments in AI/ML has the potential to improve the recovery process through diagnostic and predictive analysis that identifies major causes of mortality, modifiable risk factors, and actionable information supporting early detection and prevention of targeted disorders and sequelae [Abdelhalim et al., 2022].
      Genome-wide association studies (GWAS) have remarkably assisted in understanding the genetic basis of human disease by uncovering millions of loci associated with various complex phenotypes [Visscher et al., 2017]. However, GWAS are unable to predict disease or to detect all the heritability explained by single nucleotide polymorphisms (SNPs), and can only target specific variants [Tam et al., 2019]. AI/ML algorithms can help address these problems. In this study, we propose a new AI/ML pipeline, Hygieia, which integrates healthcare and genomics data to investigate genes associated with targeted disorders and predict disease.

      2. Software description

      The methodology proposed in Hygieia is based on a review study conducted and published by the Ahmed Lab [Vadapalli et al., 2022]. Hygieia applies feature selection and visualization methods to the user’s imported data, producing results that show the correlation between factors and a diagnosis in an understandable fashion. Hygieia’s core comprises two functions, each encompassing a distinct set of methods that the user may choose to employ. Feature selection methods depend on scikit-learn, while visualization relies on matplotlib and seaborn. pandas handles data management for the entire pipeline.
      ‘clinical_feature_selection’ handles patients’ non-genetic characteristics. It offers three routes to explore this information, each of which outputs a separate file:
      • Random Forest Regression averages the outcomes of multiple decision trees to score each feature’s significance. Hygieia uses an ‘80/20’ train/test split by default. It is important to note that predictions are made without hyperparameter tuning [Barba et al., 2014].
      • SelectKBest is applied with Pearson’s chi-squared test as its scoring function. This approach has its roots in classical statistics and provides a contrast to Random Forest Regression.
      • Swarm plots serve as a visualization, allowing users to see where patients with a diagnosis cluster. As many clinical features are non-continuous, acceptable labels within those categories are converted to a numeric form per electronic health record standards.
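      The first two routes above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not Hygieia's actual code: the helper name `rank_clinical_features`, the synthetic data, and the column names are hypothetical. Only the 80/20 split, the untuned Random Forest Regression, and SelectKBest with the chi-squared test come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split


def rank_clinical_features(features: pd.DataFrame, diagnosis: pd.Series, seed: int = 0) -> pd.DataFrame:
    """Score each clinical feature two ways (hypothetical helper, not Hygieia's API)."""
    # 80/20 train/test split, as described in the text.
    X_train, X_test, y_train, y_test = train_test_split(
        features, diagnosis, test_size=0.2, random_state=seed
    )
    # Default hyperparameters: no tuning is performed.
    forest = RandomForestRegressor(random_state=seed)
    forest.fit(X_train, y_train)

    # chi2 requires non-negative feature values (e.g. encoded categories, ages).
    selector = SelectKBest(score_func=chi2, k="all")
    selector.fit(features, diagnosis)

    return pd.DataFrame(
        {
            "rf_importance": forest.feature_importances_,
            "chi2_score": selector.scores_,
            "chi2_p_value": selector.pvalues_,
        },
        index=features.columns,
    ).sort_values("rf_importance", ascending=False)


# Synthetic, illustrative data only: numerically encoded clinical features.
rng = np.random.default_rng(0)
n = 200
clinical = pd.DataFrame(
    {
        "age": rng.integers(20, 90, size=n),
        "gender": rng.integers(0, 2, size=n),
        "race": rng.integers(0, 5, size=n),
    }
)
# Diagnosis (0/1) is made to depend on age so that one feature is informative.
noisy_age = clinical["age"] + rng.normal(0, 5, size=n)
diagnosis = (noisy_age > 60).astype(int).rename("diagnosis")

scores = rank_clinical_features(clinical, diagnosis)
print(scores)
```

      Reporting both scores side by side reflects the contrast drawn above between a tree-based importance measure and a classical statistical test; how Hygieia itself formats its output files is not specified here.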
      ‘genomic_feature_selection’ is responsible for the bulk of the workload, as it conducts gene expression analysis. It contains four methods for feature selection and visualization, each of which outputs a separate file:
      • Random Forest Regression is paired with Recursive Feature Elimination with Cross-Validation (RFECV) [Rigatti, 2017]. The initial train/test split remains at the standard ‘80/20’, but is adjusted during the algorithm’s runtime to better fit the data. This serves to limit overfitting.
      • SelectKBest is applied with Pearson’s chi-squared test. This approach is unchanged from its implementation for clinical features.
      • Swarm plots serve as a visualization, allowing users to see where patients with a diagnosis cluster. Genomic features are all continuous.
      • Heatmaps serve as a visualization, allowing users to gain insight into how genomic factors are correlated with each other. This visualization is generated from a correlation matrix computed from the input data.
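      As a rough sketch of the first and last methods in this list, the snippet below pairs a Random Forest regressor with scikit-learn's `RFECV` and draws a heatmap from a pandas correlation matrix. The gene names and expression values are synthetic, and `n_estimators=50` is chosen here only to keep the example fast; none of these specifics are taken from Hygieia's source.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# Synthetic expression matrix: rows are patients, columns are hypothetical genes.
rng = np.random.default_rng(1)
n_patients = 120
genes = [f"GENE_{i}" for i in range(8)]
expression = pd.DataFrame(
    rng.normal(5.0, 1.0, size=(n_patients, len(genes))), columns=genes
)
# Diagnosis is driven by the first two genes, so they should survive elimination.
signal = expression["GENE_0"] + expression["GENE_1"]
diagnosis = (signal > signal.median()).astype(int)

# Initial 80/20 split; the test portion is held out from feature selection.
X_train, X_test, y_train, y_test = train_test_split(
    expression, diagnosis, test_size=0.2, random_state=1
)

# Recursive feature elimination: 5-fold cross-validation re-splits the training
# data internally to choose how many features to keep.
selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=1),
    step=1,
    cv=5,
)
selector.fit(X_train, y_train)
selected_genes = list(X_train.columns[selector.support_])
print("Selected genes:", selected_genes)

# Heatmap of pairwise Pearson correlations between genes.
correlation = expression.corr()
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(correlation, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Gene-gene correlation (synthetic data)")
fig.tight_layout()
```

      Cross-validated elimination picks the number of retained genes from held-out folds rather than from a single split, which is one way to read the text's remark that the procedure limits overfitting.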
      Fig. 1 demonstrates data flow and modalities. Fig. 1A describes the types of input data, which include features (age, race, gender, and diagnosis) extracted from Electronic Health Records (EHR), and gene expression data computed from RNA-seq-driven transcriptomic data using GVViZ [Ahmed et al., 2021]. Fig. 1B depicts the two functions and their respective feature selection methods. Fig. 1C lists the possible outputs, which include the available visualizations and exported text results.
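      To make the input stage concrete, the following sketch joins an EHR extract with a gene expression table using pandas and converts categorical labels to numbers. All identifiers, column names, and values are hypothetical, and the integer codes are arbitrary rather than drawn from any electronic health record standard; Hygieia's actual input format is described in its supplementary material.

```python
import pandas as pd

# Hypothetical EHR extract: one row per patient.
ehr = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3", "P4"],
        "age": [54, 67, 43, 71],
        "race": ["White", "Black", "Asian", "White"],
        "gender": ["F", "M", "F", "M"],
        "diagnosis": ["CVD", "Control", "Control", "CVD"],
    }
)

# Hypothetical gene expression table, one row per patient.
expression = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3", "P5"],
        "GENE_A": [12.1, 3.4, 2.9, 8.8],
        "GENE_B": [0.5, 0.7, 0.4, 0.6],
    }
)

# Keep only patients present in both sources; fail loudly on duplicate IDs.
merged = ehr.merge(expression, on="patient_id", how="inner", validate="one_to_one")

# Encode the outcome explicitly: 1 for the disease of interest, 0 otherwise.
merged["diagnosis"] = (merged["diagnosis"] == "CVD").astype(int)

# Convert the remaining categorical labels to integer codes (alphabetical order).
for column in ["race", "gender"]:
    merged[column] = merged[column].astype("category").cat.codes

print(merged)
```

      An inner join drops patients who lack either clinical or expression data (P4 and P5 here), so downstream feature selection sees only complete records.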
      Fig. 1. Hygieia’s input, methodology, workflow, and outcomes.
      Example output visualizations are provided in Fig. 2, which shows the kinds of graphs Hygieia can yield, such as bar graphs (Fig. 2A), swarm plots (Fig. 2B), and heatmaps (Fig. 2C). The supplementary material attached to this publication offers an in-depth walkthrough of Hygieia’s installation, utilization, and efficiency, including an illustrative example that can be performed using files included in Hygieia’s GitHub repository.
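      A swarm plot of the general kind described here can be produced with seaborn as sketched below. The data are synthetic and the gene name is a placeholder; the styling of Hygieia's own figures may differ.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic expression values for one hypothetical gene in two diagnostic groups.
rng = np.random.default_rng(2)
group_size = 30
data = pd.DataFrame(
    {
        "diagnosis": np.repeat(["Control", "Disease"], group_size),
        "GENE_A": np.concatenate(
            [rng.normal(4.0, 1.0, group_size), rng.normal(6.0, 1.0, group_size)]
        ),
    }
)

# Each point is one patient; points are spread sideways so they do not overlap.
fig, ax = plt.subplots(figsize=(5, 4))
sns.swarmplot(data=data, x="diagnosis", y="GENE_A", ax=ax)
ax.set_xlabel("Diagnosis")
ax.set_ylabel("GENE_A expression (synthetic)")
fig.tight_layout()
```

      Because every patient is drawn individually, a plot like this shows where diagnosed patients cluster along a feature, which a summary bar graph would hide.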

      3. Software impacts

      Genomics is heading toward an ambitious future and, over the last few decades, has changed our views on conducting rigorous bioinformatics research to study complex diseases and understand diversity across the human species. However, it still requires the development of intelligent applications that systematically incorporate multi-omics/genomics data into clinical care to deliver personalized treatment for patients at risk. Despite current progress, there are no bioinformatics tools available that can help in understanding the relationships between genotypes and phenotypes.
      AI/ML approaches have excelled in various scientific and clinical fields, including biomedical image processing and pattern recognition on scales beyond human capabilities [Pietka and Gertych, 2021]. However, well-integrated, high-volume analysis of heterogeneous genomic and clinical data using AI/ML approaches has been a challenge and has not been accomplished with high accuracy [Ahmed et al., 2020]. To address such challenges and contribute to the fields of bioinformatics and biomedical informatics, we have presented Hygieia in this study. Hygieia is an open-source, simple-to-configure, and easy-to-use AI/ML-ready pipeline. We have designed Hygieia so that users (e.g., clinicians, geneticists, biologists, and scientists) both with and without a computational background can learn and use it. Because it is open source, bioinformatics users can customize it for their own analyses and improve it further by including additional AI/ML techniques. We have tested Hygieia and evaluated its performance using a variety of experimental datasets. In addition, we have validated its methodology in a real-time study, in which we successfully investigated genes associated with cardiovascular diseases (CVD) and predicted disease with high accuracy [Venkat et al., 2023].
      Portable applications and genomics pipelines like Hygieia are much needed today, as they have the potential to support clinical diagnostics and decision-making processes by efficiently analyzing integrated multi-omics and clinical data in support of translational research and precision medicine. They can accelerate diagnostic and preventive care delivery strategies beyond traditional symptom-driven and disease-causal medical practices. Using AI/ML approaches, the human genome, transcriptome, epigenome, proteome, and metabolome can be analyzed to provide personalized treatments to patients suffering from complex and chronic disorders. Automated data analytics that uses AI/ML algorithms and trained data models can predict the probabilities of various disorders (e.g., cancer, stroke, diabetes mellitus, kidney disease, and Alzheimer’s disease) with high accuracy. Hygieia and similar approaches have the potential to accelerate our ability to use AI/ML techniques for discoveries and important breakthroughs in medical and life sciences with broad impact.

      4. Limitations & future improvements

      In terms of implementation and application, Hygieia has no technical limitations. However, in the current version of Hygieia, the input dataset is based on features extracted from healthcare and gene expression data. In the future, we look forward to integrating gene variant data, extending its analytic capabilities, and testing and validating its performance against larger genomics datasets of complex and variable disorders.

      CRediT authorship contribution statement

      William DeGroat: Programmed, Writing – review & editing. Vignesh Venkat: Designed the software, Hygieia, Writing – review & editing. Widnie Pierre-Louis: Supported the pre- and post-computational analysis, Evaluation of results and preparation of the manuscript and supplementary material, Writing – review & editing. Habiba Abdelhalim: Supported the pre- and post-computational analysis, Evaluation of results and preparation of the manuscript and supplementary material, Writing – review & editing. Zeeshan Ahmed: Proposed and led this study, Drafted and revised the manuscript, Writing – review & editing.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgments

      We appreciate the great support of the Rutgers Institute for Health, Health Care Policy, and Aging Research (IFH); the Department of Medicine, Rutgers Robert Wood Johnson Medical School (RWJMS), United States; and Rutgers Biomedical and Health Sciences (RBHS), United States, at Rutgers, The State University of New Jersey, United States. We thank the members and collaborators of the Ahmed Lab at Rutgers (IFH, RWJMS, RBHS) for their support, participation, and contribution to this study. All authors approved the version of the manuscript to be published.
      We are grateful to the Office of Advanced Research Computing (OARC) at Rutgers for providing access to the Amarel cluster and associated research computing resources that have contributed to the results reported here.

      Funding

      This study was supported by the Institute for Health, Health Care Policy and Aging Research (IFH), Rutgers Robert Wood Johnson Medical School (RWJMS), and Rutgers Biomedical and Health Sciences (RBHS) at Rutgers, The State University of New Jersey.

      Appendix A. Supplementary data

      The following is the Supplementary material related to this article.

      References

        • Ahmed Z. Precision medicine with multi-omics strategies, deep phenotyping, and predictive analysis. Progress Mol. Biol. Transl. Sci.: Precision Med. 2022; 190: 10
        • Ahmed Z. Practicing precision medicine with intelligently integrative clinical and multi-omics data analysis. Hum. Genomics. 2020; 14: 35
        • Ahmed Z., Mohamed K., Zeeshan S., Dong X. Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine. Database. 2020; 2020
        • Buch V.H., Ahmed I., Maruthappu M. Artificial intelligence in medicine: Current trends and future possibilities. Br. J. General Pract.: J. R. College General Pract. 2018; 68: 143-144
        • Vadapalli S., Abdelhalim H., Zeeshan S., Ahmed Z. Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine. Brief. Bioinform. 2022; 23: bbac191
        • Abdelhalim H., Berber A., Lodi M., Jain R., Nair A., Pappu A., Patel K., Venkat V., Venkatesan C., Wable R., Dinatale M., Fu A., Iyer V., Kalove I., Kleyman M., Koutsoutis J., Menna D., Paliwal M., Patel N., Patel T. Artificial intelligence, healthcare, clinical genomics, and pharmacogenomics approaches in precision medicine. Front. Genet. 2022; 13
        • Visscher P.M., Wray N.R., Zhang Q., Sklar P., McCarthy M.I., Brown M.A., Yang J. 10 Years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 2017; 101: 5-22
        • Tam V., Patel N., Turcotte M., Bossé Y., Paré G., Meyre D. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 2019; 20: 467-484
        • Barba M., Czosnek H., Hadidi A. Historical perspective, development and applications of next-generation sequencing in plant virology. Viruses. 2014; 6: 106-136
        • Rigatti S.J. Random forest. J. Insurance Med. 2017; 47: 31-39
        • Ahmed Z., Renart E.G., Zeeshan S., Dong X. Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis. Hum. Genomics. 2021; 15: 37
        • Pietka E., Gertych A. Advances in biomedical image processing. Comput. Med. Imaging Graph.: Official J. Comput. Med. Imaging Soc. 2021; 89: 101891
        • Venkat V., Abdelhalim H., DeGroat W., Zeeshan S., Ahmed Z. Investigating genes associated with heart failure, atrial fibrillation, and other cardiovascular diseases, and predicting disease using machine learning techniques for translational research and precision medicine. Genomics. 2023; 115: 110584