Original software publication | Volume 12, 100294, May 2022

Connecting the dots in clinical document understanding with Relation Extraction at scale

Open Access | Published: April 30, 2022 | DOI: https://doi.org/10.1016/j.simpa.2022.100294

      Highlights

      • Easy to use, scalable NLP framework that can leverage Spark.
      • Introduction of BERT based Relation Extraction models.
      • State-of-the-art performance on Named Entity Recognition and Relation Extraction.
      • Reported SOTA performance on multiple public benchmark datasets.
      • Application of these models on real-world use-cases.

      Abstract

      We present a text mining framework built on top of the Spark NLP library, comprising Named Entity Recognition (NER) and Relation Extraction (RE) models, which expands on previous work in three main ways. First, we release new RE model architectures that obtain state-of-the-art F1 scores on 5 out of 7 benchmark datasets. Second, we introduce a modular approach to train and stack multiple models in a single NLP pipeline in a production-grade library with little coding. Third, we apply these models in practical applications including knowledge graph generation, prescription parsing, and robust ontology mapping.

      Code metadata
      Table 1
      Current code version: v1.0
      Permanent link to code/repository used for this code version: https://github.com/SoftwareImpacts/SIMPAC-2022-44
      Permanent link to Reproducible Capsule: https://codeocean.com/capsule/1636734/tree/v1
      Legal Code License: Apache License, 2.0
      Code versioning system used: git
      Software code languages, tools, and services used: Python, Scala, Java
      Compilation requirements, operating environments & dependencies: Windows or Linux, JVM, Spark
      Link to developer documentation/manual (if available): https://nlp.johnsnowlabs.com/docs/en/licensed_install
      Support email for questions: [email protected]

      1. Introduction

      Biomedical literature has grown exponentially over the past decade. MEDLINE currently holds more than 26 million records from 5639 publications, and has indexed more than 5 million records in the past seven years alone [Yadav et al., 2020]. While publications and literature are growing rapidly, there is a deficiency of structured knowledge that can be easily processed by computer programs. Although recent advances in NLP (e.g., BERT [Devlin et al., 2018]) have proven to be more accurate, they require high computational resources and technical knowledge for implementation. This work presents a text mining framework comprising NER and RE models with a focus on scalability, usability, and accuracy. In addition, we emphasize the concept of relation extraction and its usage.
      Relation extraction is the task of linking entity pairs, usually identified by NER models, to each other in a given context. Consequently, NER models form the foundation, as they produce the entity spans that are fed to the relation extraction model. Because RE models depend on NER models, there is a recent trend of jointly training large BERT models on NER and RE tasks with shared layers and features [Wang and Lu, 2020]. However, even in joint learning, the RE classification is still contingent upon the entity spans identified by the NER model.

      2. Approach

      We treat Relation Extraction as a classification problem where each example is a pair of biomedical entities appearing in a given context: the entities are NER chunks, and the context is the sentence or the entire document. Our solution is a stand-alone model based on the BioBERT [Lee et al., 2019] architecture with a sequence length of 128 tokens, which achieves state-of-the-art performance on multiple benchmark datasets [Haq et al., 2021].
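      The pair-classification framing above can be sketched in a few lines of plain Python. This is a minimal illustration and not the released model: the `Chunk` type, the helper names, and the rule-based `toy_classifier` (standing in for a trained BioBERT-based classifier) are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Chunk:
    """An entity span produced by an NER model (character offsets, end exclusive)."""
    text: str
    label: str
    start: int
    end: int


def candidate_pairs(chunks: List[Chunk]) -> List[Tuple[Chunk, Chunk]]:
    """Return every unordered pair of chunks, ordered by position in the text."""
    ordered = sorted(chunks, key=lambda c: c.start)
    return list(combinations(ordered, 2))


def classify_pairs(
    sentence: str,
    chunks: List[Chunk],
    classifier: Callable[[str, Chunk, Chunk], str],
) -> List[Tuple[str, str, str]]:
    """Label each candidate pair; the classifier sees the pair and its context."""
    return [
        (a.text, b.text, classifier(sentence, a, b))
        for a, b in candidate_pairs(chunks)
    ]


def toy_classifier(sentence: str, a: Chunk, b: Chunk) -> str:
    """Hypothetical stand-in for a trained model: relate a DRUG to any non-DRUG chunk."""
    return "related" if (a.label == "DRUG") != (b.label == "DRUG") else "unrelated"


sentence = "Take aspirin 100 mg daily"
chunks = [
    Chunk("aspirin", "DRUG", 5, 12),
    Chunk("100 mg", "DOSAGE", 13, 19),
    Chunk("daily", "FREQUENCY", 20, 25),
]
relations = classify_pairs(sentence, chunks, toy_classifier)
```

      Each element of `relations` is one classified example, that is, one entity pair together with the label assigned in the context of the sentence.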
      In the Spark NLP library, we have adopted a modular approach in which NLP components are placed sequentially to form pipelines. This modularity has four main benefits. First, it provides access to the results of each component in the pipeline, giving the visibility needed to evaluate and tune each component individually. Second, it allows each component or model to be trained independently, even if one stage is a prerequisite of another. Third, the concept of pipeline components reduces coupling, so components can be added or removed without major code and architecture changes. Fourth, common components (such as tokenization, sentence segmentation, and embedding generation) can be shared to achieve a higher degree of memory and computational efficiency. RE models can be placed in a single pipeline after the NER model, and are fed the entity spans, the context, the embeddings, and the dependency tree for feature generation. We also use dependency parsing to prune candidate entity pairs, eliminating pairs with a large syntactic distance before they are fed to the RE model, which improves speed. Fig. 1 explains the pipeline components and the data flow between them.
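      The pruning of candidate pairs by syntactic distance can be illustrated with a minimal sketch. It assumes that the dependency tree is given as a list of head indices (the root has head -1) and that each entity is represented by a single head token; the function names, the hand-written parse, and the threshold of 2 are illustrative choices and not the Spark NLP API.

```python
from typing import List, Tuple


def path_to_root(token: int, heads: List[int]) -> List[int]:
    """Token indices from `token` up to the root (the root's head is -1)."""
    path = [token]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def syntactic_distance(a: int, b: int, heads: List[int]) -> int:
    """Number of dependency edges on the tree path between tokens a and b."""
    depth_from_a = {node: depth for depth, node in enumerate(path_to_root(a, heads))}
    for depth_from_b, node in enumerate(path_to_root(b, heads)):
        if node in depth_from_a:  # first shared ancestor on b's path
            return depth_from_a[node] + depth_from_b
    raise ValueError("tokens are not in the same tree")


def prune_pairs(
    pairs: List[Tuple[int, int]], heads: List[int], max_distance: int
) -> List[Tuple[int, int]]:
    """Keep only the pairs whose syntactic distance is within the threshold."""
    return [
        (a, b) for a, b in pairs if syntactic_distance(a, b, heads) <= max_distance
    ]


# Hand-written (assumed) parse of: She took aspirin and reported nausea
tokens = ["She", "took", "aspirin", "and", "reported", "nausea"]
heads = [1, -1, 1, 4, 1, 4]

pairs = [(2, 5), (0, 2), (4, 5)]
kept = prune_pairs(pairs, heads, max_distance=2)
```

      Only the pairs in `kept` would be passed on to the RE classifier; the pair (aspirin, nausea) is dropped here because its tree path is three edges long.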
      Fig. 1. Illustration of a Spark NLP pipeline containing multiple NER and RE models.
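      The modular design described in this section can be sketched in plain Python as a sequence of stages that each add a named output to a shared record. The stage functions, column names, and dictionary-based NER below are hypothetical simplifications for illustration; they are not Spark NLP annotators.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Stage = Callable[[Record], Record]


def sentence_splitter(rec: Record) -> Record:
    """Naive splitter: break on full stops (illustrative only)."""
    sentences = [s.strip() for s in rec["text"].split(".") if s.strip()]
    return {**rec, "sentences": sentences}


def tokenizer(rec: Record) -> Record:
    """Whitespace tokenizer applied to each sentence."""
    return {**rec, "tokens": [s.split() for s in rec["sentences"]]}


def make_ner(output_col: str, vocabulary: Dict[str, str]) -> Stage:
    """Build a toy dictionary-based NER stage that writes to `output_col`."""
    def ner(rec: Record) -> Record:
        chunks = [
            (tok, vocabulary[tok.lower()])
            for sent in rec["tokens"]
            for tok in sent
            if tok.lower() in vocabulary
        ]
        return {**rec, output_col: chunks}
    return ner


def run_pipeline(stages: List[Stage], rec: Record) -> Record:
    """Apply stages in order; every intermediate column stays in the record."""
    for stage in stages:
        rec = stage(rec)
    return rec


# Two NER stages reuse the same sentence splitting and tokenization.
drug_ner = make_ner("drug_chunks", {"aspirin": "DRUG"})
symptom_ner = make_ner("symptom_chunks", {"nausea": "SYMPTOM"})
result = run_pipeline(
    [sentence_splitter, tokenizer, drug_ner, symptom_ner],
    {"text": "Patient took aspirin. Nausea reported."},
)
```

      Because every stage only appends a column, the output of each component remains available for inspection, and a stage can be added or removed by editing the list passed to `run_pipeline`.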
      Fig. 2. Using RE to link drugs with their dosage and frequency. The drug is considered the parent entity, with dosage and frequency as its attributes.
      Fig. 3. Output of the ADE RE model on sample data. Arrows labeled 0 indicate that the two entities are not related, while arrows labeled 1 indicate that the reaction is caused by the drug.

      3. Applications

      To process unstructured text, the correct combination of NER and RE models is paramount. For this purpose, Spark NLP [Kocaman and Talby, 2021] provides a suite of pretrained NER and RE models [Kocaman and Talby, 2020]. The selection of these models depends primarily on the use case (what types of entities to extract) and the content of the document. As illustrated in Fig. 1, multiple models can be selected and merged in a single pipeline to process different types of records.

      3.1 Prescription parsing

      We applied the Posology NER and RE models (capable of extracting and linking drugs and their administration instructions) to parse unstructured prescription notes and generate a structured output, as shown in Fig. 2.
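      As a rough sketch of the final structuring step, the relations produced for a prescription note can be grouped under their parent drug. The tuple layout `(drug, attribute label, attribute text)`, the function name, and the sample values are assumptions made for this example rather than the output format of the Posology models.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (drug chunk, attribute label, attribute chunk); hypothetical RE output
Relation = Tuple[str, str, str]


def build_prescriptions(relations: List[Relation]) -> List[Dict[str, str]]:
    """Group dosage and frequency attributes under their parent drug."""
    records: Dict[str, Dict[str, str]] = defaultdict(dict)
    for drug, attribute, value in relations:
        records[drug][attribute.lower()] = value
    return [{"drug": drug, **attrs} for drug, attrs in records.items()]


relations = [
    ("aspirin", "DOSAGE", "100 mg"),
    ("aspirin", "FREQUENCY", "daily"),
    ("metformin", "DOSAGE", "500 mg"),
]
prescriptions = build_prescriptions(relations)
```

      Each dictionary in `prescriptions` corresponds to one drug with whatever attributes were linked to it; attributes that were not found in the note are simply absent.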

      3.2 Adverse drug event detection

      By training NER and RE models on an adverse drug event dataset, we built an extensive pipeline for identifying drugs and their reported reactions [Haq et al., 2022]. This pipeline is capable of processing data from publications and social media to identify which drugs were responsible for which reactions. As shown in Fig. 3, drugs can have different types of reactions, and in complex sentence structures with multiple mentions of drugs and reactions, it becomes important to identify each reaction and the drug that caused it.

      3.3 Putting facts on a timeline

      We applied date, symptom, test, and test result extraction models to generate patient timelines. Fig. 4 shows the generated timeline of a patient over multiple CT scans in a single admission. This information helps in understanding the trend of vitals and any progression or improvement of the disease.
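      Once dates, tests, and test results have been linked, assembling a timeline is mostly a matter of sorting and grouping. The sketch below assumes each linked fact is available as a `(date, test, result)` tuple; the function name and all sample values are invented for illustration and are not taken from the data behind Fig. 4.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

Fact = Tuple[date, str, str]  # (date, test, test result)


def build_timeline(facts: List[Fact]) -> Dict[str, List[Tuple[date, str]]]:
    """Group results by test and order each group chronologically."""
    timeline: Dict[str, List[Tuple[date, str]]] = defaultdict(list)
    for when, test, result in sorted(facts, key=lambda fact: fact[0]):
        timeline[test].append((when, result))
    return dict(timeline)


# Invented sample values, listed out of order on purpose.
facts = [
    (date(2021, 3, 20), "calcium score", "410"),
    (date(2021, 3, 2), "calcium score", "350"),
    (date(2021, 3, 11), "cyst size", "2.1 cm"),
]
timeline = build_timeline(facts)
```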

      3.4 Generating knowledge graph

      The most notable benefit of the RE models, however, is the ability to generate knowledge graphs from unstructured text. For this application, we used multiple NER and RE models. For instance, we relate procedures with dates and findings to recognize the date of a procedure and its findings, along with any existing condition. We use the relations between body parts and procedures to get more specific details about the location of the procedure. Similarly, relating body parts with findings such as test results and measurements can add more detail to the final output in specific use cases. More granularity can be achieved by further subdividing body parts. For instance, in our experiment, we divide the body part into three parts: the primary body part (e.g., lung), a sub-part (e.g., lobe), and the direction/laterality (e.g., left) of the body part. In practice, these specific entities trickle down from the NER model to the RE models. A graph generated from a sample report can be seen in Fig. 5.
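      A knowledge graph of this kind can be represented as an adjacency list built from relation triples. The following is a minimal sketch under stated assumptions: the relation names (`performed_on`, `has_sub_part`, and so on) and the sample triples are hypothetical and do not reflect the label set of the released models.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)
Graph = Dict[str, List[Tuple[str, str]]]


def build_graph(triples: List[Triple]) -> Graph:
    """Turn relation triples into an adjacency list keyed by head entity."""
    graph: Graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


def reachable(graph: Graph, start: str) -> List[str]:
    """Depth-first list of every entity connected to `start` by outgoing edges."""
    seen: List[str] = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        # Reverse so that the first-listed edge is explored first.
        stack.extend(tail for _, tail in reversed(graph.get(node, [])))
    return seen


triples = [
    ("CT scan", "performed_on", "lung"),
    ("CT scan", "has_finding", "nodule"),
    ("lung", "has_sub_part", "lobe"),
    ("lobe", "has_laterality", "left"),
]
graph = build_graph(triples)
details = reachable(graph, "CT scan")
```

      Walking the graph from a procedure node collects the body part, its sub-part, its laterality, and the finding, which is the kind of detail that the granular entities are meant to contribute.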
      Fig. 4. A sample timeline of a patient showing the calcium score trend and the evolution of a cyst over multiple scans in a month.

      3.5 Improved entity mapping

      Extracting granular entities (rather than larger, more generalized chunks) and then linking them semantically enriches the primary entities. For example, in Fig. 5, symptoms and procedures are the primary entities, which can be mapped to ICD and CPT codes. In such cases, we merge non-primary entities (such as body part or procedure technique) into the primary entities to enrich them with precise information. These enriched chunks produce more accurate results in entity standardization (mapping extracted concepts or chunks to medical coding ontologies such as SNOMED, ICD, and CPT), as explained in Fig. 6.
      Fig. 5. A graphical representation (with CPT, ICD & SNOMED codes) of the structured data extracted from a sample text.
      Fig. 6. General chunks like “CT Scan” get mapped to general codes, but more specific entities like “CT Scan of chest” get mapped to a more precise code.
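      The enrichment step can be sketched as merging a linked body part into the primary chunk before looking it up. The example below is a simplification under stated assumptions: the ontology is a toy dictionary with placeholder codes (not real CPT, ICD, or SNOMED codes), and an exact string lookup stands in for the actual entity resolution models.

```python
from typing import Dict, Optional

# Placeholder codes for illustration only; these are not real ontology codes.
TOY_ONTOLOGY: Dict[str, str] = {
    "ct scan": "PLACEHOLDER-GENERAL",
    "ct scan of chest": "PLACEHOLDER-CHEST",
}


def enrich_chunk(primary: str, body_part: Optional[str] = None) -> str:
    """Merge a linked body part into the primary entity chunk."""
    return f"{primary} of {body_part}" if body_part else primary


def map_to_code(chunk: str, ontology: Dict[str, str]) -> Optional[str]:
    """Exact, case-insensitive lookup standing in for entity resolution."""
    return ontology.get(chunk.lower())


def resolve(primary: str, body_part: Optional[str], ontology: Dict[str, str]) -> Optional[str]:
    """Prefer the code of the enriched chunk; fall back to the general chunk."""
    enriched = enrich_chunk(primary, body_part)
    return map_to_code(enriched, ontology) or map_to_code(primary, ontology)


general = resolve("CT Scan", None, TOY_ONTOLOGY)
specific = resolve("CT Scan", "chest", TOY_ONTOLOGY)
fallback = resolve("CT Scan", "knee", TOY_ONTOLOGY)
```

      The fallback to the general chunk is a design choice of this sketch: when the enriched chunk has no entry, the primary entity alone still yields a usable, if less precise, code.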

      4. Impact overview

      Spark NLP is an evolving library that has been widely used in academia and industry owing to its scalability and production-ready code base. For instance, Nugroho et al. [2021] trained BERT models to perform large-scale news classification. In addition, Spark NLP's healthcare-specific capabilities (e.g., de-identification, relation extraction) make it particularly applicable in the pharmaceutical and healthcare industries. Kocaman and Talby [2020] and Varol et al. [2022] leveraged the bespoke healthcare models to analyze COVID-19 literature, while Agarwal et al. [2021] used document section segregation and symptom detection to generate training data for transfer learning on COVID-19 data. Ioanovici et al. [2021] used clinical NER and RE models to process endoscopy reports into a structured output of findings. Dekermanjian et al. [2021] used clinical language models to generate a database of metabolomics software. Lee et al. [2021] used the library's de-identification capability to mask protected health information (PHI) in documents. As research expands to document understanding, one avenue to explore is linking entities that are syntactically far apart from each other using wider attention spans while remaining computationally efficient.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgments

      The authors of this study are contributors to the Spark NLP library, alongside other contributors. The authors have been involved in designing model architectures and in experimentation.

      References

        • Yadav S., Ramesh S., Saha S., Ekbal A. Relation extraction from biomedical and clinical text: Unified multitask learning framework. CoRR. 2020; abs/2009.09509 (https://arxiv.org/abs/2009.09509)
        • Devlin J., Chang M., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR. 2018; abs/1810.04805
        • Wang J., Lu W. Two are better than one: Joint entity and relation extraction with table-sequence encoders. CoRR. 2020; abs/2010.03851 (https://arxiv.org/abs/2010.03851)
        • Lee J., Yoon W., Kim S., Kim D., Kim S., So C., Kang J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. CoRR. 2019; abs/1901.08746
        • Haq H., Kocaman V., Talby D. Deeper clinical document understanding using relation extraction. CoRR. 2021; abs/2112.13259
        • Kocaman V., Talby D. Spark NLP: Natural language understanding at scale. Softw. Impacts. 2021; 8: 100058 (https://www.sciencedirect.com/science/article/pii/S2665963821000063)
        • Kocaman V., Talby D. Improving clinical document understanding on COVID-19 research with spark NLP. CoRR. 2020; abs/2012.04005 (https://arxiv.org/abs/2012.04005)
        • Haq H., Kocaman V., Talby D. Mining adverse drug reactions from unstructured mediums at scale. CoRR. 2022; abs/2201.01405 (https://arxiv.org/abs/2201.01405)
        • Nugroho K., Sukmadewa A., Yudistira N. Large-scale news classification using bert language model: Spark nlp approach. in: 6th International Conference on Sustainable Information Engineering and Technology 2021. 2021: 240-246
        • Varol A., Kocaman V., Haq H., Talby D. Understanding COVID-19 news coverage using medical NLP. arXiv. 2022 (https://arxiv.org/abs/2203.10338)
        • Agarwal K., Choudhury S., Tipirneni S., Mukherjee P., Ham C., Tamang S., Baker M., Tang S., Kocaman V., Gevaert O., et al. Preparing for the next pandemic: Transfer learning from existing diseases via hierarchical multi-modal BERT models to predict COVID-19 outcomes. 2021
        • Ioanovici A., Măruşteri Ş., Feier A., Trambitas-Miron A. Spark NLP: A versatile solution for structuring data from endoscopy reports. Appl. Med. Inform. 2021; 43: 26
        • Dekermanjian J., Labeikovsky W., Ghosh D., Kechris K. MSCAT: A machine learning assisted catalog of metabolomics software tools. Metabolites. 2021; 11: 678
        • Lee J., Dang H., Uzuner O., Henry S. MNLP at MEDIQA 2021: Fine-tuning PEGASUS for consumer health question summarization. in: Proceedings of the 20th Workshop on Biomedical Language Processing. 2021: 320-327 (https://aclanthology.org/2021.bionlp-1.37)