Original Software Publication | Volume 13, 100373, August 2022

Accurate Clinical and Biomedical Named Entity Recognition at Scale

Open Access | Published: July 19, 2022 | DOI: https://doi.org/10.1016/j.simpa.2022.100373

      Highlights

      • Named entity recognition (NER) is one of the most important building blocks of NLP in the medical domain: it extracts meaningful chunks from clinical notes and reports, which are then fed to downstream tasks such as assertion status detection, entity resolution, relation extraction, and de-identification. Given the growing volume of healthcare data in unstructured format, an increasingly important challenge is providing highly accurate implementations of state-of-the-art deep learning (DL) algorithms at scale.
      • While recent advances in NLP such as Transformers and BERT have pushed the boundaries of accuracy, these methods are significantly slower and difficult to scale to millions of records.
      • In this study, we introduce an agile, production-grade clinical and biomedical NER algorithm based on a modified BiLSTM-CNN-Char DL architecture built on top of Apache Spark.
      • Our NER implementation establishes new state-of-the-art accuracy on 7 of 8 well-known biomedical NER benchmarks and 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction. Moreover, clinical NER models trained using this implementation outperform the accuracy of commercial entity extraction solutions such as AWS Medical Comprehend and Google Cloud Healthcare API by a large margin (8.9% and 6.7%, respectively), without using memory-intensive language models.
      • The proposed model requires no handcrafted features or task-specific resources, requires minimal hyperparameter tuning for a given dataset from any domain, can be trained with any embeddings (including BERT), and can be trained to support more human languages with no code changes. It is available within a production-grade code base as part of the Spark NLP library: the only open-source NLP library that can scale to make use of a Spark cluster for training and inference, has GPU support, and provides bindings for Python, R, Scala, and Java.

      Abstract

      We introduce an agile, production-grade clinical and biomedical named entity recognition (NER) algorithm based on a modified BiLSTM-CNN-Char DL architecture built on top of Apache Spark. Our NER implementation establishes new state-of-the-art accuracy on 7 of 8 well-known biomedical NER benchmarks and 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction. Moreover, clinical NER models trained using this implementation outperform the accuracy of commercial entity extraction solutions, AWS Medical Comprehend and Google Cloud Healthcare API, by a large margin (8.9% and 6.7%, respectively), without using memory-intensive language models.

      Code metadata

      Current code version: v3.4.4
      Permanent link to code/repository used for this code version: https://github.com/SoftwareImpacts/SIMPAC-2022-75
      Permanent link to Reproducible Capsule: https://codeocean.com/capsule/1573505/tree/v1
      Legal Code License: Apache-2.0 License
      Code versioning system used: git, maven
      Software code languages, tools, and services used: Scala, Python, Java, R
      Compilation requirements, operating environments & dependencies: JDK 8, Apache Spark
      If available, link to developer documentation/manual: https://nlp.johnsnowlabs.com/api/
      Support email for questions: [email protected]

      1. Background

      Electronic health records (EHRs) are the primary source of information for clinicians tracking the care of their patients. The EHR of a large medical organization can capture the medical transactions of over 10 million patients throughout the course of a decade. A single hospitalization alone typically generates around 150,000 pieces of data. The potential benefits derived from this data are significant. In aggregate, an EHR of this scale represents 200,000 years of doctor wisdom and 100 million years of patient outcome data, covering a plethora of rare conditions and maladies [1]. These records include information such as the reason for administering drugs, previous disorders of the patient, and the outcome of past treatments. It is estimated that unstructured data accounts for more than 80% of currently available healthcare data [2]. Given that US doctors spend almost six hours on EHR documentation in a typical workday [1], EHRs are the largest source of empirical data in biomedical research, and unlocking this information and making it available for downstream analysis can significantly advance biomedical and clinical research.
      The widespread adoption of EHRs and the growing wealth of digitized information about patients are opening new doors to uncovering previously unidentified associations and accelerating knowledge discovery via state-of-the-art machine learning (ML) algorithms and new statistical methods. Due to the innate obstacles in extracting information from unstructured text and the high level of precision required in the healthcare domain, manual data abstraction has been prevalent in the industry.
      However, extracting common trends or other insights from EHRs requires time-intensive manual review; the extracted data could then be used for clinical research, accurate clinical modeling, or other administrative tasks. Some of the information fed into these systems is found in structured fields whose values are entered electronically (e.g., laboratory test orders or results) [3], but most of the time the information in these records is unstructured, making it largely inaccessible for statistical analysis [4]. Since manual abstraction is an expensive, time-consuming, and error-prone process, there is a growing need for natural language processing (NLP) applications that automate the clinical abstraction process and make EHR data available through fast, scalable, and secure data pipelines.
      Extracting valuable information from EHRs with intelligent systems starts with named entity recognition (NER), a key building block of common NLP tasks such as question answering, topic modeling, and information retrieval [5]. In the medical domain, NER plays the most crucial role: it produces the first meaningful chunks of a clinical note, which are then fed as input to subsequent downstream tasks such as clinical assertion status detection [6], clinical entity resolution [7], and de-identification of sensitive data [8]. However, segmenting clinical and drug entities is considered a difficult task for biomedical NER systems because of the complex orthographic structures of named entities [9].
      ML methods formulate the clinical NER task as a sequence labeling problem that aims to find the best label sequence for a given input sequence (individual words from clinical text) [10]. Many top-ranked NER systems applied the conditional random fields (CRF) model [11], one of the most widely used conventional ML algorithms. A typical state-of-the-art clinical NER system usually utilizes features from different linguistic levels, including orthographic information (e.g., capitalization of letters, prefixes and suffixes), syntactic information (e.g., part-of-speech (POS) tags), word n-grams, word embeddings, and semantic information (e.g., the UMLS concept unique identifier) [10]. These features are usually utilized in LSTM-based [12] neural network frameworks [13,14,15], which gained popularity among researchers due to their effectiveness in modeling sequential patterns.
      In recent years, pretraining large neural language models with rich contextual embeddings, such as BERT [16] and ELMo [17], has also led to impressive gains in NER systems, and many clinical variants of BERT, such as BioBERT [18], ClinicalBERT [19], BlueBERT [20], SciBERT [21], and PubMedBERT [22], have been crafted to address biomedical and clinical NER tasks with state-of-the-art results. However, since these methods require significant computational resources during both training and inference, using them in production is often impractical under restricted computational resources, compared to classical pretrained embeddings (e.g., GloVe [23]). A recent study [24] empirically shows that classical pretrained embeddings can match contextual embeddings on industry-scale data, often performing within 5 to 10% (absolute) accuracy on benchmark tasks.
      Despite the growing interest and these groundbreaking advances in NER systems, easy-to-use, production-ready models and tools are scarce; this is a major obstacle for clinical NLP practitioners seeking to implement the latest algorithms in their data pipelines and apply them quickly. On the other hand, NLP toolkits specialized for processing biomedical and clinical text, such as MetaMap [25] and cTAKES [26], typically do not make use of new research innovations such as word representations or neural networks, and hence produce less accurate results [27,28]. This year, two new libraries, Stanza [27] and SciSpacy [28], took the stage to address the issues discussed above, releasing new Python-based software libraries. Both offer out-of-the-box clinical and biomedical pretrained NER models utilizing the state-of-the-art deep learning frameworks mentioned above. However, none of these libraries or tools can scale up to leverage compute clusters without compromising accuracy, nor do they support in-memory distributed data processing solutions such as Apache Spark.
      In this study, we show through experiments that the current NER module of the Spark NLP library [29] achieves new state-of-the-art results on popular biomedical benchmark datasets and clinical concept extraction challenges, namely the 2010 i2b2/VA challenge [6], the 2014 n2c2 de-identification challenge [30], and the 2018 n2c2 medication extraction challenge [31], and exceeds commercial entity extraction solutions, AWS Medical Comprehend and Google Cloud Healthcare API, by a substantial margin (8.9% and 6.7%, respectively), without using memory-intensive contextual embeddings like BERT. By porting a modified version of the well-known BiLSTM-CNN-Char NER architecture [14] into the Apache Spark environment, this NER module can already be extended to other spoken languages with zero code changes and can scale up in Spark clusters. Sample code for the NER models can be found in the official GitHub repository.
      The specific novel contributions of this paper are the following:
      • Delivering the first production-grade, scalable NER implementation that is easy to train on any dataset without architectural modification.
      • Delivering state-of-the-art NER models that achieve top scores on biomedical and clinical NER benchmark datasets and exceed commercial entity extraction solutions such as AWS Medical Comprehend and Google Cloud Healthcare API.
      • Explaining the NER model implementation in Spark NLP, the only NLP library that can scale up in Spark clusters and supports multiple popular programming languages (Python, R, Scala, and Java).

      2. Implementation

      The deep neural network architecture of the NER model in Spark NLP is the BiLSTM-CNN-Char framework, a slightly modified version of the architecture proposed by Chiu and Nichols [14]. It is a neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and convolutional neural network (CNN) architecture, eliminating the need for most feature engineering steps.
      In the original framework, the CNN extracts a fixed-length feature vector from character-level features. For each word, these vectors are concatenated and fed to the BiLSTM network and then to the output layers. The authors employed a stacked bidirectional recurrent neural network with long short-term memory units to transform word features into named entity tag scores: the extracted features of each word are fed into a forward LSTM network and a backward LSTM network, and the output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In sum, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings are used for char features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) are used for case features. They also made use of lexicons as a form of external knowledge, as proposed in [32]. We modified this architecture in a number of ways.
      Even though better results were reported by [33] using robust lexical features, after experimenting with different parameters and components we decided to remove lexical features in order to reduce complexity, relying instead on pretrained biomedical embeddings, casing features, and char features extracted through a CNN. As sentences are represented by two nested sequences (words and chars), the CNN is applied so that each character is embedded via a character embedding matrix of dimension 25; a 1D convolution layer then processes the sequence of embedded char vectors, followed by a max-pooling operation, so that each word gets a fixed-length vector representation. We used 25 filters and a kernel size of 3. It is worth mentioning that char features have proved highly useful in NER models and provide a level of immunity to typos and spelling errors. Fig. 1 shows the overall architecture of our NER model.
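      To make the character-feature path concrete, the following is a minimal Keras-style sketch of the char-CNN component described above (25-dimensional character embeddings, 25 filters, kernel size 3, max pooling). The vocabulary size, word-length padding, and variable names are illustrative assumptions, not the library's actual internals, which are implemented in TensorFlow/Scala:

```python
# A minimal sketch of the character-CNN feature extractor, assuming
# MAX_WORD_LEN and CHAR_VOCAB as placeholder sizes.
import tensorflow as tf
from tensorflow.keras import layers

MAX_WORD_LEN = 20   # assumed padding length for characters per word
CHAR_VOCAB = 100    # assumed character vocabulary size

char_ids = layers.Input(shape=(MAX_WORD_LEN,), dtype="int32")  # one word
char_emb = layers.Embedding(CHAR_VOCAB, 25)(char_ids)          # (20, 25)
conv = layers.Conv1D(filters=25, kernel_size=3, padding="same",
                     activation="relu")(char_emb)              # (20, 25)
word_char_vec = layers.GlobalMaxPooling1D()(conv)              # (25,)

char_cnn = tf.keras.Model(char_ids, word_char_vec)
# In the full model, this fixed-length char vector is concatenated with
# the word embedding and casing features before the BiLSTM layers
# (in practice, applied per word, e.g. via a TimeDistributed wrapper).
```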
      We implemented the model in TensorFlow (TF) using LSTMBlockFusedCell, an extremely efficient LSTM implementation based on [34] that uses a single TF operation for the entire LSTM; our experiments show that it is both faster and more memory-efficient than LSTMBlockCell. We then implemented this framework in Scala using the TensorFlow API. This setup is ported into Spark and lets the driver node run the entire training using all of its available cores. Spark NLP also provides a CUDA version of each TensorFlow component, which utilizes GPU acceleration when available.
      While the NER architecture extracts character-level features from word tokens, the most useful features come from the semantic vectors of each token. Given the proven efficiency of domain-specific word embeddings in NER tasks [35], we trained custom biomedical word embeddings with a skip-gram model on PubMed abstracts and case studies, as described in [36], to learn distributed representations of words from contextual information. The trained word embeddings have 200 dimensions and a vocabulary of 2.2 million words. To measure the effect of using biomedical data in our trained embeddings, we also used 300-dimensional pretrained GloVe embeddings trained on 6 billion tokens from Wikipedia and the Gigaword-5 dataset (denoted GloVe 6B) [23]. On the biomedical datasets used in this study, the average word coverage of our domain-specific embeddings is 99.5%, versus 96.1% for the GloVe 6B embeddings.
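      As an illustration of this training setup, here is a hedged sketch using gensim's Word2Vec, where sg=1 selects the skip-gram objective. The toy corpus, window size, and min_count are placeholder assumptions; the actual corpus is PubMed abstracts and case studies:

```python
# A hedged sketch of training 200-dimensional skip-gram embeddings
# (gensim >= 4.x parameter names assumed).
from gensim.models import Word2Vec

# `sentences` should be an iterable of token lists from the corpus;
# the two sentences below are purely illustrative.
sentences = [
    ["metastatic", "breast", "cancer", "was", "diagnosed"],
    ["the", "patient", "denies", "chest", "pain"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=200,   # 200-dimensional vectors, as in the paper
    sg=1,              # skip-gram
    window=5,          # assumed context window
    min_count=1,       # assumed; tune to reach a 2.2M-word vocabulary
    workers=4,
)
vector = model.wv["cancer"]  # 200-dim embedding for a token
```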

      2.1 Experiments

      In this section, we describe the datasets and evaluation metrics, and provide an overview of the experimental setup. We set up three experiments:
      • Performance and timing evaluation on biomedical NER datasets using the same DL architecture in TF (via Keras) and Spark NLP.
      • Performance evaluation on clinical datasets studied in past i2b2 and n2c2 challenges.
      • Benchmarking against commercial entity extraction solutions, AWS Medical Comprehend and Google Cloud Healthcare API.

      2.1.1 Datasets

      The development of NLP systems is contingent on access to relevant data, and EHRs are notoriously difficult to obtain for privacy reasons. Despite recent efforts to de-identify (Appendix 7) and release narrative EHRs for research, such datasets are still very rare. As a result, clinical NLP as a field has lagged behind; to address this problem, researchers have released various data sources.
      For our first experiment, we trained individual NER models on 8 publicly available biomedical NER datasets provided by [37]: AnatEM, BC5CDR, BC4CHEMD, BioNLP13CG, JNLPBA, Linnaeus, NCBI-Disease, and S800. These models cover a wide variety of entity types in domains ranging from anatomical analysis to genetics and cellular biology. For the sake of brevity, we do not include details about the nature of these datasets; readers can refer to the cited papers for more information.
      Fig. 1. Our proposed NER architecture. While the feature vectors for character embeddings and casing features have fixed dimensions, the main context is provided by token embeddings, which can be generated using multiple approaches (e.g., GloVe, BERT) and can have different dimensions.
      Our second experiment is based on 3 different clinical concept extraction challenges, organized by researchers at National NLP Clinical Challenges (n2c2), formerly known as i2b2 (Informatics for Integrating Biology and the Bedside). These challenges have multiple subtasks but, in line with the scope of this paper, we focus only on the NER subtasks. The challenges come with annotated datasets, which we use for our experiments. We used the 2010 i2b2/VA challenge [6] to test performance on clinical entities, the 2014 i2b2 de-identification challenge [30] to test performance on PHI concepts, and the 2018 n2c2 medication extraction challenge [31] to test performance on medicine/drug entities. Concise details of these datasets are available in Appendix 7, while extensive details can be found in the cited papers.
      Our third experiment is a detailed comparison of our proposed solution with commercially available NLP solutions. While there are several commercial NLP solutions that make it easy to use machine learning to extract relevant medical information from unstructured text, AWS Medical Comprehend (AMC) [38] and Google Cloud Platform (GCP) Healthcare NLP are the most widely used and popular services at the moment. Since the data used by both services is confidential, and given that it is highly expensive and time-consuming to develop in-house datasets, we can safely assume that, apart from proprietary in-house datasets, they must have also included publicly available datasets for training purposes. To test this hypothesis, and for a fair comparison, we sampled one thousand clinical notes as a test set from the MIMIC-III dataset [39] and had them annotated by two medical doctors to produce ground truths for 3 types of entities: Problem, Test, and Drug. We chose these entity types because AMC and GCP both support them and they occur frequently in the source datasets. Details of the entity mappings between these services can be found in Appendix 7. Problem and Test entities originally come from 2010 i2b2/VA, and Drug entities come from the 2018 n2c2 challenge, but AMC and GCP enriched and fine-tuned their models with additional datasets to achieve better generalization [40]. In order to make a fair comparison, we also enriched the datasets by merging similar entities from the 2012 i2b2 challenge [41] for Problem and Test entities, and from OpenFDA [42] for Drug entities.
      All the datasets used in the experiments follow the IOB2 (Inside, Outside, Beginning) tagging scheme [43], as this is the primary tagging scheme of our NER implementation. There are other tagging schemes as well, such as BIOES (Begin, Inside, Outside, End, Single), which is used in the original implementation of our NER architecture and reportedly yields considerable performance improvements over IOB, such as converging more quickly and predicting token boundaries better [32], since it has explicit boundary tags. However, we experienced various performance issues with the BIOES scheme, such as converging very fast in the early epochs but then failing to generalize further and getting stuck in local minima, so we decided to use the BIO scheme for our experiments. More details on other tagging schemes can be found in Appendix 7.
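      To illustrate the difference between the two schemes, consider this made-up clinical fragment containing one multi-token Problem entity:

```python
# IOB2 vs. BIOES tagging of the same entity span (illustrative example).
tokens = ["He", "has", "metastatic", "breast", "cancer", "."]

iob2  = ["O", "O", "B-Problem", "I-Problem", "I-Problem", "O"]
bioes = ["O", "O", "B-Problem", "I-Problem", "E-Problem", "O"]
# IOB2: B- marks the first token of every entity, I- the remaining tokens.
# BIOES additionally uses E- for the last token of a multi-token entity
# and S- for single-token entities, making boundaries explicit.
```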

      2.1.2 Experimental settings

      All experiments were run on a Colab server provided by Google (2 vCPUs @ 2.2 GHz, 13 GB RAM) using Apache Spark in local mode (no cluster). We use the standard train and test splits for training and evaluating the models, and report both macro- and micro-averaged F1 scores (excluding O's). For comparisons against existing benchmark papers, micro-averaging is used, as these papers primarily report micro scores; consequently, we use the same policy for evaluating commercial APIs on clinical datasets. However, for comparisons that do not rely on external data (e.g., the comparison between Keras and our implementation), macro-averaging is used, as it is impervious to class imbalance and more representative of the model's ability to deal with a wide range of labels.
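      For clarity, the two averaging schemes over entity classes c = 1, ..., C (with per-class true positive, false positive, and false negative counts) are the standard definitions:

```latex
% Micro-averaged F1 pools the counts over all entity classes,
% so frequent classes dominate the score:
\mathrm{F1}_{\text{micro}} =
  \frac{2\sum_{c=1}^{C} TP_c}
       {2\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FP_c + \sum_{c=1}^{C} FN_c}
% Macro-averaged F1 weights every class equally, regardless of support:
\mathrm{F1}_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{F1}_c
```

      In both cases the O label is excluded from the class set in all reported scores.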
      For hyperparameter tuning, we ran experiments using random search to find the parameter settings producing the best results. Final parameter values, along with the tested ranges, can be found in Appendix 7.
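      For illustration, here is a minimal sketch of training such a model with Spark NLP's NerDLApproach on a CoNLL-formatted dataset, using the hyperparameter values reported in Table 2; the file path and the general-purpose embeddings name are placeholder assumptions:

```python
# A minimal training sketch (assumed file path and embeddings model name).
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

# The CoNLL reader produces document/sentence/token/label columns.
training_data = CoNLL().readDataset(spark, "path/to/train.conll")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setDropout(0.5) \
    .setBatchSize(128) \
    .setRandomSeed(0)

ner_model = Pipeline(stages=[embeddings, ner_approach]).fit(training_data)
```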

      3. Results

      An NER system is usually part of an end-to-end NLP pipeline through which the text is fed and several preprocessing steps are applied. Since the DL algorithm we implement is sentence-wise, and the features (embeddings and casing) are token-wise, sentence splitting and tokenization are the most important steps for achieving better accuracy. Using a DL-based sentence detector module [44] and a highly customizable rule-based tokenizer in Spark NLP, we ensured that the generated features are informative: if a token labeled as an entity in the training set cannot be isolated from appended characters and punctuation, it may be treated as an out-of-vocabulary word during embedding lookup, which harms learning.
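      As an illustration of such a pipeline, here is a minimal sketch using the open-source Spark NLP Python API. The pretrained model names ("sentence_detector_dl" by default, "glove_100d", "ner_dl") are public general-purpose models used here for illustration; the clinical models benchmarked in this paper ship with the licensed library:

```python
# A hedged sketch of the end-to-end prediction pipeline described above:
# DL-based sentence detection, rule-based tokenization, embedding lookup,
# a pretrained NER model, and a converter that assembles IOB tags into chunks.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (NerConverter, NerDLModel,
                                SentenceDetectorDLModel, Tokenizer,
                                WordEmbeddingsModel)
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
chunks = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, sentence, token,
                            embeddings, ner, chunks])
data = spark.createDataFrame(
    [["The patient was given metformin for diabetes."]]).toDF("text")
pipeline_model = pipeline.fit(data)
result = pipeline_model.transform(data)
```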

      3.1 Accuracy on biomedical NER benchmarks

      In our previous study [45], we showed through extensive experiments that the NER module of the Spark NLP library exceeds the biomedical NER benchmarks reported by Stanza on 7 out of 8 benchmark datasets, and on every dataset reported by SciSpacy, without using heavy contextual embeddings like BERT. We present our results in Table 1. In addition to being the only NLP library capable of scaling up training and inference on any Spark cluster, the Spark NLP NER architecture obtains these new state-of-the-art results on seven public biomedical benchmarks without memory-intensive contextual embeddings. We also report significant improvements over Stanza on three benchmark datasets: BC4CHEMD, 93.72% (4.1% gain); Species800, 80.91% (4.6% gain); and JNLPBA, 81.29% (5.2% gain).
      The benchmarks show that the Spark NLP NER models with pretrained clinical embeddings produce better results than Stanza on 7 out of 8 biomedical datasets. It is also surprising that, despite using general-purpose embeddings (GloVe 6B), our NER model exceeds Stanza's benchmarks (which use domain-specific CharLM character-level language model embeddings [46]) on half of the benchmarks.
      Table 1. NER performance with in-house clinical GloVe embeddings, open-source GloVe embeddings (6B), and BERT for Token Classification (BFTC) using BioBERT base cased, across different datasets in the biomedical domain. All scores reported are micro-averaged test F1, excluding O's. Stanza results are from [27]. The official training and validation sets are merged and used for training before testing the trained models on the original test sets. For reproducibility purposes, we use the preprocessed versions of these datasets provided by [37] (also used by Stanza). The best score in each row is marked with * (BFTC scores ignored for this analysis). The results can be reproduced with the Colab notebook shared in the official repository: https://github.com/JohnSnowLabs/spark-nlp-workshop.
      Dataset | Spark NLP Clinical Emb. | Spark NLP BioBERT (BFTC) | Spark NLP GloVe 6B Emb. | Stanza
      NCBI-Disease | 89.13* | 90.48 | 87.19 | 87.49
      BC5CDR | 89.73* | 90.89 | 88.32 | 88.08
      BC4CHEMD | 93.72* | 94.39 | 92.32 | 89.65
      Linnaeus | 86.26 | 82.20 | 85.51 | 88.27*
      Species800 | 80.91* | 82.59 | 79.22 | 76.35
      JNLPBA | 81.29* | 78.24 | 79.78 | 76.09
      AnatEM | 89.13* | 91.65 | 87.74 | 88.18
      BioNLP13-CG | 85.58* | 87.83 | 84.30 | 84.34

      3.2 Accuracy & speed of TensorFlow vs. Spark NLP

      Since the experiment in the previous section involved hyperparameter tuning (batch size and epoch count, specifically), we define a new experiment to test the speed and accuracy of our Spark implementation against the standard Keras framework, keeping the hyperparameters constant across all datasets. To make the experiments reproducible, we use the same biomedical datasets and the standard GloVe 6B embeddings mentioned above. We trained several models to compare the training speed and performance of our NER implementation in Spark with the same DL architecture in TensorFlow using the Keras API.
      Results show that the Spark NLP implementation beats the same architecture 7 out of 8 times in terms of macro F1 score, and is faster to train on half of the datasets on a single machine, using all available cores in both settings, as shown in Table 2. These results indicate that Spark NLP is not just a wrapper around TensorFlow: the accuracy improvement comes mainly from our modified NER DL architecture, as well as from Spark itself, which uses a state-of-the-art DAG (directed acyclic graph) scheduler, a query optimizer, and a physical execution engine. The code for the TensorFlow implementation and the biomedical NER datasets are shared as supplementary material.
      Table 2. Performance evaluation on biomedical NER datasets using the same BiLSTM-CNN-Char architecture in TensorFlow and Spark NLP under the same settings for each dataset (macro-averaged F1 score; embeddings glove6B_300d, lr 0.001, dropout 0.5, LSTM state size 200, 10 epochs, batch size 128, Adam optimizer). The Spark NLP implementation beats the same architecture 7 out of 8 times in terms of macro F1 score and is faster to train on half of the datasets. The best macro-F1 for each dataset is marked with *.

      Dataset | TF 1.15 (Keras) Time (sec) | TF 1.15 (Keras) macro-F1 | Spark NLP Time (sec) | Spark NLP macro-F1
      BC5CDR-disease | 409 | 0.840 | 336 | 0.858*
      BC5CDR-chem | 438 | 0.848 | 367 | 0.894*
      BC4CHEMD | 2954 | 0.890 | 2719 | 0.936*
      NCBI-Disease | 312 | 0.882 | 269 | 0.883*
      JNLPBA | 495 | 0.705 | 743 | 0.758*
      Species800 | 215 | 0.813 | 232 | 0.820*
      Linnaeus | 709 | 0.787* | 730 | 0.759

      3.3 Accuracy on clinical NER benchmarks

      Using the official training sets from the n2c2 challenges described in the previous section, we trained NER models and obtained metrics on the official test sets used in the challenges. The results, shown in Table 3, indicate that the proposed NER model performs better than the best results published so far on these datasets.
      Table 3. Performance metrics on the 2010 i2b2/VA clinical concept extraction, 2014 i2b2 de-identification, and 2018 n2c2 medication extraction challenges. Scores indicate entity-level (span match) micro F1 scores (strict match, excluding O's) on the official test sets. BERT-based benchmarks are omitted from this study to allow a fair comparison between similar DL architectures.

      Challenge | Spark NLP | Latest SOTA
      2010 i2b2/VA | 0.876 | 0.862 [6]
      2014 n2c2 | 0.961 | 0.955 [47]
      2018 n2c2 | 0.899 | 0.896 [31]

      3.4 Accuracy vs. Commercial NER services

      There are several commercial NLP solutions that make it easy to use machine learning to extract relevant medical information from unstructured text; AWS Medical Comprehend (AMC) [38] and Google Cloud Platform (GCP) Healthcare NLP are the most widely used and popular services at the moment. Fueled by the high-quality data that AWS and Google already have, their solutions are expected to be of the highest quality and to generalize well on unseen clinical texts. We built a custom test set (explained in the previous section) to compare these commercial solutions with ours.
      Since the sampled test set comes from a different distribution than the training sets the models were trained on, the results are, as expected, not as high as those on the official test sets. Nevertheless, the clinical NER models trained with the Spark NLP implementation exceed the commercial entity extraction solutions by a large margin (8.9% and 6.7%, respectively) without using memory-intensive language models such as transformers. The benchmark results can be seen in Table 4; the sampled annotations can be shared upon request.
      Table 4. Comparison of our NER models with AWS Medical Comprehend (AMC) and the Google Cloud Platform (GCP) Healthcare API on 1000 randomly sampled clinical notes from the MIMIC-III database. Tests were run on three major entity classes (Problem, Test, Drug); on average (macro F1 score), the Spark NLP clinical NER models are 8.9% and 6.7% better than AMC and GCP, respectively.

      Entity | Sample | Spark NLP (P / R / F1) | AMC (P / R / F1) | GCP (P / R / F1)
      Problem | 4891 | 0.726 / 0.585 / 0.648 | 0.539 / 0.478 / 0.507 | 0.850 / 0.516 / 0.642
      Test | 5903 | 0.782 / 0.662 / 0.717 | 0.594 / 0.703 / 0.644 | 0.576 / 0.461 / 0.512
      Drug | 10284 | 0.946 / 0.882 / 0.913 | 0.815 / 0.910 / 0.860 | 0.962 / 0.885 / 0.922
      Avg. F1 | | 0.759 | 0.670 | 0.692

      4. Discussion

      Due to architectural design choices in the JVM TensorFlow implementation at the time of writing, distributing model training over the worker nodes of a cluster was not viable or effective, and putting the burden of the entire training process on the driver node imposed some limitations in terms of computational resources. Nevertheless, being able to get predictions at scale over voluminous data with state-of-the-art accuracy outweighs this disadvantage. As explained in the results section, training the proposed model is also faster than the base version of the same architecture in plain TF.
      In order to train large models on local machines with less memory, we implemented a dynamic memory optimization feature that can be turned on or off when setting up training. It lets us decide whether the features are collected and generated all at once and then fed into the network batch by batch, or collected and generated per batch and then fed into the network in batches. This proved highly useful for speeding up the training process on large-memory machines as well as on driver nodes.
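      In the Python API this corresponds to a single switch on the trainer. A minimal sketch, assuming it maps to the open-source NerDLApproach parameter of the same name (continuing the training example from Section 2.1.2):

```python
# Hedged sketch: enable the dynamic memory optimization described above.
# When enabled, features are generated per batch instead of being collected
# in memory all at once, trading peak memory for some extra training time.
ner_approach = ner_approach.setEnableMemoryOptimizer(True)
```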
      We also implemented batch prediction, in which the Spark executor collects all available rows (up to a batch size), merges all the sentences found in this group of rows, and feeds them to the network together. The algorithm then treats them as batched sentences, gives the results back to the planner (a MapPartitions planner), and splits them back into rows. As a result, the features created and collected from the batch and sent to TensorFlow are batched in larger groups, giving us 2.5 to 3 times faster inference in production.
      Distributed processing and cluster computing are mainly useful for processing large amounts of data, and using Spark on small data comes at a cost. To illustrate this with an analogy, Spark is like a locomotive racing a bicycle: the bike will win if the load is light, since it is quicker to accelerate and more agile, whereas with a heavy load the locomotive might take a while to get up to speed, but it will be faster in the end. To get around Spark's overhead on small data, we implemented the LightPipeline concept in Spark NLP, equivalent to a Spark ML pipeline but meant for smaller amounts of data. Spark NLP LightPipelines are Spark ML pipelines converted into single-machine, multi-threaded tasks, becoming more than 10x faster for smaller amounts of data. To use one, we simply plug in a trained (fitted) pipeline and annotate plain text; we do not even need to convert the input text to a data frame before feeding it into a pipeline that accepts data frames as input. This feature is quite useful for getting predictions on a few lines of text from a trained DL model before deploying the model on a cluster to process large volumes of data.
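      A minimal sketch of this usage, assuming `pipeline_model` is the fitted pipeline from the earlier sketch in Section 3:

```python
# Wrap a fitted Spark ML pipeline and annotate plain strings directly,
# without creating a Spark DataFrame first.
from sparknlp.base import LightPipeline

light = LightPipeline(pipeline_model)
annotations = light.annotate("The patient was given metformin for diabetes.")
print(annotations["ner_chunk"])  # extracted entity chunks for this sentence
```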

      5. Conclusion

      Despite the growing interest and groundbreaking advances in NLP research and NER systems, easy-to-use, production-ready models and tools are scarce in the clinical and biomedical domain, and this is one of the major obstacles for clinical NLP researchers seeking to implement the latest algorithms in their workflows and start using them immediately.
      In this study, we show through experiments on various clinical and biomedical datasets that the NER module of the Spark NLP library requires no handcrafted features or task-specific resources, achieves state-of-the-art scores on popular biomedical datasets and clinical concept extraction challenges (2010 i2b2/VA, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction), and exceeds the commercial entity extraction solutions AWS Medical Comprehend and Google Cloud Healthcare API by 8.9% and 6.7%, respectively, without using heavy language models like transformers. By porting a modified version of the well-known BiLSTM-CNN-Char NER architecture, together with a DL-based sentence detector and highly customizable tokenization, into the Spark environment, we were able to achieve state-of-the-art results in the biomedical domain even with general-purpose GloVe embeddings and no lexical features, producing better results than Stanza on 4 out of 8 benchmark datasets. We also showed that our implementation beats the same architecture in Keras under the same settings 7 out of 8 times on biomedical datasets in terms of macro F1 score, and is faster to train on half of the datasets on a single machine.
      Spark NLP's NER module can be extended to other spoken languages with zero code changes and can scale up in Spark clusters. It is available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and bindings for popular programming languages such as Python, R, Scala, and Java; and has already been extended to support other human languages with no code changes. A new NER model can be trained with a single line of code, as presented in Appendix 7. At the time of writing, Spark NLP for Healthcare provides more than 400 clinical entities through more than a hundred pretrained NER models (Fig. 2).
      Fig. 2. Supported NER models and named entities in Spark NLP for Healthcare.

      6. Impact overview

      Downloaded more than 25 million times, Spark NLP comes with more than 5000 pretrained models, and its licensed extension, Spark NLP for Healthcare, comes with more than 600 pretrained clinical models, all developed and trained with the latest state-of-the-art algorithms to solve real-world problems in the healthcare domain at scale. The NER algorithm investigated thoroughly in this study has been widely used in academia and industry due to its central place within any Spark NLP pipeline. The same NER architecture obtains new state-of-the-art results on seven public biomedical benchmarks [45] without using heavy contextual embeddings like BERT. [48] leveraged the pretrained healthcare NER models to analyze COVID-19 literature; [49] introduced pretrained NER models that can be used with text classifiers and relation extraction to mine adverse drug reactions from unregulated mediums such as social media; [50,51] used one of the pretrained NER models to extract clinical risk factors from free-text notes, creating features for predictive models over multi-modal electronic healthcare records by leveraging information from more prevalent diseases with shared clinical characteristics; [52] used clinical language models to generate a database of metabolomics software; and [53] studied the use of pretrained NER models to analyze COVID-19-related news in two prominent media outlets (CNN and The Guardian), examining the key entities, phrases, and biases, and how they change over time in news coverage, by correlating mined medical symptoms, procedures, drugs, and guidance with commonly mentioned demographic and occupational groups. The same study also analyzed the extraction of adverse drug events concerning drug and vaccine manufacturers, which, when reported by major news outlets, has an impact on vaccine hesitancy.

      CRediT authorship contribution statement

      Veysel Kocaman: Conceptualization, Methodology, Writing – original draft, Software, Data curation. David Talby: Software, Investigation, Supervision, Writing – review & editing.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgments

      We thank our colleagues and research partners who contributed to the former and current development of the Spark NLP library. We also thank our users and customers who helped us improve the library with their feedback and suggestions.

      Appendix A. Supplementary data

      The following is the Supplementary material related to this article.

      References

      [1] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, C. Cui, G. Corrado, S. Thrun, J. Dean, A guide to deep learning in healthcare, Nat. Med. 25 (2019) 24–29.
      [2] Y. Juhn, H. Liu, Artificial intelligence approaches using natural language processing to advance EHR-based clinical research, J. Allergy Clin. Immunol. 145 (2020) 463–469.
      [3] A. Liede, R.K. Hernandez, M. Roth, G. Calkins, K. Larrabee, L. Nicacio, Validation of international classification of diseases coding for bone metastases in electronic health records using technology-enabled abstraction, Clin. Epidemiol. 7 (2015) 441.
      [4] T.B. Murdoch, A.S. Detsky, The inevitable application of big data to health care, JAMA 309 (2013) 1351–1352.
      [5] V. Yadav, S. Bethard, A survey on recent advances in named entity recognition from deep learning models, 2019, arXiv preprint arXiv:1910.11470.
      [6] Ö. Uzuner, B.R. South, S. Shen, S.L. DuVall, 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, J. Am. Med. Inform. Assoc. 18 (2011) 552–556.
      [7] D. Tzitzivacos, International classification of diseases 10th edition (ICD-10): main article, CME: Your SA J. CPD 25 (2007) 8–10.
      [8] Ö. Uzuner, Y. Luo, P. Szolovits, Evaluating the state-of-the-art in automatic de-identification, J. Am. Med. Inform. Assoc. 14 (2007) 550–563.
      [9] S. Liu, B. Tang, Q. Chen, X. Wang, Effects of semantic features on machine learning-based drug name recognition systems: word embeddings vs. manually constructed dictionaries, Information 6 (2015) 848–865.
      [10] Y. Wu, M. Jiang, J. Xu, D. Zhi, H. Xu, Clinical named entity recognition using deep learning models, in: AMIA Annual Symposium Proceedings, Vol. 2017, American Medical Informatics Association, 2017, p. 1812.
      [11] J. Lafferty, A. McCallum, F.C. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, 2001.
      [12] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
      [13] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, 2015, arXiv preprint arXiv:1508.01991.
      [14] J.P. Chiu, E. Nichols, Named entity recognition with bidirectional LSTM-CNNs, Trans. Assoc. Comput. Linguist. 4 (2016) 357–370.
      [15] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, 2016, arXiv preprint arXiv:1603.01354.
      [16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
      [17] M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, 2018, arXiv preprint arXiv:1802.05365.
      [18] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.H. So, J. Kang, BioBERT: a pretrained biomedical language representation model for biomedical text mining, 2019, arXiv preprint arXiv:1901.08746.
      [19] E. Alsentzer, J.R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, 2019, arXiv preprint arXiv:1904.03323.
      [20] Y. Peng, S. Yan, Z. Lu, Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets, 2019, arXiv preprint arXiv:1906.05474.
      [21] I. Beltagy, K. Lo, A. Cohan, SciBERT: a pretrained language model for scientific text, 2019, arXiv preprint arXiv:1903.10676.
      [22] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, 2020, arXiv preprint arXiv:2007.15779.
      [23] J. Pennington, R. Socher, C.D. Manning, GloVe: global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
      [24] S. Arora, A. May, J. Zhang, C. Ré, Contextual embeddings: when are they worth it?, 2020, arXiv preprint arXiv:2005.09117.
      [25] A.R. Aronson, F.-M. Lang, An overview of MetaMap: historical perspective and recent advances, J. Am. Med. Inform. Assoc. 17 (2010) 229–236.
      [26] G.K. Savova, J.J. Masanz, P.V. Ogren, J. Zheng, S. Sohn, K.C. Kipper-Schuler, C.G. Chute, Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications, J. Am. Med. Inform. Assoc. 17 (2010) 507–513.
      [27] Y. Zhang, Y. Zhang, P. Qi, C.D. Manning, C.P. Langlotz, Biomedical and clinical English model packages in the Stanza Python NLP library, 2020, arXiv preprint arXiv:2007.14640.
      [28] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: fast and robust models for biomedical natural language processing, 2019, arXiv preprint arXiv:1902.07669.
      [29] V. Kocaman, D. Talby, Spark NLP: natural language understanding at scale, Softw. Impacts 8 (2021) 100058.
      [30] A. Stubbs, C. Kotfila, Ö. Uzuner, Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2/UTHealth shared task track 1, J. Biomed. Inform. 58 (2015) S11–S19.
      [31] S. Henry, K. Buchan, M. Filannino, A. Stubbs, O. Uzuner, 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records, J. Am. Med. Inform. Assoc. 27 (2020) 3–12.
      [32] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009, pp. 147–155.
      [33] A. Ghaddar, P. Langlais, Robust lexical features for improved neural network named-entity recognition, 2018, arXiv preprint arXiv:1806.03489.
      [34] W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization, 2014, arXiv preprint arXiv:1409.2329.
      [35] M. Habibi, L. Weber, M. Neves, D.L. Wiegandt, U. Leser, Deep learning with word embeddings improves biomedical named entity recognition, Bioinformatics 33 (2017) i37–i48.
      [36] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013, arXiv preprint arXiv:1301.3781.
      [37] X. Wang, Y. Zhang, X. Ren, Y. Zhang, M. Zitnik, J. Shang, C. Langlotz, J. Han, Cross-type biomedical named entity recognition with deep multi-task learning, Bioinformatics 35 (2019) 1745–1752.
      [38] P. Bhatia, B. Celikkaya, M. Khalilia, S. Senthivel, Comprehend Medical: a named entity recognition and relationship extraction web service, in: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2019, pp. 1844–1851.
      [39] A.E. Johnson, T.J. Pollard, L. Shen, H.L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L.A. Celi, R.G. Mark, MIMIC-III, a freely accessible critical care database, Sci. Data 3 (2016) 1–9.
      [40] P. Bhatia, B. Celikkaya, M. Khalilia, Joint entity extraction and assertion detection for clinical text, 2018, arXiv preprint arXiv:1812.05270.
      [41] W. Sun, A. Rumshisky, O. Uzuner, Evaluating temporal relations in clinical text: 2012 i2b2 challenge, J. Am. Med. Inform. Assoc. 20 (2013) 806–813.
      [42] T.A. Kass-Hout, Z. Xu, M. Mohebbi, H. Nelsen, A. Baker, J. Levine, E. Johanson, R.A. Bright, OpenFDA: an innovative platform providing access to a wealth of FDA's publicly available data, J. Am. Med. Inform. Assoc. 23 (2016) 596–600.
      [43] L. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: Third Workshop on Very Large Corpora, 1995. URL https://aclanthology.org/W95-0107.
      [44] S. Schweter, S. Ahmed, Deep-EOS: general-purpose neural networks for sentence boundary detection, in: KONVENS, 2019.
      [45] V. Kocaman, D. Talby, Biomedical named entity recognition at scale, in: International Conference on Pattern Recognition, Springer, 2021, pp. 635–646.
      [46] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
      [47] X. Yang, T. Lyu, Q. Li, C.-Y. Lee, J. Bian, W.R. Hogan, Y. Wu, A study of deep learning methods for de-identification of clinical notes in cross-institute settings, BMC Med. Inform. Decis. Mak. 19 (2019) 232.
      [48] V. Kocaman, D. Talby, Improving clinical document understanding on COVID-19 research with Spark NLP, 2020, arXiv preprint arXiv:2012.04005.
      [49] H.U. Haq, V. Kocaman, D. Talby, Mining adverse drug reactions from unstructured mediums at scale, 2022, arXiv e-prints arXiv–2201.
      [50] K. Agarwal, S. Choudhury, S. Tipirneni, P. Mukherjee, C. Ham, S. Tamang, M. Baker, S. Tang, V. Kocaman, O. Gevaert, et al., Preparing for the next pandemic: transfer learning from existing diseases via hierarchical multi-modal BERT models to predict COVID-19 outcomes, 2021.
      [51] S. Choudhury, K. Agarwal, C. Ham, P. Mukherjee, S. Tang, S. Tipirneni, C. Reddy, S. Tamang, R. Rallo, V. Kocaman, Tracking the evolution of COVID-19 via temporal comorbidity analysis from multi-modal data.
      [52] J. Dekermanjian, W. Labeikovsky, D. Ghosh, K. Kechris, MSCAT: a machine learning assisted catalog of metabolomics software tools, Metabolites 11 (2021) 678.
      [53] A.E. Varol, V. Kocaman, H.U. Haq, D. Talby, Understanding COVID-19 news coverage using medical NLP, 2022, arXiv preprint arXiv:2203.10338.