Highlights
• Named entity recognition (NER) is one of the most important building blocks of NLP in the medical domain: it extracts meaningful chunks from clinical notes and reports, which are then fed to downstream tasks such as assertion status detection, entity resolution, relation extraction, and de-identification. Given the growing volume of healthcare data in unstructured format, an increasingly important challenge is providing high-accuracy implementations of state-of-the-art deep learning (DL) algorithms at scale.
• While recent advances in NLP such as Transformers and BERT have pushed the boundaries of accuracy, these methods are significantly slower and difficult to scale to millions of records.
• In this study, we introduce an agile, production-grade clinical and biomedical NER algorithm based on a modified BiLSTM-CNN-Char DL architecture built on top of Apache Spark.
• Our NER implementation establishes new state-of-the-art accuracy on 7 of 8 well-known biomedical NER benchmarks and 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction. Moreover, clinical NER models trained with this implementation outperform commercial entity extraction solutions such as AWS Medical Comprehend and Google Cloud Healthcare API by a large margin (8.9% and 6.7% respectively), without using memory-intensive language models.
• The proposed model requires no handcrafted features or task-specific resources, needs minimal hyperparameter tuning for a given dataset from any domain, can be trained with any embeddings including BERT, and can be trained to support more human languages with no code changes. It is available within a production-grade code base as part of the Spark NLP library, the only open-source NLP library that can scale to make use of a Spark cluster for training and inference, has GPU support, and provides libraries for Python, R, Scala and Java.
Abstract
We introduce an agile, production-grade clinical and biomedical named entity recognition (NER) algorithm based on a modified BiLSTM-CNN-Char DL architecture built on top of Apache Spark. Our NER implementation establishes new state-of-the-art accuracy on 7 of 8 well-known biomedical NER benchmarks and 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction. Moreover, clinical NER models trained using this implementation outperform the commercial entity extraction solutions AWS Medical Comprehend and Google Cloud Healthcare API by a large margin (8.9% and 6.7% respectively), without using memory-intensive language models.
Electronic health records (EHRs) are the primary source of information for clinicians tracking the care of their patients. The EHR of a large medical organization can capture the medical transactions of over 10 million patients throughout the course of a decade. A single hospitalization alone typically generates around 150,000 pieces of data. The potential benefits derived from this data are significant. In aggregate, an EHR of this scale represents 200,000 years of doctor wisdom and 100 million years of patient outcome data, covering a plethora of rare conditions and maladies [
]. These records include information such as the reason for administering drugs, previous disorders of the patient, and the outcome of past treatments. It is estimated that unstructured data accounts for more than 80% of currently available healthcare data [
]. EHRs are the largest source of empirical data in biomedical research, and unlocking this information and making it available for downstream analysis can significantly advance biomedical and clinical research.
The widespread adoption of EHRs and the growing wealth of digitized information sources about patients are opening new doors to uncover previously unidentified associations and to accelerate knowledge discovery via state-of-the-art machine learning (ML) algorithms and new statistical methods. Due to innate obstacles in extracting information from unstructured text data and the high level of precision required in the healthcare domain, manual data abstraction has been prevalent in the industry.
However, extracting common trends or other insights from EHRs requires time-intensive manual review, even though the extracted data could be used for clinical research, accurate clinical modeling, or other administrative tasks. Information fed into these systems may be found in structured fields for which values are inputted electronically (e.g. laboratory test orders or results) [
]. Since manual abstraction is an expensive, time-consuming and error-prone process, there is a growing need for natural language processing (NLP) applications that automate the clinical abstraction process and make EHR data available through fast, scalable, and secure data pipelines.
Extracting valuable information from EHRs with intelligent systems starts with Named entity recognition (NER), a key building block of common NLP tasks such as question answering, topic modeling, information retrieval, etc [
]. In the medical domain, NER plays the most crucial role, extracting the first meaningful chunks from a clinical note and feeding them as input to subsequent downstream tasks such as clinical assertion status detection [
]. However, segmentation of clinical and drug entities is considered a difficult task in biomedical NER systems because of the complex orthographic structure of named entities [
ML methods formulate the clinical NER task as a sequence labeling problem that aims to find the best label sequence for a given input sequence (individual words from clinical text) [
], which is one of the most widely used solutions among conventional ML algorithms. A typical state-of-the-art clinical NER system usually utilizes features from different linguistic levels, including orthographic information (e.g., capitalization of letters, prefix and suffix), syntactic information (e.g., part of speech (POS) tags), word n-grams, word embeddings, and semantic information (e.g., the UMLS concept unique identifier) [
] have been crafted to address biomedical and clinical NER tasks with state-of-the-art results. However, since these methods require significant computational resources during both training and inference, using them in production is often impractical under restricted computational budgets, in contrast to classical pretrained embeddings (e.g. GloVe [
J. Pennington, R. Socher, C.D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, pp. 1532–1543.
] empirically shows that classical pretrained embeddings can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) on benchmark tasks.
Despite the growing interest and all these groundbreaking advances in NER systems, easy-to-use, production-ready models and tools are scarce. This is a major obstacle for clinical NLP practitioners who want to implement the latest algorithms in their data pipelines and apply them quickly. On the other hand, NLP toolkits specialized for processing biomedical and clinical text, such as MetaMap [
] have sought to address the issues discussed above by releasing new Python-based software libraries. Both libraries offer out-of-the-box clinical and biomedical pretrained NER models utilizing the state-of-the-art deep learning frameworks mentioned above. However, none of these libraries or tools can scale up to leverage compute clusters without compromising accuracy, nor do they support in-memory distributed data processing solutions such as Apache Spark.
] achieves new state-of-the-art results on popular biomedical benchmark datasets and clinical concept extraction challenges like the 2010 i2b2/VA challenge [
by a substantial margin (8.9% and 6.7% respectively), without using memory-intensive contextual embeddings like BERT. Using a modified version of the well-known BiLSTM-CNN-Char NER architecture [
] ported into the Apache Spark environment, this NER module can already be extended to other spoken languages with zero code changes and can scale up in Spark clusters. Sample code for the NER models can be found in the official GitHub repository.
The specific novel contributions of this paper are the following:
• Delivering the first production-grade, scalable NER implementation that is easy to train on any dataset without architectural modification.
• Delivering state-of-the-art NER models that achieve top scores on biomedical and clinical NER benchmark datasets and exceed commercial entity extraction solutions such as AWS Medical Comprehend and Google Cloud Healthcare API.
• Explaining the NER model implementation in Spark NLP, which is the only NLP library that can scale up in Spark clusters and supports multiple popular programming languages (Python, R, Scala and Java).
2. Implementation
The deep neural network architecture of the NER model in Spark NLP is the BiLSTM-CNN-Char framework, a slightly modified version of the architecture proposed by Chiu et al. [
]. It is a neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and convolutional neural network (CNN) architecture, eliminating the need for most feature engineering steps.
In the original framework, the CNN extracts a fixed-length feature vector from character-level features. For each word, these vectors are concatenated and fed to the BiLSTM network and then to the output layers. The authors employed a stacked bidirectional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In sum, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings are used for character features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) are used for case features. They also made use of lexicons as a form of external knowledge, as proposed in [
L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL-2009, 2009, pp. 147–155.
] through robust lexical features. After experimenting with different parameters and components, we decided to remove the lexical features to reduce complexity, and relied instead on pretrained biomedical embeddings, casing features, and character features extracted through a CNN. Since sentences are represented as two nested sequences (words and characters), the CNN is applied so that each character is embedded in a 25-dimensional character embedding matrix. A 1D convolution layer then processes the sequence of embedded character vectors, followed by a max-pooling operation, so that each word gets a fixed-size vector representation. We used 25 filters and a kernel size of 3. It is worth mentioning that character features have proved highly useful in NER models and provide a level of immunity to typos and spelling errors. Fig. 1 shows the overall architecture of our NER model.
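As a point of reference, the following is a minimal Keras sketch of this architecture: a character CNN (25-dimensional character embeddings, 25 filters, kernel size 3, max pooling) concatenated with pretrained token embeddings and a casing feature, feeding a BiLSTM with a per-token softmax layer. Padding lengths, vocabulary sizes, and layer names are illustrative assumptions and are not taken from the Spark NLP code base.

```python
# Minimal BiLSTM-CNN-Char sketch (illustrative; not the Spark NLP source).
from tensorflow.keras import layers, Model

MAX_WORDS, MAX_CHARS = 128, 16        # padded sentence / word lengths (assumed)
N_CHARS, N_CASES, N_TAGS = 100, 5, 9  # vocabulary sizes (assumed)
WORD_DIM, CHAR_DIM = 200, 25          # 200-d clinical embeddings, 25-d char embeddings

# Inputs: pretrained word vectors, character ids per word, casing category per word
word_emb = layers.Input(shape=(MAX_WORDS, WORD_DIM), name="word_embeddings")
char_ids = layers.Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32", name="char_ids")
case_ids = layers.Input(shape=(MAX_WORDS,), dtype="int32", name="case_ids")

# Character CNN: embed chars, 1D convolution (25 filters, kernel 3), max-pool over chars
char_vec = layers.TimeDistributed(layers.Embedding(N_CHARS, CHAR_DIM))(char_ids)
char_vec = layers.TimeDistributed(layers.Conv1D(25, 3, padding="same", activation="relu"))(char_vec)
char_vec = layers.TimeDistributed(layers.GlobalMaxPooling1D())(char_vec)

# Casing feature embedding (allCaps, upperInitial, lowercase, mixedCaps, noinfo)
case_vec = layers.Embedding(N_CASES, N_CASES)(case_ids)

# Concatenate word, char and casing features; BiLSTM; per-token softmax over tags
features = layers.Concatenate()([word_emb, char_vec, case_vec])
features = layers.Dropout(0.5)(features)
hidden = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(features)
tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(hidden)

model = Model([word_emb, char_ids, case_ids], tags)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```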
We implemented the model in TensorFlow (TF) using LSTMBlockFusedCell, an extremely efficient LSTM implementation based on [
] that uses a single TF operation for the entire LSTM and is reported to be both faster and more memory-efficient than LSTMBlockCell. We then implemented this framework in Scala using the TensorFlow API. This setup is ported into Spark and lets the driver node run the entire training using all the cores available on it. Spark NLP also provides a CUDA version of each TensorFlow component, which utilizes GPU acceleration when available.
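For illustration, training such a model through the Spark NLP Python API looks roughly like the sketch below; the CoNLL file name, the pretrained embedding name (a public general-purpose model rather than the clinical embeddings described later), and the hyperparameter values are placeholders.

```python
# Illustrative Spark NLP training sketch (placeholder file name, embeddings and hyperparameters).
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()  # gpu=True selects the CUDA build, if available

# The CoNLL reader yields document, sentence, token, pos and label columns
training_data = CoNLL().readDataset(spark, "ner_train.conll")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner_tagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label").setOutputCol("ner") \
    .setMaxEpochs(10).setLr(0.001).setBatchSize(128).setDropout(0.5) \
    .setRandomSeed(0)

ner_model = Pipeline(stages=[embeddings, ner_tagger]).fit(training_data)
```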
While the NER architecture extracts character-level features from word tokens, the most useful features come from the semantic vectors of each token. Due to the proven efficiency of domain-specific word embeddings in NER tasks [
], for learning distributed representations of words using contextual information. The trained word embeddings have 200 dimensions and a vocabulary size of 2.2 million. In order to compare the effectiveness of using biomedical data in our trained embeddings, we also used 300-dimensional pretrained GloVe embeddings with 6 billion tokens (denoted GloVe 6B), trained on the Wikipedia and Gigaword-5 datasets [
J. Pennington, R. Socher, C.D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, pp. 1532–1543.
]. The average word coverage of these domain-specific word embeddings is 99.5%, while the coverage of the GloVe 6B embeddings is 96.1%, on the biomedical datasets used in this study.
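As a rough sketch of how such coverage figures can be computed, one can count the fraction of corpus tokens found in the embedding vocabulary; the file names below are placeholders, and the token normalization (lowercasing) is a simplifying assumption.

```python
# Sketch: fraction of corpus tokens covered by an embedding vocabulary (placeholder file names).
def load_vocab(embeddings_path):
    # GloVe text format: "word v1 v2 ..." per line; keep only the word
    with open(embeddings_path, encoding="utf-8") as f:
        return {line.split(" ", 1)[0] for line in f}

def coverage(conll_path, vocab):
    covered = total = 0
    with open(conll_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                continue
            token = line.split()[0]
            total += 1
            covered += token.lower() in vocab
    return covered / total

vocab = load_vocab("glove.6B.300d.txt")
print(f"word coverage: {coverage('BC5CDR_test.conll', vocab):.1%}")
```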
2.1 Experiments
In this section, we describe the datasets and evaluation metrics, and provide an overview of the experimental setup. We set up three different experiments:
• Performance and timing evaluation on biomedical NER datasets using the same DL architecture in TF (via Keras) and in Spark NLP.
• Performance evaluation on clinical datasets studied in past i2b2 and n2c2 challenges.
• Benchmarking against the commercial entity extraction solutions AWS Medical Comprehend and Google Cloud Healthcare API.
2.1.1 Datasets
The development of NLP systems is contingent on access to relevant data, and EHRs are notoriously difficult to obtain for privacy reasons. Despite recent efforts to de-identify (Appendix 7) and release narrative EHRs for research, such datasets are still very rare. As a result, clinical NLP as a field has lagged behind; to address this problem, researchers have released various data sources.
For our first experiment, we trained individual NER models on 8 publicly available biomedical NER datasets provided by [
]: AnatEM, BC5CDR, BC4CHEMD, BioNLP13CG, JNLPBA, Linnaeus, NCBI-Disease and S800. These models cover a wide variety of entity types in domains ranging from anatomical analysis to genetics and cellular biology. For the sake of brevity, we do not include details about the nature of the datasets and readers can refer to cited papers for more information.
Fig. 1. Our proposed NER architecture. While the feature vectors for character embeddings and casing features have fixed dimensions, the main context is provided by token embeddings, which can be generated using multiple approaches (e.g., GloVe, BERT) and can have different dimensions.
Our second experiment is based on 3 different clinical concept extraction challenges. These challenges are organized by researchers at National NLP Clinical Challenges (n2c2), formerly known as i2b2 (Informatics for Integrating Biology and the Bedside). The challenges have multiple subtasks but, aligning with the scope of this paper, we focus only on the NER subtasks. They are supported with annotated datasets, which we use for our experiments. We used the 2010 i2b2/VA challenge [
] to test performance on medicine/drug entities. Concise details of these datasets are available in Appendix 7, while extensive details can be found in the cited papers.
Our third experiment is a detailed comparison of our proposed solution with commercially available NLP services. Several commercial NLP solutions make it easy to use machine learning to extract relevant medical information from unstructured text; AWS Medical Comprehend (AMC) [
] and Google Cloud Platform (GCP) Healthcare NLP are the most widely used and popular services at the moment. Since the data used by both services is confidential, and given that developing in-house datasets is highly expensive and time-consuming, it is reasonable to assume that, in addition to proprietary in-house datasets, they also included publicly available datasets for training purposes. To test this hypothesis, and for a fair comparison, we sampled one thousand clinical notes as a test set from the MIMIC-III dataset [
] and had them annotated by two medical doctors to populate ground truth for 3 types of entities: Problem, Test and Drug. We chose these entity types because AMC and GCP both support them and they are frequent in the source datasets. Details of the entity mappings between these services can be found in Appendix 7. Problem and Test entities originally come from the 2010 i2b2/VA challenge and Drug entities come from the 2018 n2c2 challenge, but AMC and GCP enriched and fine-tuned their models with additional datasets to achieve better generalization [
] tagging scheme, as this is the primary tagging scheme we use in our NER implementation. There are other tagging schemes as well, such as BIOES (Begin, Inside, Outside, End, Single), which is used in the original implementation of our NER architecture and reportedly offers considerable improvements over IOB, such as converging more quickly and predicting token boundaries better [
L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL-2009, 2009, pp. 147–155.
] since it has explicit boundary tags. However, we experienced various issues with the BIOES scheme, such as converging very fast in the early epochs but then failing to generalize further and getting stuck in local minima, so we decided to use the BIO scheme for our experiments. More details on other tagging schemes can be found in Appendix 7.
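To make the difference between the schemes concrete, the short sketch below converts an IOB-tagged sequence into BIOES: a multi-token chunk such as B-Problem I-Problem becomes B-Problem E-Problem, and a single-token chunk becomes S-Problem.

```python
# Sketch: convert an IOB (BIO) tag sequence into the BIOES scheme.
def iob_to_bioes(tags):
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        chunk_continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if chunk_continues else f"S-{label}")
        else:  # prefix == "I": keep I- inside a chunk, E- at its end
            bioes.append(f"I-{label}" if chunk_continues else f"E-{label}")
    return bioes

print(iob_to_bioes(["B-Problem", "I-Problem", "O", "B-Drug"]))
# -> ['B-Problem', 'E-Problem', 'O', 'S-Drug']
```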
Experiments were run on a server provided by Google (2 vCPU @ 2.2 GHz, 13 GB RAM), using Apache Spark in local mode (no cluster). We use the standard train and test splits for training and evaluating the models, and report both macro- and micro-averaged F1 scores (excluding O's). For comparison against existing benchmark papers, micro-averaging is used, as these papers primarily report micro scores; consequently, we use the same policy for evaluating the commercial APIs on clinical datasets. However, for comparisons that do not rely on external data (e.g., the comparison between Keras and our implementation), macro-averaging is used, as it is less sensitive to class imbalance and more representative of the model's ability to deal with a wide range of labels.
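For reference, the two token-level averaging policies (both excluding the O label) can be computed with scikit-learn as in the sketch below; the entity-level span matching used for the challenge benchmarks additionally requires chunk extraction, which is not shown here.

```python
# Sketch: token-level micro vs. macro F1, excluding the O label (scikit-learn).
from sklearn.metrics import f1_score

y_true = ["B-Problem", "I-Problem", "O", "B-Drug", "O", "B-Test"]
y_pred = ["B-Problem", "O",         "O", "B-Drug", "O", "B-Problem"]

labels = sorted({t for t in y_true + y_pred if t != "O"})  # drop O from both averages
micro = f1_score(y_true, y_pred, labels=labels, average="micro")
macro = f1_score(y_true, y_pred, labels=labels, average="macro")
print(f"micro F1: {micro:.3f}, macro F1: {macro:.3f}")
```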
For hyperparameter tuning, we ran experiments using random search to find the parameter settings that produce the best results. The final parameter values, along with the tested ranges, can be found in Appendix 7.
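A minimal random-search loop over such hyperparameters might look like the following sketch; the value ranges, the number of trials, and the train_and_score stub are placeholders rather than the settings actually used in this study.

```python
# Sketch: random search over NER hyperparameters (ranges and scoring stub are placeholders).
import random

def train_and_score(lr, batch_size, dropout, epochs):
    # Placeholder: in practice, train the NER model with these settings and return the dev-set F1.
    return random.random()

search_space = {
    "lr":         [0.0005, 0.001, 0.003, 0.01],
    "batch_size": [32, 64, 128, 256],
    "dropout":    [0.3, 0.5, 0.7],
    "epochs":     [10, 20, 30],
}

best_score, best_params = -1.0, None
for _ in range(20):  # number of random trials
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_score(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```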
3. Results
A NER system is usually part of an end-to-end NLP pipeline in which text is fed through several preprocessing steps. Since the DL algorithm we implement operates sentence-wise, and the features (embeddings and casing) are token-wise, sentence splitting and tokenization are the most important steps for achieving better accuracy. Using a DL-based sentence detector module [
] and a highly customizable rule-based tokenizer in Spark NLP, we ensured that the generated features are more informative. If a token labeled as an entity in the training set cannot be isolated from appended characters and punctuation, it may be treated as an out-of-vocabulary word when looking up its embedding, which harms learning.
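A sketch of the corresponding preprocessing stages in the Spark NLP Python API is shown below; the pretrained sentence detector model name refers to the public model catalog and may differ across versions.

```python
# Sketch: Spark NLP preprocessing stages used ahead of NER (pretrained model names may vary by version).
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# DL-based sentence boundary detection
sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]).setOutputCol("sentence")

# Rule-based tokenizer; split patterns and exceptions can be customized so entity tokens stay intact
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
```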
], we showed through extensive experiments that the NER module in the Spark NLP library exceeds the biomedical NER benchmarks reported by Stanza in 7 out of 8 benchmark datasets, and in every dataset reported by SciSpacy, without using heavy contextual embeddings like BERT. We present our results in Table 1. In addition to being the only NLP library capable of scaling up for training and inference in any Spark cluster, the Spark NLP NER architecture also obtains new state-of-the-art results on seven public biomedical benchmarks without using memory-intensive contextual embeddings like BERT. We also report significant improvements over Stanza on three benchmark datasets: BC4CHEMD: 93.72% (4.1% gain), Species800: 80.91% (4.6% gain), and JNLPBA: 81.29% (5.2% gain).
The benchmarks show that the Spark NLP NER models with pretrained clinical embeddings produce better results than Stanza in 7 out of 8 biomedical datasets. It is also surprising to see that, despite using general-purpose embeddings (GloVe 6B), our NER model exceeds Stanza's, which uses domain-specific embeddings (CharLM, a character-level language model [
A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
Table 1. NER performance with in-house clinical GloVe embeddings, open-source GloVe embeddings (6B), and BERT for Token Classification (BFTC) using BioBERT base cased, across different datasets in the biomedical domain. All scores reported are micro-averaged test F1 excluding O's. Stanza results are from the paper reported in
. The official training and validation sets are merged and used for training before testing the trained models on original test sets. For reproducibility purposes, we use the preprocessed versions of these datasets provided by
(also used by Stanza). Bold scores represent the best scores in the respective row (BFTC scores ignored for this analysis). The results can be reproduced with the Colab notebook shared in the official repository: https://github.com/JohnSnowLabs/spark-nlp-workshop.
Since the experiment in the previous section involved hyperparameter tuning (batch size and epoch count specifically), we define a new experiment for testing the speed and accuracy of our Spark implementation against the standard Keras framework, while keeping the hyperparameters constant across all datasets. To make the experiments reproducible, we use the same biomedical datasets and the standard GloVe 6B embeddings mentioned above. We trained several models to compare the training speed and performance of our NER implementation in Spark with the same DL architecture in TensorFlow using the Keras API.
The results show that the Spark NLP implementation beats the same architecture 7 out of 8 times in terms of macro F1 score and is faster to train in half of the datasets on a single machine, using all the available cores in both settings, as shown in Table 2. These results indicate that Spark NLP is not just a wrapper around TensorFlow; the accuracy improvement comes mainly from our modified NER DL architecture as well as from Spark itself, which uses a state-of-the-art DAG (directed acyclic graph) scheduler, a query optimizer, and a physical execution engine. The code for the TensorFlow implementation and the biomedical NER datasets are shared as supplementary material.
Table 2. Performance evaluation on biomedical NER datasets using the same BiLSTM-CNN-Char architecture in TensorFlow and Spark NLP under the same settings for each dataset. The Spark NLP implementation beats the same architecture 7 out of 8 times in terms of macro F1 score and is faster to train in half of the datasets (macro average F1 score, embeddings glove6B_300d, lr 0.001, dropout 0.5, LSTM state size 200, epoch 10, batch size 128, optimizer Adam). Bold letters represent best results.
Using the official train sets from the n2c2 challenges explained in the previous section, we trained NER models, and obtained metrics on the official test sets used in the challenges. The results can be seen in Table 3. The results show that the proposed NER model performs better than the best results published so far on these datasets.
Table 3. Performance metrics on the 2010 i2b2/VA clinical concept extraction, 2014 i2b2 de-identification, and 2018 n2c2 medication extraction challenges. Scores indicate entity-level (span match) micro F1 scores (strict match, excluding O's) on the official test sets. BERT-based benchmarks are omitted from this study to allow a fair comparison between similar DL architectures.
There are several commercial NLP solutions that make it easy to use machine learning to extract relevant medical information from unstructured text. AWS Medical Comprehend (AMC) [
] and Google Cloud Platform (GCP) Healthcare NLP are the most widely used and popular services at the moment. Fueled by the high-quality data that AWS and Google already have, their solutions are expected to be of the highest quality and to generalize well on unseen clinical texts. We built a custom test set (explained in the previous section) to compare these commercial solutions with ours.
Since the sampled test set comes from a different distribution than the training sets the models were trained on, the results are, as expected, not as high as those on the official test sets. Nevertheless, the clinical NER models trained with the Spark NLP implementation exceed the commercial entity extraction solutions by a large margin (8.9% and 6.7% respectively) without using memory-intensive language models like transformers. The benchmark results can be seen in Table 4, and the sampled annotations can be shared upon request.
Table 4. Comparison of our NER models with AWS Medical Comprehend (AMC) and Google Cloud Platform (GCP) Healthcare API on 1000 randomly sampled clinical notes from the MIMIC-III database. Tests were run on three major entity classes (Problem, Test, Drug); the Spark NLP clinical NER models are 8.9% and 6.7% better than AMC and GCP respectively on average (macro F1 score).
Due to architectural design choices in the TensorFlow implementation on the JVM at the time of writing, distributing model training over the worker nodes of a cluster was not viable, and putting the burden of the entire training process on the driver node imposed some limitations in terms of computational resources. Nevertheless, the ability to get predictions at scale from voluminous data with state-of-the-art accuracy outweighs this disadvantage. As explained in the results section, training the proposed model is also faster than the base version of the same architecture in plain TF.
To train large models on local machines with less memory, we implemented a dynamic memory optimization feature that can be turned on or off when setting up training. Through this feature, we can decide whether the features are collected and generated all at once and then fed into the network batch by batch, or collected and generated per batch and then fed into the network in batches. This proved highly useful for speeding up the training process on large-memory machines as well as on driver nodes.
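To the extent that this feature is exposed in the Python API, it corresponds to a single flag on the NER trainer; the parameter name below (enableMemoryOptimizer) is our reading of the public API and should be verified against the installed version.

```python
# Assumed parameter name (enableMemoryOptimizer); verify against the installed Spark NLP version.
from sparknlp.annotator import NerDLApproach

ner_tagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label").setOutputCol("ner") \
    .setEnableMemoryOptimizer(True)  # generate features batch by batch instead of collecting them all at once
```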
We also implemented batch prediction, in which the Spark executor collects the available rows (up to a batch size), merges all the sentences found in this group of rows, and feeds them to the network together. The algorithm then treats them as batched sentences, gives them back to the planner (a MapPartitions planner), and splits them back into rows. As a result, the features created and collected from the batch and sent to TensorFlow are batched into larger groups, giving us 2.5 to 3 times faster inference in production.
Distributed processing and cluster computing are mainly useful for processing large amounts of data, and using Spark for small data comes at a cost. To illustrate this with an analogy, Spark is like a locomotive racing a bicycle: the bike will win if the load is light, since it is quicker to accelerate and more agile, but with a heavy load the locomotive may take a while to get up to speed, yet it will be faster in the end. To get around the overhead of Spark on small data, we implemented the LightPipeline concept in Spark NLP, equivalent to a Spark ML Pipeline but meant to deal with smaller amounts of data. Spark NLP LightPipelines are Spark ML pipelines converted into single-machine, multi-threaded tasks, becoming more than 10x faster for smaller amounts of data. To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We do not even need to convert the input text to a data frame to feed it into a pipeline that accepts a data frame as input in the first place. This feature is quite useful for getting predictions for a few lines of text from a trained DL model before deploying the model on a cluster to process a large volume of data.
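A usage sketch, assuming pipeline_model is a fitted end-to-end pipeline that starts from raw text:

```python
# Sketch: single-machine, multi-threaded inference on small inputs with LightPipeline.
from sparknlp.base import LightPipeline

# pipeline_model: a fitted pipeline (document assembler, sentence detector, tokenizer,
# word embeddings and NER model), as sketched in the previous sections.
light = LightPipeline(pipeline_model)
result = light.annotate("The patient was given metformin for type 2 diabetes mellitus.")
print(result["ner"])  # per-token IOB tags, aligned with result["token"]
```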
5. Conclusion
Despite the growing interest and groundbreaking advances in NLP research and NER systems, easy-to-use, production-ready models and tools are scarce in the clinical and biomedical domain, and this is one of the major obstacles preventing clinical NLP researchers from implementing the latest algorithms in their workflows and putting them to use immediately.
In this study, we show through experiments on various clinical and biomedical datasets that the NER module of the Spark NLP library requires no handcrafted features or task-specific resources, achieves state-of-the-art scores on popular biomedical datasets and clinical concept extraction challenges (2010 i2b2/VA, 2014 n2c2 de-identification and 2018 n2c2 medication extraction), and exceeds the commercial entity extraction solutions AWS Medical Comprehend and Google Cloud Healthcare API by 8.9% and 6.7% respectively, without using heavy language models like transformers. By porting a modified version of the well-known BiLSTM-CNN-Char NER architecture, together with a DL-based sentence detector and highly customizable tokenization, into the Spark environment, even with general-purpose GloVe embeddings and no lexical features, we were able to achieve state-of-the-art results in the biomedical domain and produce better results than Stanza in 4 out of 8 benchmark datasets. We also showed that our implementation beats the same architecture in Keras under the same settings 7 out of 8 times on biomedical datasets in terms of macro F1 score and is faster to train in half of the datasets on a single machine.
Spark NLP's NER module can also be extended to other spoken languages with zero code changes and can scale up in Spark clusters. The model is available within a production-grade code base as part of the open-source Spark NLP library; it can scale up for training and inference in any Spark cluster, has GPU support and libraries for popular programming languages such as Python, R, Scala and Java, and a new NER model can be trained with a single line of code, as presented in Appendix 7. At the time of this study, Spark NLP for Healthcare provides more than 400 clinical entities from more than a hundred pretrained NER models (Fig. 2).
Fig. 2. Supported NER models and named entities in Spark NLP for Healthcare.
Downloaded more than 25 million times, Spark NLP comes with more than 5 thousand pretrained models, and its licensed extension, Spark NLP for Healthcare, comes with more than 600 pretrained clinical models, all developed and trained with the latest state-of-the-art algorithms to solve real-world problems in the healthcare domain at scale. The NER algorithm investigated thoroughly in this study has been widely used in academia and industry due to its central place within any Spark NLP pipeline. The same NER architecture obtains new state-of-the-art results on seven public biomedical benchmarks [
] introduced pretrained NER models that can be used with text classifiers and relation extraction to extract adverse drug reactions from unregulated media such as social media; [
S. Choudhury, K. Agarwal, C. Ham, P. Mukherjee, S. Tang, S. Tipirneni, C. Reddy, S. Tamang, R. Rallo, V. Kocaman, Tracking the Evolution of COVID-19 via Temporal Comorbidity Analysis from Multi-Modal Data.
] used one of the pretrained NER models to extract clinical risk factors from free text notes to create features for developing predictive models from multi-modal electronic healthcare records by leveraging information from more prevalent diseases with shared clinical characteristics; [
] studied the impact of pretrained NER models to analyze Covid-19 related news in two prominent media outlets (CNN and Guardian) and analyzed the key entities, phrases, biases, and how they change over time in news coverage by correlating mined medical symptoms, procedures, drugs, and guidance with commonly mentioned demographic and occupational groups. The same study also analyzed the extraction of Adverse Drug Events about drug and vaccine manufacturers, which when reported by major news outlets has an impact on vaccine hesitancy.
CRediT authorship contribution statement
Veysel Kocaman: Conceptualization, Methodology, Writing – original draft, Software, Data curation. David Talby: Software, Investigation, Supervision, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank our colleagues and research partners who contributed to the former and current development of the Spark NLP library. We also thank our users and customers who helped us improve the library with their feedback and suggestions.
Appendix A. Supplementary data
The following is the Supplementary material related to this article.