vid-SAMGRAH: A PyTorch framework for multi-latent space reinforcement learning driven video summarization in ultrasound imaging

The COVID-19 pandemic has accelerated the need for automatic triaging and summarization of ultrasound videos, both to give fast access to pathologically relevant information in the Emergency Department and to lower the resource requirements of telemedicine. In this work, we present a PyTorch-based unsupervised reinforcement learning methodology that incorporates multi-feature fusion to output classification labels, segmentation maps and summary videos for lung ultrasound. Unsupervised training eliminates the tedious manual labeling of key frames by clinicians, opening new frontiers in scalable training with unlabeled or weakly labeled data. Our approach was benchmarked against expert clinicians from different geographies and displayed superior Precision and F1 scores (over 80% and 44%, respectively).


Introduction
The COVID-19 pandemic has amplified the diagnostic potential of ultrasound imaging for continuous monitoring: it is free from ionizing radiation, portable and rarely requires dedicated facilities, making it an economical diagnostic tool compared with other imaging modalities [1][2][3]. However, deploying it (or any imaging modality) in practice faces a twofold challenge, viz. (1) the dearth of human experts (clinicians) to interpret the large volumes of data generated, and (2) the limited time clinicians have to provide inferences for large amounts of data. To put the volume in perspective, a typical lung ultrasound video acquired at 30 fps for 30 s produces about a thousand frames (30 × 30 = 900) for a single assessment. This large volume of data poses two major challenges: (1) clinicians' time is limited, so manually combing through huge volumes of data becomes impractical, and (2) storing and transmitting such volumes strains the storage and bandwidth available for telemedicine. To address these, we have developed a fast and reliable methodology to bridge the gap between machine data and human experts. We present a PyTorch-based multi-latent space reinforcement learning driven video summarization tool that extracts the relevant key frames from a given ultrasound video. The key frames convey diagnostically relevant information by overlaying the machine classification score (healthy or unhealthy lung) and highlighting pathologically relevant features such as B-lines, Pleura and A-lines, making it easier for the clinician to make the final decision.
The tool aids the human expert by providing key frames with overlaid pathological segmentations, which reduces the amount of manual intervention and lowers the storage and bandwidth requirements in telemedicine.

The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.

Description
To perform video summarization and provide machine scores with pathological markings, an unsupervised multi-latent space based reinforcement learning (RL) methodology is employed. The algorithm is designed to provide robust and superior video summarization by carefully analyzing the components of the summarization pipeline, such as the encoder-decoder pair and the formation of the reward function. In this paper, the summarization pipeline is outlined with a graphical abstract in Fig. 1; the complete implementation of the algorithm and its details can be found in [4] (the link to the paper is provided in the GitHub repo [5]). In addition to the paper, the code and trained models are readily accessible from the GitHub repo, which includes a sample set of 4 ultrasound videos to demonstrate the working of the summarization algorithm.

Encoder-decoder
The formation of latent vectors from multiple encoders (Classification, Segmentation and Auto-Encoding) provides the summarization network with different aspects of the lung image, making it a robust methodology for ultrasound video summarization. Unlike natural images, ultrasound images are characterized by important pathological features that are crucial for diagnosis. For example, a classifier trained to distinguish between healthy and unhealthy lungs might focus on the presence of B-lines and subpleural consolidations, among other features, to categorize a frame. A segmentation network would focus on texture and intensity to segment the lung into anatomical structures such as Pleura, A-lines and B-lines, whereas an autoencoder forms the most compact representation of the image frame. The three encoders are therefore responsible for encoding different aspects of the lung image. By employing a fusion of the three (i.e., the classification network, the segmentation network and the auto-encoding network) we obtain a representation of the image that is robust for summarization. The fusion of these multi-latent space features is performed using an attention mechanism to ensure that relevant features are fused in order to obtain the best summarization possible. Since a video is a sequence of frames, the summarization is performed using the LSTM architecture [6], which naturally handles sequences and incorporates temporal information between frames. The LSTM assigns each frame a probability indicating its importance for inclusion in the summary, and the summary is then formed by selecting the frames with the top probability scores.
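The fusion and selection steps described above can be illustrated with a minimal, standard-library Python sketch. It is not the implementation from [4] or the GitHub repo [5]: the scaled dot-product scoring, the `query` vector (standing in for learned attention parameters), the function names and the toy values are all assumptions made for illustration, and the frame probabilities are placeholders for what the LSTM would output.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(latents, query):
    """Fuse same-length latent vectors into a single vector.

    Each latent gets a scaled dot-product score against `query`
    (hypothetical stand-in for learned parameters); the softmax of
    those scores weights the sum of the latents.
    """
    dim = len(query)
    scores = [sum(q * v for q, v in zip(query, z)) / math.sqrt(dim)
              for z in latents]
    weights = softmax(scores)
    fused = [sum(w * z[i] for w, z in zip(weights, latents))
             for i in range(dim)]
    return fused, weights

def select_key_frames(frame_probs, fraction=0.25):
    """Return the sorted indices of the highest-probability frames."""
    k = max(1, int(len(frame_probs) * fraction))
    ranked = sorted(range(len(frame_probs)),
                    key=lambda i: frame_probs[i], reverse=True)
    return sorted(ranked[:k])

# Toy latent vectors for one frame (assumed values, 4-dimensional).
clsf_z = [1.0, 0.0, 0.0, 0.0]   # classification encoder
seg_z = [0.0, 1.0, 0.0, 0.0]    # segmentation encoder
ae_z = [0.0, 0.0, 1.0, 0.0]     # autoencoder
query = [2.0, 0.0, 0.0, 0.0]    # hypothetical attention query

fused, weights = attention_fuse([clsf_z, seg_z, ae_z], query)

# Placeholder per-frame importance probabilities for an 8-frame video.
frame_probs = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4]
summary = select_key_frames(frame_probs, fraction=0.25)
print(summary)  # [1, 3]
```

With these toy values the classification latent receives the largest attention weight because it aligns with the query, and the two highest-scoring frames form the summary. The `fraction` default of one quarter mirrors the summary length discussed in the Results.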

Rewards
The summarization algorithm is trained following an unsupervised RL methodology [7,8]. Ground-truth clinical annotation of key frames is impractical because the clinician would need to comb through thousands of frames across the videos. In ultrasound videos, key frames are often few and far between, which makes it even harder for clinicians to label these videos reliably. With RL, training can instead be driven by suitable reward functions that encourage the algorithm to select diagnostically relevant key frames based on pathological features, without the need for manual labeling by a radiologist. The selection of such key frames is enforced by novel components of the reward: (1) clsf: a score based on the health state of the lung, obtained from the classifier network, which preferentially picks out unhealthy frames, (2) ssim: a structural similarity index score that promotes the selection of dissimilar frames, and (3) rep + div: a representativeness and diversity score [7] that balances the summary between representative and diverse frames of the video.
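As a rough illustration of how the reward components listed above could be computed for a candidate summary, the following standard-library Python sketch may help. The rep and div terms are modeled on the diversity-representativeness reward of [7]. The exact forms of clsf (mean unhealthy probability of the selected frames) and ssim (one minus the mean pairwise SSIM, using a single global window), the unweighted sum, and all function names and toy values are assumptions, not the definitions used in [4].

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def diversity_reward(feats, picks):
    """Mean pairwise cosine dissimilarity among the selected frames."""
    n = len(picks)
    if n < 2:
        return 0.0
    total = sum(1.0 - cosine(feats[i], feats[j])
                for i in picks for j in picks if i != j)
    return total / (n * (n - 1))

def representativeness_reward(feats, picks):
    """exp(-mean distance from each frame to its nearest selected frame)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    mean_min = sum(min(dist(f, feats[j]) for j in picks)
                   for f in feats) / len(feats)
    return math.exp(-mean_min)

def global_ssim(x, y, data_range=1.0):
    """SSIM over one global window of two flattened frames (simplified;
    standard SSIM averages the same expression over local windows)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))

def ssim_reward(frames, picks):
    """Assumed form: 1 - mean pairwise SSIM among the selected frames."""
    pairs = [(i, j) for i in picks for j in picks if i < j]
    if not pairs:
        return 0.0
    return 1.0 - sum(global_ssim(frames[i], frames[j])
                     for i, j in pairs) / len(pairs)

def clsf_reward(p_unhealthy, picks):
    """Assumed form: mean classifier 'unhealthy' probability of the picks."""
    return sum(p_unhealthy[i] for i in picks) / len(picks)

def total_reward(feats, frames, p_unhealthy, picks):
    """Unweighted sum of the four components (the weighting is an assumption)."""
    return (clsf_reward(p_unhealthy, picks) + ssim_reward(frames, picks)
            + representativeness_reward(feats, picks)
            + diversity_reward(feats, picks))

# Toy 4-frame video: frames 0-1 are identical, as are frames 2-3.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
frames = [[0.1, 0.1, 0.9, 0.9], [0.1, 0.1, 0.9, 0.9],
          [0.9, 0.1, 0.9, 0.1], [0.9, 0.1, 0.9, 0.1]]
p_unhealthy = [0.2, 0.2, 0.9, 0.9]

diverse = total_reward(feats, frames, p_unhealthy, [0, 2])
redundant = total_reward(feats, frames, p_unhealthy, [0, 1])
print(diverse > redundant)  # True
```

In this toy case, selecting one frame from each pair scores higher than selecting two identical frames, because the diverse selection covers the whole video, contains dissimilar images and includes an unhealthy frame.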

Results
The summarization network was trained using 100 lung ultrasound videos obtained from 3 countries during the pandemic (2020) and tested on an independent set of 26 ultrasound videos from 2021, which were annotated by expert clinicians from 3 geographies to form the ground truth for benchmarking our system. The unsupervised summarization algorithm is trained using all the rewards described in Section 2.2 (Rewards), and the results are presented in Table 1. The high Precision (80%), F1-Score (44%) and Reduction factor (ReF) (77%) demonstrate the superior and robust summarization capability of our approach of using attention-based fusion of latent vectors for unsupervised schemes. A lower value of recall is expected for two reasons: the summary is limited to a small subset of frames (less than one-fourth the length of the whole video, as indicated by the ReF), and in ultrasound videos multiple frames convey similar diagnostic information, so many frames are equally likely candidates for the summary. Comparing our results with approaches that do not fuse latent spaces makes it immediately clear that the fusion of the encoders is essential for producing an accurate and robust video summary. A summarization result for the attention encoder fusion is presented in Fig. 2 to illustrate the summarization and the overlaying of machine scores for lung health and segmentations of pathologically relevant features. In addition, we validate the efficacy of the novel rewards introduced here (clsf + ssim), which are tuned to lung ultrasound video summarization. The results in Table 2 support the use of focused rewards, which help the algorithm preferentially summarize diagnostically relevant frames, over general rewards.
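To make the relationship between Precision, recall, F1 and ReF concrete, here is a small Python sketch. It assumes that a selected frame counts as correct only if its index appears in the clinician annotation and that ReF is the fraction of frames removed from the video. Both are assumptions consistent with the text above, not the evaluation protocol of [4], and the frame indices are invented for illustration.

```python
def summary_metrics(selected, ground_truth, n_frames):
    """Frame-level precision, recall, F1 and reduction factor (ReF)."""
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)  # selected frames that clinicians marked
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    ref = 1.0 - len(selected) / n_frames  # assumed: fraction of frames removed
    return {"precision": precision, "recall": recall, "f1": f1, "ref": ref}

# Toy example: a 20-frame video, 10 annotated key frames, 5 selected frames.
annotated = [2, 3, 4, 5, 9, 10, 11, 14, 15, 16]
selected = [2, 5, 9, 14, 17]
metrics = summary_metrics(selected, annotated, n_frames=20)
print(metrics)
```

Here four of the five selected frames are annotated key frames, giving a precision of 0.8, but the short summary can cover only four of the ten annotated frames, so recall is 0.4 and F1 is about 0.53 at a ReF of 0.75. This mirrors the pattern in the reported results: a short summary can be highly precise while recall stays low.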

Software impact
During the ongoing pandemic and in the post-COVID-19 years, the necessity and application of telemedicine as a means of remote consultation are well understood [9]. One such application is point-of-care ultrasound (POCUS), which has been one of the central themes in telemedicine [10][11][12]. POCUS eliminates the need for the physical presence of an expert radiologist by compensating with large volumes of scan data, acquired by minimally trained technicians, that encompass the essential diagnostic information. Hence surplus data has to be sent to expert radiologists for referral via telemedicine. This poses a twofold problem: (1) telemedicine is severely limited throughout the world by the available bandwidth and storage, and (2) the time of expert radiologists is limited. Software that makes the acquired data suitable for telemedicine transmission and helps clinicians make decisions quickly is therefore of the utmost necessity. The work presented here is a step in that direction, aimed at developing focused methods that make video summarization (in this case for lung ultrasound) robust by using concepts from artificial intelligence.
The software developed herein can speed up imaging and diagnosis/prognostication of pulmonary cases, which is particularly helpful during the COVID-19 pandemic, when medical facilities and practitioners are stretched to their limits. As caseloads increase exponentially, the proposed video summarization methodology helps in monitoring disease progression periodically and in evaluating lung involvement for many patients using a simple setup of a portable ultrasound machine. From a societal perspective, this software innovation can transform emergency departments (ED), community scanning centers or patients' homes equipped with a mere ultrasound facility or POCUS into a good diagnostic location, able to mimic the radiology services of a full-fledged hospital with sophisticated imaging modalities like CT that are beyond the reach of millions due to patient load, cost and availability of radiologists.
With the introduction of this video summarization methodology, two main challenges have been overcome. First, the need for an expert's physical presence, time and concentration to perform the study, identifying and interpreting anomalies, has been considerably relaxed. This is achieved through software segregation of diagnostically relevant information from long ultrasound scans, which can be performed by technicians or junior doctors with minimal experience. The algorithm summarizes the long videos into key frames that are diagnostically relevant, highlights machine scores (labels that predict whether the lung is healthy or unhealthy) and overlays pathological segmentations (A-lines, B-lines and Pleura). This highlighted summary can then be sent to expert radiologists via telemedicine for final diagnosis, making it easier for the radiologist at a remote location to judge the progression of the disease quickly. Second, the use of a summarization algorithm makes it possible to condense large volumes of data into smaller sizes. The algorithm is capable of providing a robust summary with high precision using just one-fourth of the original video length. This reduces the storage and transmission bandwidth requirement by roughly a factor of four, thereby enabling better communication over the low-bandwidth internet connections that are typical in many remote locations. The final output is a web-application software for video summarization that summarizes the given video succinctly and provides information related to diagnosing the case, with machine-overlaid scores and segmentations of pathologically relevant features, as shown in Fig. 3.

Fig. 3. Web-application deployment. The tool has the option to select the classification labels and the segmentation maps to be overlaid on the frames. Provision is also given to save the frame features. The tool will show the summary video, a collage of 9 random frames from the summary video and the frame numbers.
Finally, unlike natural videos, medical videos are critical, as information missing from the summary can be deleterious. Medical videos across modalities are diverse: the different presets of the machines used, the skill of the radiologist acquiring the scan, geographical differences and similar factors make it harder to obtain a robust summary. To overcome this challenge, we propose latent vector fusion, which increases reliability by using a multi-feature map of the video to summarize it. This paves the way for further research into standardizing scan procedures and into video and feature preprocessing that compensates for different presets, in order to make summarization more reliable and robust.

Limitations and future development
The work presented here is a proof of concept towards robust ultrasound video summarization software. At present the system is trained and validated with ultrasound scans acquired in various countries, on different ultrasound machines and by separate clinicians. Future work will include analyzing the proposed system with better-curated ultrasound scans, i.e., data from distinct machine vendors and standardized scans with specific presets, which are expected to further increase the performance of the proposed methodology. The methodology described above is also not limited to lung ultrasound and can readily be extended to other ultrasound videos, such as wrist, elbow or liver ultrasound, as well as to other imaging modalities.

Conclusion
We have developed a robust video summarization tool for lung ultrasound that aids clinicians by preferentially selecting diagnostically important frames for the summary, overlaying them with a machine-generated lung health score and highlighting pathologically relevant features. This will make it easier and quicker for clinicians to reach a diagnosis and assess the progression of the disease. It will also result in shorter videos that are better suited to storage and transmission in POCUS and telemedicine, which is arguably the future trend in ultrasound radiology.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.