COVID-19 Lung Lesion Segmentation Using a Sparsely Supervised Mask R-CNN on Chest X-rays Automatically Computed from Volumetric CTs

Presented at the 2021 Annual Meeting of the Society for Imaging Informatics in Medicine (SIIM21)
Cite as:
V. Ramesh, B. Rister, and D. L. Rubin. "COVID-19 Lung Lesion Segmentation Using a Sparsely Supervised Mask R-CNN on Chest X-rays Automatically Computed from Volumetric CTs." arXiv:2105.08147 [eess.IV], May 2021.

Chest X-rays of COVID-19 patients are frequently obtained to determine the extent of lung disease and are a valuable source of data for creating AI models. Most work to date assessing disease severity on chest imaging has focused on segmenting CT images; however, given that CTs are performed much less frequently than chest X-rays for COVID-19 patients, automated lung lesion segmentation on chest X-rays is clinically valuable. To accelerate severity detection and augment the amount of publicly available chest X-ray training data for supervised deep learning models, we propose an automated pipeline for segmentation of COVID-19 lung lesions on chest X-rays, comprising a Mask R-CNN trained on CheXMix, our newly released dataset containing a mixture of open-source chest X-rays and coronal X-ray projections computed from annotated volumetric CTs.


We develop an automated pipeline for COVID-19 lung lesion segmentation on chest X-rays. Because publicly available annotated chest X-ray data are scarce, we implement a pixel-based re-projection algorithm that generates coronal X-ray projections from annotated volumetric CTs to augment the training dataset. A Mask R-CNN framework is then trained on this mixed dataset. Despite only limited supervised training, our model substantially outperforms existing published baselines.

CT → CXR Conversion

We implement a pixel-based re-projection method, modeled as a sub-problem of ray tracing[1], to compute chest X-rays as coronal projections of volumes of axial CT slices:

Equation 1: CT to X-ray Conversion
We perform the following summation:

$\Theta(x,z) = \sum_{i=1}^{Y} \Phi(x,i,z) \quad \forall (x,z) \in [1, X] \times [1, Z],$

where $\Phi$ is an $X \times Y \times Z$ 3D array denoting the volumetric CT and $\Theta$ is an $X \times Z$ matrix denoting the computed X-ray.
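
For readers who want to verify the projection, the summation above can be written as a single NumPy reduction over the anterior-posterior axis. The snippet below is a minimal sketch; the array shape and axis order are illustrative assumptions.

import numpy as np

# Minimal sketch: Phi is an X x Y x Z volume (axis order assumed); summing over
# the y axis (axis=1) produces the X x Z coronal projection Theta.
Phi = np.random.rand(512, 512, 300)   # hypothetical CT volume with X=512, Y=512, Z=300
Theta = Phi.sum(axis=1)               # Theta(x, z) = sum over i of Phi(x, i, z)
assert Theta.shape == (512, 300)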

The algorithm is implemented in Python as follows:

Algorithm 1: create_CXR()
from PIL import Image
import numpy as np

def create_CXR(images): # images is a list of paths to axial CT slices
  # Determine the slice dimensions from the first image; PIL's size is (width, height).
  # We assume all slices in the volume share the same dimensions.
  WIDTH, HEIGHT = Image.open(images[0]).convert('L').size

  # The coronal X-ray has one row per axial slice and one column per x position
  xray = np.zeros((len(images), WIDTH))

  for z in range(len(images)):
    img = Image.open(images[z]).convert('L')  # convert image to 8-bit grayscale
    data = list(img.getdata()) # flatten image data to a list of integers
    # convert that to a 2D list (list of rows of integers)
    pixels = [data[offset:offset+WIDTH] for offset in range(0, WIDTH*HEIGHT, WIDTH)]

    # Loop from left to right on the CT slice
    for x in range(WIDTH):
      # Sum the pixel values in the current x column (the anterior-posterior direction)
      column_sum = 0
      for y in range(HEIGHT):
        column_sum += pixels[y][x]
      # Assign the sum to the point (x, z) on the coronal image; rows are flipped
      # so that the first slice in the list maps to the bottom row of the X-ray
      xray[len(images) - 1 - z][x] = column_sum

  return xray
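
A minimal usage sketch follows; the file names and directory layout here are assumptions, not part of the released code.

import glob
from PIL import Image
import numpy as np

# Hypothetical input: axial CT slices saved as individual image files
slices = sorted(glob.glob('ct_volume/slice_*.png'))
xray = create_CXR(slices)

# Rescale the summed intensities to 8-bit and save the coronal projection
xray = (255 * xray / xray.max()).astype(np.uint8)
Image.fromarray(xray).save('computed_cxr.png')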

Our algorithm uses (a) a volumetric CT and (b) a volume of mask slices to compute (c) an annotated coronal CXR.
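
One plausible way to carry the lesion annotations along is to apply the analogous projection to the binary mask volume and keep every column that intersects a lesion in any slice. The sketch below illustrates this; the function name, the stack layout, and the max-projection rule are assumptions for illustration rather than the exact rule used in the released pipeline.

import numpy as np

def project_mask(mask_volume):
    # mask_volume: Z x H x W stack of binary lesion masks aligned with the CT slices (assumed layout)
    mask_volume = np.asarray(mask_volume, dtype=bool)
    coronal_mask = mask_volume.any(axis=1)   # collapse the anterior-posterior axis
    return coronal_mask[::-1]                # flip rows to match create_CXR's slice ordering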

CheXMix: a chest X-ray dataset containing a mixture of patient X-rays and coronal CT projections with COVID-19 lung lesion annotations

We present CheXMix, the first publicly available, open-source chest X-ray dataset containing over 100 images (assembled from a variety of public sources) with COVID-19 lung lesion annotations produced by our Mask R-CNN model.

Lung Lesion Segmentation

We employ a naive implementation of the Mask R-CNN framework for the task of instance segmentation. In a Mask R-CNN architecture, training samples are fed into a ResNet-101 backbone network, convolved, and passed to the Region Proposal Network (RPN), which generates a set of proposed regions possibly containing lung lesions. Anchors corresponding to each region of interest are then passed through a series of feature maps to generate masks outlining COVID-19 lung lesions on the input chest X-ray. Object classes and bounding boxes are computed via a series of fully connected layers. We pose COVID-19 lung lesion segmentation as a binary classification problem between the image background and lung lesions. The final output is a predicted mask corresponding to the input chest X-ray, which can then be overlaid on the input image for clinical use.
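
In the Matterport-style Mask R-CNN codebase that the repository builds on, this binary setup corresponds to a configuration along the following lines. The class below is a sketch: the class count and backbone follow the description above, the batch and step counts follow the defaults in footnote 2, and the detection threshold is an assumed value.

from mrcnn.config import Config

class LesionConfig(Config):
    NAME = "lesion"
    NUM_CLASSES = 1 + 1          # background + COVID-19 lung lesion (binary problem)
    BACKBONE = "resnet101"       # ResNet-101 backbone, as described above
    IMAGES_PER_GPU = 2           # batch size 2 (footnote 2)
    STEPS_PER_EPOCH = 200        # 200 training steps per epoch (footnote 2)
    VALIDATION_STEPS = 50        # 50 validation steps per epoch (footnote 2)
    DETECTION_MIN_CONFIDENCE = 0.9   # assumed confidence threshold for predicted masks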


Mask R-CNN network architecture. Training images are fed into a backbone network, which then passes a network representation of the training samples to the Region Proposal Network (RPN). The RPN scans the top-bottom pathway of the backbone network and proposes regions which may contain objects of interest on the feature map. RPN anchors are then fed into a series of feature maps, allowing for the parallel execution of two operations: 1) the creation of masks; and 2) the generation of object classes and bounding boxes using a series of fully connected layers.

Environment Setup[2]

Our models were trained on a single Tesla P4 GPU provided by Google Colab (16 GB memory). The code is implemented in TensorFlow v1 but is compatible with TensorFlow v2 and can be ported to more recent releases if desired. To install all required dependencies, run the following:

pip install tensorflow==1.15.2 keras==2.1.0 Pillow scikit-image opencv-python numpy glob2 regex matplotlib

Afterwards, set up the Mask R-CNN model:

git clone --quiet https://github.com/rvignav/Mask_RCNN.git
cd ~/Mask_RCNN
pip install -q PyDrive
python setup.py install
cp ~/Mask_RCNN/samples/balloon/balloon.py ./lesion.py
sed -i -- 's/balloon/lesion/g' lesion.py
sed -i -- 's/Balloon/Lesion/g' lesion.py

Training

The following commands can be used to train the Mask R-CNN model:

# Train a new model starting from pre-trained ImageNet weights
python lesion.py train --dataset='/path/to/data/' --weights=imagenet
    
# Train a new model starting from pre-trained COCO weights
python lesion.py train --dataset='/path/to/data/' --weights=coco
    
# Continue training a model that you had trained earlier
python lesion.py train --dataset='/path/to/data/' --weights=/path/to/weights/
    
# Continue training the last model you trained. This will find
# the last trained weights in the model directory.
python lesion.py train --dataset='/path/to/data/' --weights=last
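
Under the hood, lesion.py drives the Matterport-style Mask R-CNN training API. The sketch below shows the rough shape of that call; the helper signature and the layers="heads" choice are assumptions, and dataset loading is omitted.

import mrcnn.model as modellib

def train(config, dataset_train, dataset_val, weights_path, log_dir):
    model = modellib.MaskRCNN(mode="training", config=config, model_dir=log_dir)
    model.load_weights(weights_path, by_name=True)
    # 30 epochs at learning rate 0.001, matching the defaults listed in footnote 2
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers="heads")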

To train with data augmentation[3], run:

python lesion.py train --dataset='/path/to/data/' --weights=imagenet/coco/last --aug='y'
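
The augmentation settings listed in footnote 3 can be expressed as an imgaug pipeline, the format Matterport's Mask R-CNN accepts through its augmentation argument; the specific augmenters and numeric ranges below are illustrative assumptions consistent with that footnote.

import imgaug.augmenters as iaa

augmentation = iaa.Sequential([
    iaa.Fliplr(0.5),                                                 # horizontal flip with p = 0.5
    iaa.Crop(percent=(0, 0.10)),                                     # 0-10% random crop
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0, 0.5))),            # small Gaussian blur
    iaa.LinearContrast((0.75, 1.25)),                                # contrast normalization (range assumed)
    iaa.Sometimes(0.2, iaa.Multiply((0.8, 1.2), per_channel=True)),  # per-channel pixel multiplication
    iaa.Affine(scale=(0.9, 1.1), rotate=(-10, 10), shear=(-5, 5)),   # scale, rotate, shear (ranges assumed)
])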

The CT to X-ray re-projection algorithm can be executed in isolation as follows:

python ct2xray.py <path/to/CT/volume> <path/to/mask/volume>

Pre-trained Models

Training dataset    Train/test split    Data augmentation (y/n)    Checkpoint
X-rays only         60/40               y                          Download
Mixed               60/40               y                          Download
X-rays only         80/20               y                          Download
Mixed               80/20               y                          Download
X-rays only         80/20               n                          Download
Mixed               80/20               n                          Download
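
A short inference sketch using one of these checkpoints with the Matterport-style Mask R-CNN API follows; the checkpoint filename, test image name, and config values are assumptions.

from mrcnn.config import Config
import mrcnn.model as modellib
import skimage.io, skimage.color

class InferenceConfig(Config):
    NAME = "lesion"
    NUM_CLASSES = 1 + 1      # background + lung lesion
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs")
model.load_weights("mask_rcnn_lesion.h5", by_name=True)   # hypothetical downloaded checkpoint

image = skimage.io.imread("example_cxr.png")              # hypothetical test chest X-ray
if image.ndim == 2:
    image = skimage.color.gray2rgb(image)                 # the model expects 3-channel input
results = model.detect([image], verbose=0)
masks = results[0]["masks"]                               # H x W x N boolean lesion masks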

Results


Our model's results far exceed the few existing published baselines. For instance, Tang, Sun, and Li's U-Net segmentation model (the only published COVID-19 lung lesion segmentation framework with publicly available model schematics), when implemented and trained on Datasets 1 and 2, achieved Intersection over Union (IoU) scores of 0.38 ± 0.03 and 0.49 ± 0.03, respectively, both of which are significantly lower than our model's corresponding IoU scores of 0.81 ± 0.03 and 0.79 ± 0.03. Since we trained and tested our model and the baseline model on the same datasets, our Mask R-CNN likely outperformed Tang, Sun, and Li's U-Net architecture because of its structure as a series of recurring feature maps rather than contracting and expansive paths, the presence of the RPN, and its greater complexity in the form of a ResNet-101 backbone rather than a ResNet-18 backbone.
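
For reference, the IoU metric used in these comparisons can be computed directly from binary masks; a minimal sketch:

import numpy as np

def iou(pred_mask, gt_mask):
    # Intersection over Union between two binary masks of the same shape
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0   # both masks empty: treat as a perfect match
    return np.logical_and(pred_mask, gt_mask).sum() / union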

Furthermore, the fact that the model achieved similar results after training on both Datasets 1 and 2 indicates that we can replace more than 83% of chest X-ray training images with X-ray projections generated from CTs while maintaining model accuracy.

Representative results are shown in the figures below.


Left: Ground truth and predicted masks. Sample of five chest X-rays from the test dataset (X-rays are the same within individual columns). The top row (white) displays the ground truth masks, the middle row (green) contains the masks predicted by the model after training on Dataset 1, and the bottom row (blue) contains the masks predicted by the model after training on Dataset 2.

Right: Overlaid segmentations. The top row displays the original X-rays, the middle row contains the predicted segmentations overlaid on the X-rays after training on Dataset 1, and the bottom row contains the predicted segmentations overlaid on the X-rays after training on Dataset 2.

What next?

A limitation of our study is that we used small amounts of publicly available data; however, our results still suggest that improved accuracy can be obtained by augmenting chest X-ray data with large numbers of frontal projections of public CT volumes. Training and testing our model on larger datasets could improve future results.




Footnotes

  1. Ray tracing: a rendering technique that generates an image by tracing the path that light takes through each pixel of an image plane ↩︎

  2. Default Setting: We trained for $30$ epochs, each with $200$ training steps and $50$ validation steps, at batch size $2$. We used ResNet-101 as the base encoder network with backbone strides of $4$, $8$, $16$, $32$, and $64$ and a top-down pyramid size of $256$. For the losses, we used RPN class loss, RPN bounding box loss, Mask R-CNN class loss, Mask R-CNN bounding box loss, and Mask R-CNN mask loss. We used a gradient clip norm of $5.0$, an image shape of $1024 \times 1024 \times 3$, a learning momentum of $0.9$, a learning rate of $0.001$, a weight decay of $0.0001$, and a mask shape of $28 \times 28$. For the RPN specifications, we used anchor ratios of $0.5$, $1$, and $2$; anchor scales of $32$, $64$, $128$, $256$, and $512$; an anchor stride of $1$; a ROI positive ratio of $0.33$; bounding box standard deviations of $0.1$ and $0.2$; and $200$ training ROIs per image. The maximum number of ground truth instances is $15$. ↩︎

  3. The following augmentation techniques were applied to the training samples, with application probabilities given in parentheses: horizontal flip ($0.5$); $0$-$10$% random crop ($1.0$); small Gaussian blur with randomly chosen $\sigma \in [0,0.5]$ ($0.5$); contrast normalization; per-channel pixel multiplication with $\delta \in [0.8,1.2]$ ($0.2$); and affine transformations comprising scale, rotation, and shear ($1.0$). ↩︎