# Automatic Assessment of Infant Sleep Safety Using Semantic Segmentation

### Introduction

NICHD reports that close to 4,000 infants die annually from sleep-related causes in the US. Sudden Unexpected Infant Death (SUID) is the leading cause of mortality in the US among infants aged 1 month to 1 year. SUID is defined as the sudden and unexpected death of an infant in cases where the cause is not obvious before an investigation. After a full investigation, SUID may be classified as Sudden Infant Death Syndrome (SIDS), suffocation, trauma, metabolic disease, or unknown. SIDS is defined as a sudden, unexpected infant death that remains unexplained after a full investigation, including scene investigation, autopsy, and clinical history review.

Recently, machine learning (ML) has become a leading tool for image classification and segmentation and has shown strong results in both engineering and healthcare applications. It allows computational models composed of several processing layers to learn representations of data at several levels of abstraction. ML relies on two main types of techniques: supervised learning, which fits a model to known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or structures intrinsic to the input data.

Our approach relies on an efficient architecture for semantic segmentation that models appearance and shape in order to capture the spatial relationships among different classes. The resulting model can recognize safe and unsafe situations such as “prone position as wrong sleep position for infant” or “hazardous objects near the baby’s sleep environment”.

### Approach

We used images taken by a convenience sample of teen mothers (TMs) who owned a smartphone and had an infant under 4 months old. The images were independently assessed for coder reliability across five domains of infant safe sleep: sleep location, surface, position, presence of soft items, and hazards near the sleep area. The criteria for safe sleep practices and environment are defined in Figure 1. These criteria provide the final class values into which we classify each pixel of the images, and they help determine whether a sleep environment is safe. To do so, each criterion is assigned a specific color. On a copy of every image, we manually label each pixel with the color of the criterion it belongs to.
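The color-coded labeling described above can be turned into per-pixel class ids with a simple lookup. The following is a minimal sketch only: the palette, class names, and the `mask_to_class_ids` helper are hypothetical and are not the colors or code used for the actual annotations.

```python
# Hypothetical palette: these colors and class meanings are illustrative
# assumptions, not the ones used for the Safe Sleep annotations.
PALETTE = {
    (0, 0, 0): 0,      # background / unlabeled
    (0, 255, 0): 1,    # e.g. infant in a safe sleep position
    (255, 0, 0): 2,    # e.g. infant in an unsafe (prone) position
    (255, 255, 0): 3,  # e.g. soft item near the infant
}


def mask_to_class_ids(color_mask, palette=PALETTE, unknown=0):
    """Convert an H x W grid of RGB tuples into an H x W grid of class ids.

    Colors missing from the palette fall back to the `unknown` id.
    """
    return [[palette.get(tuple(pixel), unknown) for pixel in row]
            for row in color_mask]


# A tiny 2 x 3 hand-labeled mask.
color_mask = [
    [(0, 0, 0), (0, 255, 0), (0, 255, 0)],
    [(0, 0, 0), (255, 255, 0), (0, 255, 0)],
]
class_ids = mask_to_class_ids(color_mask)
print(class_ids)
```

The resulting grid of integers is the form of target a pixel-wise classifier is typically trained against.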

### Model’s Architecture

The network architecture is illustrated in Figure 2. The convolutional neural network used here is based on the SegNet architecture for semantic segmentation: an encoder-decoder network followed by a pixel-wise classification layer. This network uses a VGG-style encoder-decoder, where the upsampling in the decoder reuses the max-pooling indices stored by the encoder. The encoder is a collection of convolutional layers designed to extract feature maps for object classification. It contains 6 convolution layers and 3 pooling layers. The decoder has the same number of layers as the encoder and upsamples its input feature maps using the memorized indices from the corresponding encoder feature maps. The last convolution layer of the decoder feeds a softmax classifier.
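To make the effect of the 3 pooling layers concrete, the sketch below traces the spatial size of the feature maps through the encoder and back through the decoder. It assumes a 320 × 240 (width × height) input, as in the training images, and that every pooling step halves both dimensions; `trace_shapes` is a hypothetical helper, not part of the model code.

```python
def trace_shapes(height, width, num_pools=3):
    """Return the spatial size after each encoder pool and decoder unpool."""
    shapes = [("input", height, width)]
    # Encoder: each 2 x 2 pooling with stride 2 halves both dimensions.
    for i in range(1, num_pools + 1):
        height, width = height // 2, width // 2
        shapes.append((f"pool{i}", height, width))
    # Decoder: each unpooling step undoes the matching pooling step.
    for i in range(num_pools, 0, -1):
        height, width = height * 2, width * 2
        shapes.append((f"unpool{i}", height, width))
    return shapes


shapes = trace_shapes(240, 320)
for name, h, w in shapes:
    print(f"{name:8s} {h:4d} x {w:4d}")
```

Under these assumptions the narrowest feature maps are 30 × 40, and the decoder output returns to the input resolution so that every pixel can receive a class label.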

#### Encoder

Every encoder in the network performs convolution with a filter bank to produce a set of feature maps. Batch normalization is then applied, followed by an element-wise rectified linear non-linearity (ReLU), $\max(0, x)$. After that, max pooling with a $2 \times 2$ window and a stride of 2 (non-overlapping windows) is performed, and the resulting output is down-sampled by a factor of 2. Although multiple layers of max pooling provide more translation invariance for robust classification, they also reduce the spatial resolution of the feature maps. Losing boundary detail is detrimental because boundary delineation is essential to segmentation quality. Therefore, the boundary information in the encoder feature maps must be stored before sub-sampling is performed. This is done by storing the location of the maximum feature value in each pooling window.
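The ReLU and index-preserving pooling steps can be illustrated on a single small feature map. This is a plain-Python sketch for clarity, not the trained network's code; the function name and the example values are made up for illustration.

```python
def max_pool_2x2_with_indices(fmap):
    """2 x 2 max pooling with stride 2 on an H x W list of lists (H, W even).

    Returns the pooled map and, for every window, the (row, col) location
    of its maximum in the input map.
    """
    pooled, indices = [], []
    for r in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), 2):
            # Collect the four values of the window with their locations.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, location = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(location)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


# Made-up 4 x 4 feature map (one channel).
fmap = [
    [1, -3, 2, 0],
    [4, 2, -1, 5],
    [0, 1, 7, 2],
    [3, -2, 1, 6],
]
# Element-wise ReLU: max(0, x).
activated = [[max(0, v) for v in row] for row in fmap]
pooled, indices = max_pool_2x2_with_indices(activated)
print(pooled)
print(indices)
```

Only one location per window is stored, which is far cheaper than keeping the full-resolution feature map for the decoder.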

#### Decoder

The decoder stage is composed of a set of up-sampling and convolution layers. Each up-sampling layer in the decoder corresponds to a max-pooling layer in the encoder. These layers upsample the feature maps using the max-pooling indices of their corresponding feature maps in the encoder phase. The upsampled maps are sparse, so they are then convolved with a set of trainable filter banks to produce dense feature maps. Once the feature maps have been restored to the original resolution, they are passed to the softmax classifier to produce the final segmentation.
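As a rough illustration of the decoder steps above, the sketch below places pooled values back at their stored locations and then applies a softmax to the class scores of one pixel. The trainable convolutions between these two steps are omitted, and all names and numbers are illustrative assumptions.

```python
import math


def max_unpool_2x2(pooled, indices, out_height, out_width):
    """Place each pooled value at its stored (row, col); zeros elsewhere."""
    unpooled = [[0] * out_width for _ in range(out_height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            unpooled[r][c] = value
    return unpooled


def softmax(scores):
    """Numerically stable softmax over the class scores of one pixel."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Pooled values and the max locations remembered by the encoder.
pooled = [[4, 5], [3, 7]]
indices = [[(1, 0), (1, 3)], [(3, 0), (2, 2)]]
sparse = max_unpool_2x2(pooled, indices, 4, 4)
for row in sparse:
    print(row)

# Per-pixel classification: pick the class with the highest probability.
probs = softmax([2.0, 0.5, -1.0])
predicted_class = probs.index(max(probs))
print(predicted_class)
```

The unpooled map is mostly zeros, which is why the convolution layers that follow are needed to fill in dense feature maps.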

### Training

We built our own dataset, called the Safe Sleep dataset. This dataset is relatively small, consisting of 486 training and 120 testing color images (day and dusk scenes) at a resolution of 320 × 240. The dataset covers infants in different bed types, positions, clothes, and situations. To increase the dataset size, we used data augmentation techniques such as horizontal flip, vertical flip, 2D random rotation, and brightness alteration. Overall, the model achieved a maximum accuracy of 81%.
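For segmentation, geometric augmentations have to be applied to the image and its label mask together, while brightness changes affect the image only. The sketch below shows this for the flips and brightness alteration mentioned above. It is a simplified illustration: it uses single-channel intensities instead of color images, omits random rotation, and all function names and the brightness factor are assumptions.

```python
def horizontal_flip(grid):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def vertical_flip(grid):
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(grid)]


def adjust_brightness(image, factor):
    """Scale intensities by `factor`, clamped to the range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row]
            for row in image]


def augment(image, labels):
    """Return (image, labels) pairs; flips are applied to both grids."""
    return [
        (image, labels),
        (horizontal_flip(image), horizontal_flip(labels)),
        (vertical_flip(image), vertical_flip(labels)),
        (adjust_brightness(image, 1.2), labels),  # labels are unchanged
    ]


# Tiny 2 x 3 example image with its per-pixel class ids.
image = [[10, 20, 30], [40, 50, 250]]
labels = [[0, 1, 1], [0, 0, 2]]
pairs = augment(image, labels)
for aug_image, aug_labels in pairs:
    print(aug_image, aug_labels)
```

Each original image yields several training pairs this way, which helps when the dataset is as small as the one described here.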

Note that this site is a work in progress.