Understanding UNET

Kirill Bondarenko
6 min read · Jul 2, 2019


How to understand U-Net in the simplest way.

Carvana cars segmentation challenge example

Hello everyone!

In this article I want to explain, in a simple way, one of the most popular model architectures for image segmentation: UNET.

If you haven’t heard of it or seen its architecture, that’s not a problem: I will start with a simple structure and arrive at the traditional UNET by the end. Let’s start.

Content

  1. What task does UNET solve?
  2. MiUNET
  3. Original UNET

What task does UNET solve?

UNET was created for biomedical image segmentation (the original paper segments cells in microscopy images), but nowadays it is used in a much wider range of fields.

For example, suppose your task is to find rectangles in images, no matter what color or shape they are.

Input data

We have one red and one yellow rectangle on a green background. This is the input for the UNET model.

We need to mark the positive regions of the image, where the rectangles are, as 1 and the negative regions as 0 (like binary classification). If we set the red and yellow pixels to 1 and the green region to 0, we get a grayscale image, also called a binary mask or target (a supervised learning term):

Target for model or image binary mask

The UNET model learns to find these white regions in images. It works with any kind of object: cats, humans, cars, buildings, roads, etc.
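To make the idea of a target mask concrete, here is a minimal sketch in plain Python. The 4x4 toy image and the helper name `to_binary_mask` are assumptions made up for illustration; in real datasets masks are usually annotated by hand rather than derived from pixel colors.

```python
# Hypothetical 4x4 RGB image: green background with a red and a yellow block.
GREEN, RED, YELLOW = (0, 255, 0), (255, 0, 0), (255, 255, 0)

image = [
    [GREEN, GREEN, GREEN, GREEN],
    [GREEN, RED,   RED,   GREEN],
    [GREEN, GREEN, GREEN, YELLOW],
    [GREEN, GREEN, GREEN, YELLOW],
]


def to_binary_mask(img, background=GREEN):
    """Return a 2D mask: 1 where the pixel is not background, 0 otherwise."""
    return [[0 if pixel == background else 1 for pixel in row] for row in img]


mask = to_binary_mask(image)
for row in mask:
    print(row)
```

The mask has the same height and width as the image but only one channel, which is exactly the kind of output the model is trained to produce.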

MiUNET

MiUNET, or mini UNET, is a simplified and trimmed version of the original UNET (with a few details changed).

If you want to understand something difficult, the best way is to simplify it and examine every step until you can explain it in the simplest words, even to a child. (Ancient mathematician Kiryusha, 200 B.C.)

First of all, let’s take the data from the previous section and label it with its shape (the input here is a 10x10x3 image).

The simplified architecture consists of four parts: convolution, upsampling, concatenation, and convolution again (but with a different purpose).

Now let’s go through each stage and examine it.

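1. Convolution + MaxPooling

Output size = (W - F + 2P)/S + 1, where W is the input width (the same formula applies to the height), F is the kernel size (Fw x Fh), P is the padding and S is the stride.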

Let’s create this layer with a convolution kernel of 3x3x3

Stride = 1

Padding = 0

And number of filters = 64

MaxPooling uses the same settings: a 3x3 window applied to each of the 64 channels, stride 1, no padding.

Convolution output size = (10 - 3 + 2*0)/1 + 1 = 8

After the convolution we get an 8x8x64 image (tensor).

Applying MaxPooling we get (8 - 3 + 2*0)/1 + 1 = 6, so the image shape becomes 6x6x64.
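The arithmetic above can be checked with a few lines of Python. This is only a sketch of the output-size formula; the function name `conv_output_size` is made up for illustration, and integer division is used because frameworks round down when the stride does not divide evenly.

```python
def conv_output_size(w, f, p=0, s=1):
    """Output width (or height) of a convolution or pooling layer.

    w: input size, f: kernel/window size, p: padding, s: stride.
    """
    return (w - f + 2 * p) // s + 1


after_conv = conv_output_size(10, f=3, p=0, s=1)  # 3x3 convolution
after_pool = conv_output_size(after_conv, f=3)    # 3x3 max pooling, stride 1
print(after_conv, after_pool)  # 8 6
```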

Basically, we did this:

We reduced the width and height, but increased the depth from a 3-channel RGB image to a 64-channel tensor.

If you are not yet comfortable with convolutional layers, don’t worry: work carefully through Andrew Ng’s course on convolutional networks. There is no other way.

2. Upsampling

What is upsampling? We need to increase the width and height of our image from 6x6 to 8x8 (the reason is explained in the next step). There are several upsampling methods; here we use “nearest neighbor”.

The key idea of this method can be shown with a simple figure:

Upsampling NN example

We duplicate pixel values in each channel, without any weights or other complex operations.
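As a rough illustration of nearest-neighbor upsampling, the sketch below resizes a single channel stored as a nested list. The function name `upsample_nearest` and the floor-based index mapping are assumptions for this example; libraries may round source coordinates slightly differently.

```python
def upsample_nearest(channel, out_h, out_w):
    """Resize a 2D list to (out_h, out_w) by copying the nearest source pixel."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


small = [[1, 2],
         [3, 4]]
big = upsample_nearest(small, 4, 4)
for row in big:
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]

# The 6x6 -> 8x8 case from the walkthrough works the same way:
six = [[r * 6 + c for c in range(6)] for r in range(6)]
eight = upsample_nearest(six, 8, 8)
```

No parameters are learned here: every output pixel is a copy of one input pixel.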

3. Concatenation

This is why we upsample to 8, not to 10.

If you look at the picture of the model’s general structure at the beginning of this section, you will see that the 8x8x64 output of the convolution layer is used twice: one copy goes to max pooling and the other goes to the concatenation operation.

Concatenation is done along the third axis (depth):

Here we concatenate the output of the first convolutional layer with the output of the upsampling layer, and as a result we get an 8x8x128 tensor.
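Concatenation along the depth axis can be sketched with nested lists, where each pixel is a list of channel values. The helper names (`make_tensor`, `concat_depth`, `shape`) and the constant fill values are hypothetical and only serve to show how the shapes combine; in practice a tensor library would do this in one call.

```python
def make_tensor(h, w, c, value):
    """Build an h x w x c nested list filled with `value`."""
    return [[[value] * c for _ in range(w)] for _ in range(h)]


def concat_depth(a, b):
    """Concatenate two tensors with equal height and width along the channel axis."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "height and width must match"
    return [[pa + pb for pa, pb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))


skip = make_tensor(8, 8, 64, 0.0)       # output of the first convolution
upsampled = make_tensor(8, 8, 64, 1.0)  # output of the upsampling step
merged = concat_depth(skip, upsampled)
print(shape(merged))  # (8, 8, 128)
```

The height and width must match exactly, which is why the upsampling step targets 8x8: only the depth grows, from 64 to 128.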

4. Convolution

Here we go again. But now we will use convolution to increase the width and height and to reduce the depth.

We apply two convolutional layers here, without max pooling.

First one: F = 3x3x128, P = 2, S = 1, number of filters = 64.

We get output: (8 - 3 + 2*2)/1 + 1 = 10, so 10x10x64.

Second one: F = 3x3x64, P = 1, S = 1, number of filters = 1.

Result: (10 - 3 + 2*1)/1 + 1 = 10, so 10x10x1.
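Putting the four stages together, the following sketch traces the tensor shape through the whole MiUNET walkthrough. It only does shape arithmetic, not real convolutions, and the names `out_size` and `miunet_shapes` are made up for this example.

```python
def out_size(size, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer."""
    return (size - f + 2 * p) // s + 1


def miunet_shapes(size=10):
    """Return (stage, (height, width, depth)) for each step of the walkthrough."""
    shapes = [("input", (size, size, 3))]
    conv = out_size(size, f=3)                     # 3x3 conv, 64 filters, P=0
    shapes.append(("conv1", (conv, conv, 64)))
    pool = out_size(conv, f=3)                     # 3x3 max pooling, S=1
    shapes.append(("maxpool", (pool, pool, 64)))
    shapes.append(("upsample", (conv, conv, 64)))  # resized back to conv1 size
    shapes.append(("concat", (conv, conv, 128)))   # 64 + 64 channels
    c2 = out_size(conv, f=3, p=2)                  # 3x3 conv, 64 filters, P=2
    shapes.append(("conv2", (c2, c2, 64)))
    c3 = out_size(c2, f=3, p=1)                    # 3x3 conv, 1 filter, P=1
    shapes.append(("conv3", (c3, c3, 1)))
    return shapes


for stage, shp in miunet_shapes(10):
    print(f"{stage:9s} {shp}")
```

With these paddings the final mask has the same height and width as the input, for example `miunet_shapes(32)` ends at (32, 32, 1).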

Great, now let’s look at the whole model.

We may see it like this.
Or like this

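In this way we created a network that takes image data and performs segmentation, returning masks with the same height and width as the input.

You can try a similar model in Python using Keras. Note that this version uses 'same' padding, 2x2 pooling and one extra convolution level, so the intermediate shapes differ from the walkthrough above: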

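from keras import Input, Model
from keras.layers import Conv2D, MaxPooling2D, UpSampling2D, concatenate
from keras.optimizers import Adam

# Height and width are left unspecified; both must be even because of the 2x2 pooling.
input_size = (None, None, 3)

# Encoder, level 1: full resolution, 64 feature maps
inputs = Input(shape=input_size)
conv1 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(inputs)
pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)

# Encoder, level 2: half resolution, 128 feature maps
conv2 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='he_normal')(pool1)

# Decoder: upsample back to full resolution and merge with conv1 (skip connection)
ups = UpSampling2D(size=(2, 2))(conv2)
up1 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(ups)
merge1 = concatenate([conv1, up1], axis=3)

# Output head: reduce depth to a single-channel probability mask
conv3 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='he_normal')(merge1)
conv4 = Conv2D(1, 1, activation='sigmoid')(conv3)

# `inputs`/`outputs` replace the old `input`/`output` keyword arguments
model = Model(inputs=inputs, outputs=conv4)

# `learning_rate` replaces the deprecated `lr`; 1e-3 is Adam's default
# (the original value of 0.1 is usually too high for stable training)
model.compile(optimizer=Adam(learning_rate=1e-3), loss='binary_crossentropy')
model.summary()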

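With the input shape left as (None, None, 3), the model accepts images of any height and width (as long as both are even, so that the 2x2 pooling and upsampling return to the same size) and produces a mask with the same height and width.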

Short look at the original UNET

The left part of the model consists of blocks of convolutional layers with ReLU activations, each followed by a MaxPooling layer. In this example the input image has shape 572x572x1; after all the convolutions and poolings, the shape at the bottom is 28x28x1024.

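The right part of the model reduces depth and increases height and width. In the original paper each upsampling step is a 2x2 “up-convolution” that also halves the number of channels, and the matching feature map from the left side is cropped before concatenation. Going from the bottom up: 28x28x1024 → 56x56x1024 (first up-convolution to 512 channels, concatenated with 512 cropped channels from the left side) → 54x54x512 → 52x52x512 (two convolutions that reduce depth and trim W and H a little) → 104x104x512 (second up-convolution and concatenation) → 102x102x256 → 100x100x256 → 200x200x256 → 198x198x128 → 196x196x128 → 392x392x128 → 390x390x64 → 388x388x64 → 388x388x2.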
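The shape sequence of the original architecture can be reproduced with a short shape-only trace. This sketch assumes the conventions of the original paper: unpadded 3x3 convolutions (each removes 2 pixels), 2x2 max pooling, and up-convolutions that double the size and halve the channel count before concatenation with the cropped skip connection. The function name `unet_shapes` and its parameters are hypothetical.

```python
def unet_shapes(size=572, in_channels=1, base=64, depth=4, n_classes=2):
    """Return (stage, (height, width, depth)) for a U-Net with unpadded convolutions."""
    shapes = [("input", (size, size, in_channels))]
    skips = []
    ch = base
    # Contracting path: two 3x3 convs (size - 4 in total), then 2x2 pooling.
    for level in range(1, depth + 1):
        size -= 4
        shapes.append((f"down{level} convs", (size, size, ch)))
        skips.append(ch)
        size //= 2
        shapes.append((f"down{level} pool", (size, size, ch)))
        ch *= 2
    size -= 4
    shapes.append(("bottom convs", (size, size, ch)))
    # Expanding path: up-conv (size * 2, channels / 2), concat with skip, two convs.
    for level, skip_ch in enumerate(reversed(skips), start=1):
        size *= 2
        ch //= 2
        shapes.append((f"up{level} concat", (size, size, ch + skip_ch)))
        size -= 4
        shapes.append((f"up{level} convs", (size, size, ch)))
    shapes.append(("output 1x1 conv", (size, size, n_classes)))
    return shapes


for stage, shp in unet_shapes():
    print(f"{stage:16s} {shp}")
```

With the default arguments the trace goes from 572x572x1 down to 28x28x1024 and back up to 388x388x2, so the output mask is smaller than the input because no padding is used.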

The output of this model has depth 2: one channel per class (in the original paper, foreground and background, followed by a pixel-wise softmax). For our binary segmentation task a single-channel output, 388x388x1, is enough and can be read directly as a binary grayscale mask.
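For two classes, a two-channel softmax output and a single-channel sigmoid output carry the same information: the softmax probability of the foreground class equals the sigmoid of the difference between the two logits. A small sketch with made-up logit values for one pixel:

```python
import math


def softmax2(a, b):
    """Softmax over two logits; returns (p_background, p_foreground)."""
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


bg_logit, fg_logit = 0.3, 1.7  # hypothetical logits for one pixel
p_fg_softmax = softmax2(bg_logit, fg_logit)[1]
p_fg_sigmoid = sigmoid(fg_logit - bg_logit)
print(abs(p_fg_softmax - p_fg_sigmoid) < 1e-12)  # True
```

This is why a single sigmoid channel trained with binary cross-entropy is a common choice for binary masks.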

Now you can see that the UNET architecture isn’t difficult. Trust me, for real-world tasks you will spend much more time preparing correct masks and training data.

I hope this article has helped you understand UNET.

Good luck!

With best wishes, Bondarenko K., machine learning engineer.
