Machine Learning by Tutorials

Second Edition · iOS 13 · Swift 5.1 · Xcode 11

8. Advanced Convolutional Neural Networks
Written by Matthijs Hollemans

SqueezeNet

What you did in the previous chapter is very similar to what Create ML and Turi Create do when they train models, except the convnet they use is a little more advanced. Turi Create actually gives you a choice between different convnets:

  • SqueezeNet v1.1
  • ResNet50
  • VisionFeaturePrint_Scene

In this section, you’ll take a quick look at the architecture of SqueezeNet and how it is different from the simple convnet you made. ResNet50 is a model that is used a lot in deep learning, but, at over 25 million parameters, it’s on the big side for use on mobile devices and so we’ll pay it no further attention.

We’d love to show you the architecture for VisionFeaturePrint_Scene, but, alas, this model is built into iOS itself and so we don’t know what it actually looks like.

This is SqueezeNet, zoomed out:

The architecture of SqueezeNet

SqueezeNet uses the now-familiar Conv2D and MaxPooling2D layers, as well as the ReLU activation. However, it also has a branching structure that looks like this:

The fire module

This combination of several different layers is called a fire module, because no one reads your research papers unless you come up with a cool name for your inventions. SqueezeNet is simply a whole bunch of these fire modules stacked together.

In SqueezeNet, most of the convolution layers do not use 3×3 windows but windows consisting of a single pixel, also called 1×1 convolution. Such convolution filters only look at a single pixel at a time and not at any of that pixel’s neighbors. The math is just a regular dot product across the channels for that pixel.
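To make the "regular dot product across the channels" concrete, here is a minimal NumPy sketch. The function name `conv1x1` and the tensor sizes are made up for illustration, and this is not how Keras implements `Conv2D` internally; it only shows the arithmetic for a single image.

```python
import numpy as np

def conv1x1(x, weights, bias):
    """Apply a 1x1 convolution to a (height, width, in_channels) tensor.

    weights has shape (in_channels, out_channels);
    bias has shape (out_channels,).
    """
    # For every pixel, take the dot product of its channel vector with
    # each filter. No neighboring pixels are involved.
    return x @ weights + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 64))   # hypothetical 4x4 feature map with 64 channels
w = rng.normal(size=(64, 16))     # 16 filters, each with 64 weights
b = np.zeros(16)

y = conv1x1(x, w, b)
print(y.shape)                    # (4, 4, 16)
```

The width and height stay the same; only the number of channels changes, here from 64 to 16.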

Convolutions with a 1×1 kernel size are very common in modern convnets. They’re often used to increase or to decrease the number of channels in a tensor. That’s exactly why SqueezeNet uses them, too.

The squeeze part of the fire module is a 1×1 convolution whose main job is to reduce the number of channels. For example, the very first layer in SqueezeNet is a regular 3×3 convolution with 64 filters. The squeeze layer that follows it reduces those 64 channels to 16. What such a layer learns isn’t necessarily to detect patterns in the data, but how to keep only the most important patterns. This forces the model to focus on learning only things that truly matter.

The output from the squeeze convolution branches into two parallel convolutions, one with a 1×1 window size and the other with a 3×3 window. Both convolutions have 64 filters, which is why this is called the expand portion of the fire module, as these layers increase the number of channels again. Afterwards, the output tensors from these two parallel convolution layers are concatenated into one big tensor that has 128 channels.

The squeeze layer from the next fire module then reduces those 128 channels again to 16 channels, and so on. As is usual for convnets, the number of channels gradually increases the further you go into the network, but this pattern of reduce-and-expand repeats several times over.
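As a rough check on why squeezing first is cheap, the sketch below counts the learnable parameters of the first fire module described above (64 channels in, squeeze to 16, expand to 64 + 64). The helper `conv_params` is a hypothetical name, and the comparison with a single plain 3×3 convolution is an added illustration, not something SqueezeNet itself contains.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights plus biases in a standard convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels

# Fire module: 64 channels in, squeeze to 16, expand to 64 + 64
squeeze = conv_params(1, 64, 16)
expand_1x1 = conv_params(1, 16, 64)
expand_3x3 = conv_params(3, 16, 64)
fire_total = squeeze + expand_1x1 + expand_3x3

# For comparison: one plain 3x3 convolution going from 64 to 128 channels
plain_total = conv_params(3, 64, 128)

print(fire_total)    # 11408
print(plain_total)   # 73856
```

Both variants turn 64 channels into 128, but under these assumptions the fire module needs roughly six times fewer parameters.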

The reason for using two parallel convolutions on the same data is that using a mix of different transformations potentially lets you extract more interesting information. You see similar ideas in the Inception modules from Google’s famous Inception-v3 model, which combines 1×1, 3×3, and 5×5 convolutions, and even pooling, into the same kind of parallel structure.

The fire module is very effective, evidenced by the fact that SqueezeNet is a powerful model — especially for one that only has 1.2 million learnable parameters. It scores about 67% correct on the snacks dataset, compared to 40% from the basic convnet of the previous section, which has about the same number of parameters.

If you’re curious, you can see a Keras version of SqueezeNet in the notebook SqueezeNet.ipynb in this chapter’s resources. This notebook reproduces the results from Turi Create with Keras. We’re not going to explain that code in detail here since you’ll shortly be using an architecture that gives better results than SqueezeNet. However, feel free to play with this notebook — it’s fast enough to run on your Mac, no GPU needed for this one.

The Keras functional API

One thing we should mention at this point is the Keras functional API. You’ve seen how to make a model using Sequential, but that is limited to linear pipelines that consist of layers in a row. To code SqueezeNet’s branching structures with Keras, you need to specify your model in a slightly different way.

img_input = Input(shape=input_shape)

x = Conv2D(64, 3, padding='valid')(img_input)
x = Activation('relu')(x)
x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(x)
x = fire_module(x, squeeze=16, expand=64)
x = fire_module(x, squeeze=16, expand=64)
x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(x)
...

model = Model(img_input, x)
...
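return model

x = LayerName(parameters)

x = LayerName(parameters)(x)

model = Model(img_input, x)

from keras.layers import Conv2D, Activation, concatenate

def fire_module(x, squeeze=16, expand=64):
    # Squeeze: a 1×1 convolution that reduces the number of channels
    sq = Conv2D(squeeze, 1, padding='valid')(x)
    sq = Activation('relu')(sq)

    # Expand, left branch: 1×1 convolution
    left = Conv2D(expand, 1, padding='valid')(sq)
    left = Activation('relu')(left)

    # Expand, right branch: 3×3 convolution; 'same' padding keeps the
    # width and height equal to those of the left branch
    right = Conv2D(expand, 3, padding='same')(sq)
    right = Activation('relu')(right)

    # Join the two branches along the channel axis
    return concatenate([left, right])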

MobileNet and data augmentation

The final classification model you’ll be training is based on MobileNet. Just like SqueezeNet, this is an architecture that is optimized for use on mobile devices — hence the name.

import os
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import *
from keras import optimizers, callbacks
import keras.backend as K

%matplotlib inline
import matplotlib.pyplot as plt

image_width = 224
image_height = 224

from keras.applications.mobilenet import MobileNet

base_model = MobileNet(
  input_shape=(image_height, image_width, 3),
  include_top=False,
  weights="imagenet",
  pooling=None)

from keras.utils import plot_model
plot_model(base_model, to_file="mobilenet.png")

MobileNet uses depthwise separable convolutions

Depthwise convolution treats the channels independently
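The caption above says that a depthwise convolution treats the channels independently. As a rough illustration, the NumPy sketch below gives every channel its own 3×3 kernel and never sums across channels. The function name, the 'valid' padding and the tensor sizes are assumptions made for this example; it is not MobileNet's actual implementation.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """x: (height, width, channels); kernels: (3, 3, channels).

    Uses 'valid' padding and stride 1, so the output has shape
    (height - 2, width - 2, channels).
    """
    h, w, c = x.shape
    out = np.zeros((h - 2, w - 2, c))
    for i in range(h - 2):
        for j in range(w - 2):
            window = x[i:i + 3, j:j + 3, :]   # (3, 3, channels)
            # Multiply and sum over the two spatial axes only; the channel
            # axis is never summed, so the channels stay separate.
            out[i, j, :] = np.sum(window * kernels, axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 4))
k = rng.normal(size=(3, 3, 4))
y = depthwise_conv3x3(x, k)

# Changing input channel 0 leaves all the other output channels untouched
x2 = x.copy()
x2[:, :, 0] += 1.0
y2 = depthwise_conv3x3(x2, k)
print(np.allclose(y[:, :, 1:], y2[:, :, 1:]))   # True
```

A regular convolution would sum over the channel axis as well, which is why it mixes information from all the channels.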
Warmbsocu molteditoin spiecb tce wdukruzg idfikebmannwt

Adding the classifier

You’ve placed the MobileNet feature extractor in a variable named base_model. You’ll now create a second model for the classifier, to go on top of that base model:

num_classes = 20

top_model = Sequential()
top_model.add(base_model)
top_model.add(GlobalAveragePooling2D())
top_model.add(Dense(num_classes))
top_model.add(Activation("softmax"))

# Freeze the MobileNet layers so that only the new classifier gets trained
for layer in base_model.layers:
    layer.trainable = False

_______________________________________________________________
Layer (type)                 Output Shape              Param #
===============================================================
mobilenet_1.00_224 (Model)   (None, 7, 7, 1024)        3228864
_______________________________________________________________
global_average_pooling2d_2 ( (None, 1024)              0       
_______________________________________________________________
dense_1 (Dense)              (None, 20)                20500   
_______________________________________________________________
activation_1 (Activation)    (None, 20)                0       
===============================================================
Total params: 3,249,364
Trainable params: 20,500
Non-trainable params: 3,228,864
_______________________________________________________________

top_model.compile(loss="categorical_crossentropy",
                  optimizer=optimizers.Adam(lr=1e-3),
                  metrics=["accuracy"])
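The parameter counts in the model summary above can be checked by hand. This small sketch only redoes that arithmetic; the variable names are made up for the example.

```python
num_features = 1024    # channels coming out of global average pooling
num_classes = 20

# A Dense layer has one weight per input/output pair, plus one bias per output
dense_params = num_features * num_classes + num_classes

base_params = 3228864  # MobileNet feature extractor, taken from the summary

print(dense_params)                 # 20500
print(base_params + dense_params)   # 3249364
```

With the feature extractor frozen, only the 20,500 parameters of the Dense layer are trainable.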

Data augmentation

We only have about 4,800 images for our 20 categories, which comes to 240 images per category on average. That’s not bad, but these deep learning models work better with more data. More, more, more! Gathering more training images takes a lot of time and effort, which makes it costly, and it is not always a realistic option. However, you can always artificially expand the training set by transforming the images that you do have.
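To give an idea of what "transforming the images" means, here is a minimal NumPy sketch of two such transformations: a horizontal flip and a shift with nearest-edge filling. The function names are hypothetical and this is not how Keras implements its augmentations; it only shows that one image can yield several plausible variants.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror a (height, width, channels) image left to right."""
    return img[:, ::-1, :]

def shift_right(img, pixels):
    """Shift the image to the right, repeating the edge column to fill the gap."""
    if pixels == 0:
        return img.copy()
    shifted = np.empty_like(img)
    shifted[:, pixels:, :] = img[:, :-pixels, :]
    shifted[:, :pixels, :] = img[:, :1, :]   # nearest-edge fill
    return shifted

# A tiny made-up 2x3 "image" with a single channel
img = np.arange(6).reshape(2, 3, 1)

flipped = horizontal_flip(img)
shifted = shift_right(img, 1)

print(flipped[:, :, 0])   # rows become [2 1 0] and [5 4 3]
print(shifted[:, :, 0])   # rows become [0 0 1] and [3 3 4]
```

Each variant still shows the same object, so it keeps the same label while looking like a new training example to the model.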

Whatta guy!

Whatta guys!

from keras.applications.mobilenet import preprocess_input

train_datagen = ImageDataGenerator(
                    rotation_range=40,
                    width_shift_range=0.2,
                    height_shift_range=0.2,
                    shear_range=0.2,
                    zoom_range=0.2,
                    channel_shift_range=0.2,
                    horizontal_flip=True,
                    fill_mode="nearest",
                    preprocessing_function=preprocess_input)

val_datagen = ImageDataGenerator(
                    preprocessing_function=preprocess_input)

test_datagen = ImageDataGenerator(
                    preprocessing_function=preprocess_input)

images_dir = "snacks/"
train_data_dir = images_dir + "train/"
val_data_dir = images_dir + "val/"
test_data_dir = images_dir + "test/"

batch_size = 64

# Keras expects target_size as (height, width)
train_generator = train_datagen.flow_from_directory(
                    train_data_dir,
                    target_size=(image_height, image_width),
                    batch_size=batch_size,
                    class_mode="categorical",
                    shuffle=True)

val_generator = val_datagen.flow_from_directory(
                    val_data_dir,
                    target_size=(image_height, image_width),
                    batch_size=batch_size,
                    class_mode="categorical",
                    shuffle=False)

test_generator = test_datagen.flow_from_directory(
                    test_data_dir,
                    target_size=(image_height, image_width),
                    batch_size=batch_size,
                    class_mode="categorical",
                    shuffle=False)

Training the classifier layer

Training this model is no different than what you’ve done before: you can run model.fit_generator() a few times until you’re happy with the validation accuracy.

checkpoint_dir = "checkpoints/"
checkpoint_name = (checkpoint_dir
    + "multisnacks-{val_loss:.4f}-{val_acc:.4f}.hdf5")

if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)

def create_callbacks():
    return [
        callbacks.EarlyStopping(
            monitor="val_acc",
            patience=10,
            verbose=1),
        callbacks.ModelCheckpoint(
            checkpoint_name,
            monitor="val_acc",
            verbose=1,
            save_best_only=True),
    ]

my_callbacks = create_callbacks()

histories = []

histories.append(top_model.fit_generator(
  train_generator,
  steps_per_epoch=len(train_generator),
  epochs=10,
  callbacks=my_callbacks,
  validation_data=val_generator,
  validation_steps=len(val_generator),
  workers=8))

def combine_histories():
    history = {
        "loss": [],
        "val_loss": [],
        "acc": [],
        "val_acc": []
    }

    for h in histories:
        for k in history.keys():
            history[k] += h.history[k]
    return history

history = combine_histories()

def plot_accuracy(history):
    fig = plt.figure(figsize=(10, 6))
    plt.plot(history["acc"])
    plt.plot(history["val_acc"])
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend(["Train", "Validation"])
    plt.show()

plot_accuracy(history)

MobileNet accuracy for the first ten epochs

Epoch 00010: val_acc did not improve from 0.70262
Epoch 00009: val_acc improved from 0.69215 to 0.70262, saving model to
checkpoints/multisnacks-1.0450-0.7026.hdf5
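The log lines above come from the two callbacks. The plain-Python sketch below imitates the bookkeeping they do: remember the best validation accuracy seen so far, and stop once it has not improved for `patience` epochs in a row. The function name and the accuracy values are made up, and this is a simplified model of the idea rather than Keras's own code.

```python
def simulate_callbacks(val_accuracies, patience):
    """Return (best_epoch, stopped_epoch), using 1-based epoch numbers.

    stopped_epoch is None if training ran through all the epochs.
    """
    best_acc = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # This is the point where a checkpoint would be saved
            best_acc = acc
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up validation accuracies, for illustration only
accs = [0.60, 0.65, 0.69, 0.70, 0.70, 0.69, 0.68]
print(simulate_callbacks(accs, patience=3))    # (4, 7)
print(simulate_callbacks(accs, patience=10))   # (4, None)
```

With a patience of 10, as in `create_callbacks()`, a short run of ten epochs will not be cut off early unless the very first epoch happens to be the best one.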

Fine-tuning the feature extractor

At this point, it’s a good idea to start fine-tuning the feature extractor. So far, you’ve been using the pre-trained MobileNet as the feature extractor. This was trained on the ImageNet dataset, which contains a large variety of photos of 1,000 different kinds of objects.

for layer in base_model.layers:
    layer.trainable = True

top_model.compile(loss="categorical_crossentropy",
                  optimizer=optimizers.Adam(lr=1e-4),
                  metrics=["accuracy"])

# Divide the current learning rate by 3
K.set_value(top_model.optimizer.lr,
            K.get_value(top_model.optimizer.lr) / 3)

The loss curves (top) and accuracy curves (bottom)

top_model.evaluate_generator(test_generator,
    steps=len(test_generator))

Regularization and dropout

So you’ve got a model with a pretty decent score already, but notice in the above plots that there is a big gap between the training loss and validation loss. Also, the training accuracy keeps increasing — reaching almost 100% — while the validation accuracy flattens out and stops improving.

from keras import regularizers

top_model = Sequential()
top_model.add(base_model)
top_model.add(GlobalAveragePooling2D())
top_model.add(Dropout(0.5))              # this line is new
top_model.add(Dense(num_classes,
              kernel_regularizer=regularizers.l2(0.001))) # new
top_model.add(Activation("softmax"))
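For intuition about what the two new pieces do, here is a small NumPy sketch of inverted dropout and of an L2 penalty term. The function names and the example weights are made up; the real Keras layers also handle details such as switching dropout off at inference time.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout, as applied at training time only."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob   # True for the units that are kept
    return x * mask / keep_prob              # rescale so the expected sum stays the same

def l2_penalty(weights, factor):
    """Extra loss term that grows with the size of the weights."""
    return factor * np.sum(weights ** 2)

rng = np.random.default_rng(0)
features = np.ones(1024)
dropped = dropout(features, rate=0.5, rng=rng)
print(np.unique(dropped))                 # [0. 2.]

weights = np.full((1024, 20), 0.01)       # hypothetical Dense layer weights
print(l2_penalty(weights, 0.001))
```

With a rate of 0.5, about half of the 1,024 features are zeroed on every training step, so the classifier cannot rely on any single feature. The L2 term is added to the loss, so large weights make the loss worse and the optimizer is pushed towards smaller ones.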

Tune those hyperparameters

You’ve seen three different hyperparameters now:

How good is the model really?

The very last training epoch is not necessarily the best — it’s possible the validation accuracy didn’t improve or even got much worse — so in order to evaluate the final model on the test set, let’s load the best model back in first:

from keras.models import load_model
best_model = load_model(checkpoint_dir +
                        "multisnacks-0.7162-0.8419.hdf5")

best_model.evaluate_generator(test_generator,
                              steps=len(test_generator))

test_generator.reset()
probabilities = best_model.predict_generator(test_generator,
                                    steps=len(test_generator))

predicted_labels = np.argmax(probabilities, axis=-1)
target_labels = test_generator.classes

from sklearn import metrics
conf = metrics.confusion_matrix(target_labels, predicted_labels)

import seaborn as sns

def plot_confusion_matrix(conf, labels, figsize=(8, 8)):
    fig = plt.figure(figsize=figsize)
    heatmap = sns.heatmap(conf, annot=True, fmt="d")
    heatmap.xaxis.set_ticklabels(labels, rotation=45,
                                 ha="right", fontsize=12)
    heatmap.yaxis.set_ticklabels(labels, rotation=0,
                                 ha="right", fontsize=12)
    plt.xlabel("Predicted label", fontsize=12)
    plt.ylabel("True label", fontsize=12)
    plt.show()

# Find the class names that correspond to the indices
labels = [""] * num_classes
for k, v in test_generator.class_indices.items():
    labels[v] = k

plot_confusion_matrix(conf, labels, figsize=(14, 14))

The confusion matrix for the MobileNet model
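If it is unclear how the numbers in a confusion matrix come about, the plain-Python sketch below builds one by hand for a tiny made-up example with three classes. The function name is hypothetical; it follows the same layout as the plot above, with true labels as rows and predicted labels as columns.

```python
def toy_confusion_matrix(targets, predictions, num_classes):
    """Rows are true classes, columns are predicted classes."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(targets, predictions):
        conf[t][p] += 1
    return conf

# Made-up labels for 7 images and 3 classes
targets     = [0, 0, 1, 1, 2, 2, 2]
predictions = [0, 1, 1, 1, 2, 0, 2]

for row in toy_confusion_matrix(targets, predictions, 3):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 2]
```

Correct predictions end up on the diagonal. Everything off the diagonal is a mistake, and its position tells you which class was confused with which.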

Precision, recall, F1-score

It’s also useful to make a precision-recall report:

print(metrics.classification_report(target_labels,
                     predicted_labels, target_names=labels))

              precision    recall  f1-score   support

       apple       0.95      0.80      0.87        50
      banana       0.91      0.96      0.93        50
        cake       0.70      0.76      0.73        50
       candy       0.90      0.88      0.89        50
      carrot       0.92      0.88      0.90        50
      cookie       0.81      0.78      0.80        50
    doughnut       0.88      0.90      0.89        50
       grape       0.94      0.96      0.95        50
     hot dog       0.90      0.88      0.89        50
   ice cream       0.88      0.74      0.80        50
       juice       0.94      0.96      0.95        50
      muffin       0.85      0.83      0.84        48
      orange       0.85      0.82      0.84        50
   pineapple       0.71      0.88      0.79        40
     popcorn       0.85      0.85      0.85        40
     pretzel       0.79      0.88      0.83        25
       salad       0.81      0.94      0.87        50
  strawberry       0.93      0.80      0.86        49
      waffle       0.94      0.90      0.92        50
  watermelon       0.81      0.88      0.85        50

    accuracy                           0.86       952
   macro avg       0.86      0.86      0.86       952
weighted avg       0.87      0.86      0.86       952

# Get the class index for pineapple
idx = test_generator.class_indices["pineapple"]

# Find how many images were predicted to be pineapple
total_predicted = np.sum(predicted_labels == idx)

# Find how many images really are pineapple (true positives)
correct = conf[idx, idx]

# The precision is then the true positives divided by
# the true + false positives
precision = correct / total_predicted
print(precision)

# Get the class index for ice cream
idx = test_generator.class_indices["ice cream"]

# Find how many images are supposed to be ice cream
total_expected = np.sum(target_labels == idx)

# How many ice cream images did we find?
correct = conf[idx, idx]

# The recall is then the true positives divided by
# the true positives + false negatives
recall = correct / total_expected
print(recall)
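The report also lists an F1-score, which combines precision and recall into one number by taking their harmonic mean. The sketch below recomputes it from the rounded precision and recall values printed in the report; the helper name `f1_score` is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed in the report (already rounded)
print(round(f1_score(0.71, 0.88), 2))   # pineapple: 0.79
print(round(f1_score(0.88, 0.74), 2))   # ice cream: 0.8
```

Because it is a harmonic mean, the F1-score is pulled towards the lower of the two values, so a class only scores well if both precision and recall are good.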

What are the worst predictions?

The confusion matrix and precision-recall report can already give hints about things you can do to improve the model. There are other useful things you can do. You’ve already seen that the cake category is the worst overall. It can also be enlightening to look at images that were predicted wrongly but that have very high confidence scores. These are the “most wrong” predictions. Why is the model so confident, yet so wrong about these images?

# Find for which images the predicted class is wrong
wrong_images = np.where(predicted_labels != target_labels)[0]

# For every prediction, find the largest probability value;
# this is the probability of the winning class for this image
probs_max = np.max(probabilities, axis=-1)

# Sort the probabilities from the wrong images from low to high
idx = np.argsort(probs_max[wrong_images])

# Reverse the order (high to low), and keep the 5 highest ones
idx = idx[::-1][:5]

# Get the indices of the images with the worst predictions
worst_predictions = wrong_images[idx]

index2class = {v:k for k,v in test_generator.class_indices.items()}

for i in worst_predictions:
    print("%s was predicted as '%s' %.4f" % (
        test_generator.filenames[i],
        index2class[predicted_labels[i]],
        probs_max[i]
    ))

strawberry/09d140146c09b309.jpg was predicted as 'salad' 0.9999
apple/671292276d92cee4.jpg was predicted as 'pineapple' 0.9907
muffin/3b25998aac3f7ab4.jpg was predicted as 'cake' 0.9899
pineapple/0eebf86343d79a23.jpg was predicted as 'banana' 0.9897
cake/bc41ce28fc883cd5.jpg was predicted as 'waffle' 0.9885

from keras.preprocessing import image
img = image.load_img(test_data_dir +
        test_generator.filenames[worst_predictions[0]])
plt.imshow(img)

The worst prediction... or is it?

A note on imbalanced classes

There is much more to say about image classifiers than we have room for in this book. One topic that comes up a lot is how to deal with imbalanced data.

Converting to Core ML

When you write model.save("name.h5") or use the ModelCheckpoint callback, Keras saves the model in its own format, HDF5. In order to use this model from Core ML, you have to convert it to a .mlmodel file first. For this, you’ll need to use the coremltools Python package.

pip install -U coremltools

import coremltools

labels = ["apple", "banana", "cake", "candy", "carrot",
          "cookie", "doughnut", "grape", "hot dog",
          "ice cream", "juice", "muffin", "orange",
          "pineapple", "popcorn", "pretzel", "salad",
          "strawberry", "waffle", "watermelon"]

coreml_model = coremltools.converters.keras.convert(
    best_model,
    input_names="image",
    image_input_names="image",
    output_names="labelProbability",
    predicted_feature_name="label",
    red_bias=-1,
    green_bias=-1,
    blue_bias=-1,
    image_scale=2/255.0,
    class_labels=labels)

coreml_model.author = "Your Name Here"
coreml_model.license = "Public Domain"
coreml_model.short_description = "Image classifier for 20 different types of snacks"

coreml_model.input_description["image"] = "Input image"
coreml_model.output_description["labelProbability"] = "Prediction probabilities"
coreml_model.output_description["label"] = "Class label of top prediction"

coreml_model.save("MultiSnacks.mlmodel")

Your very own Core ML model
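The `image_scale` and bias values in the conversion call are there so that Core ML normalizes pixels the same way as during training. The sketch below assumes that each pixel is first multiplied by `image_scale` and that the bias is then added, and shows what happens to a few pixel values. The function name `scale_pixel` is made up for this example.

```python
def scale_pixel(value, image_scale=2 / 255.0, bias=-1.0):
    """Map a pixel value from the range [0, 255] to the range [-1, 1]."""
    return value * image_scale + bias

# Darkest, middle and brightest pixel values
for v in (0, 127.5, 255):
    print(v, round(scale_pixel(v), 6))
```

Under that assumption, black maps to -1, mid-gray to roughly 0 and white to 1, which is the input range that MobileNet's `preprocess_input` produces in Keras.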

Challenges

Challenge 1: Train using MobileNet

Train the binary classifier using MobileNet and see how the score compares to the Turi Create model. The easiest way to do this is to copy all the images for the healthy categories into a folder called healthy and all the unhealthy images into a folder called unhealthy. (Or maybe you could train a “foods I don’t like” vs. “foods I like” classifier.)

Challenge 2: Add more layers

Try adding more layers to the top model. You could add a Conv2D layer, like so:

top_model.add(Conv2D(num_filters, 3, padding="same"))
top_model.add(BatchNormalization())
top_model.add(Activation("relu"))

Tip: To add a Conv2D layer after the GlobalAveragePooling2D layer, you have to add a Reshape layer in between, because global pooling turns the tensor into a vector, while Conv2D layers want a tensor with three dimensions.

top_model.add(GlobalAveragePooling2D())
top_model.add(Reshape((1, 1, 1024)))
top_model.add(Conv2D(...))

Challenge 3: Experiment with optimizers

In this chapter and the last you’ve used the Adam optimizer, but Keras offers a selection of different optimizers. Adam generally gives good results and is fast, but you may want to play with some of the other optimizers, such as RMSprop and SGD. You’ll need to experiment with what learning rates work well for these optimizers.

Challenge 4: Train using MobileNetV2

There is a version 2 of MobileNet, also available in Keras. MobileNet V2 is smaller and more powerful than V1. Just like ResNet50, it uses so-called residual connections, an advanced way to connect different layers together. Try training the classifier using MobileNetV2 from the keras.applications.mobilenetv2 module.

Challenge 5: Train MobileNet from scratch

Try training MobileNet from scratch on the snacks dataset. You’ve seen that transfer learning and fine-tuning work very well, but only because MobileNet has been pre-trained on a large dataset of millions of photos. To create an “empty” MobileNet, use weights=None instead of weights="imagenet". You’ll find that it’s actually quite difficult to train a large neural network from scratch on such a small dataset. See whether you can get this model to learn anything, and, if so, what sort of accuracy it achieves on the test set.

Challenge 6: Fully train the model

Once you’ve established a set of hyperparameters that works well for your machine learning task, it’s smart to combine the training set and validation set into one big dataset and train the model on the full thing. You don’t really need the validation set anymore at this point — you already know that this combination of hyperparameters will work well — and so you might as well train on these images too. After all, every extra bit of training data helps! Try it out and see how well the model scores on the test set now. (Of course, you still shouldn’t train on the test data.)

Key points

© 2025 Kodeco Inc.
