Training MNIST using MXNet and Keras on Amazon EKS

This document explains how to build an MNIST model using MXNet and Keras on Amazon EKS.

This document assumes that you have a GPU-enabled Amazon EKS cluster available and running.

MNIST Training using MXNet on EKS

In this sample, we'll use the MNIST database of handwritten digits to train a model that recognizes handwritten digits.

  1. You can use a pre-built Docker image rgaut/deeplearning-mxnet:with_mnist_cnn_gpu. This image uses as the base image. It comes bundled with MXNet. It also has training code and downloads training and test data sets.

    Alternatively, you can build a Docker image using the Dockerfile in samples/mnist/training/mxnet/Dockerfile.

    docker image build samples/mnist/training/mxnet/ -t <tag_for_image>

    This will create a Docker image that will have all the utilities to run MNIST.

  2. Create a pod that will use this Docker image and run the MNIST training:

    kubectl create -f samples/mnist/training/mxnet/mxnet.yaml

  3. Check status of the pod:

    kubectl get pods -l app=mxnet
    mxnet-mnist   0/1     Completed   0          6m

  4. Check the progress in training:

    kubectl logs mxnet-mnist
    Using MXNet backend
    Downloading data from
       16384/11490434 [..............................] - ETA: 0s
       24576/11490434 [..............................] - ETA: 35s
       57344/11490434 [..............................] - ETA: 30s
      122880/11490434 [..............................] - ETA: 21s
      303104/11490434 [..............................] - ETA: 11s
      581632/11490434 [>.............................] - ETA: 7s 
     1187840/11490434 [==>...........................] - ETA: 3s
     2375680/11490434 [=====>........................] - ETA: 2s
     3948544/11490434 [=========>....................] - ETA: 1s
     5521408/11490434 [=============>................] - ETA: 0s
     7094272/11490434 [=================>............] - ETA: 0s
     8683520/11490434 [=====================>........] - ETA: 0s
    10256384/11490434 [=========================>....] - ETA: 0s
    11493376/11490434 [==============================] - 1s 0us/step
    11501568/11490434 [==============================] - 1s 0us/step
    /usr/local/lib/python2.7/dist-packages/keras/backend/ UserWarning: MXNet Backend performs best with `channels_first` format. Using `channels_last` will significantly reduce performance due to the Transpose operations. For performance improvement, please use this API`keras.utils.to_channels_first(x_input)`to transform `channels_last` data to `channels_first` format and also please change the `image_data_format` in `keras.json` to `channels_first`.Note: `x_input` is a Numpy tensor or a list of Numpy tensorRefer to:
      train_symbol = func(*args, **kwargs)
    . . .
    [23:25:30] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
    x_train shape: (60000, 28, 28, 1)
    60000 train samples
    10000 test samples
    Train on 60000 samples, validate on 10000 samples
    Epoch 1/12
      128/60000 [..............................] - ETA: 15:12 - loss: 2.3015 - acc: 0.1094
      384/60000 [..............................] - ETA: 5:15 - loss: 2.2646 - acc: 0.1667 
      640/60000 [..............................] - ETA: 3:14 - loss: 2.2128 - acc: 0.2437
      896/60000 [..............................] - ETA: 2:22 - loss: 2.1461 - acc: 0.2824
     1152/60000 [..............................] - ETA: 1:53 - loss: 2.0702 - acc: 0.3229
     1408/60000 [..............................] - ETA: 1:34 - loss: 1.9679 - acc: 0.3629
     1664/60000 [..............................] - ETA: 1:22 - loss: 1.8818 - acc: 0.3930
     1920/60000 [..............................] - ETA: 1:12 - loss: 1.8086 - acc: 0.4104
     2176/60000 [>.............................] - ETA: 1:05 - loss: 1.7239 - acc: 0.4370
    . . .
    59776/60000 [============================>.] - ETA: 0s - loss: 0.0398 - acc: 0.9882
    60000/60000 [==============================] - 14s 232us/step - loss: 0.0398 - acc: 0.9882 - val_loss: 0.0262 - val_acc: 0.9904
    Test loss: 0.026189500172245608
    Test accuracy: 0.9904
    MXNet Backend: Successfully exported the model as MXNet model!
    MXNet symbol file -  mnist_cnn-symbol.json
    MXNet params file -  mnist_cnn-0000.params
    . . .
    Model input data_names and data_shapes are: 
    data_names :  ['/conv2d_1_input1']
    data_shapes :  [DataDesc[/conv2d_1_input1,(128L, 28L, 28L, 1L),float32,NCHW]]
    . . .
    Note: In the above data_shapes, the first dimension represent the batch_size used for model training. 
    You can change the batch_size for binding the module based on your inference batch_size.

    Complete detailed logs.

    A copy of the model is also saved at samples/mnist/training/mxnet/saved_model.
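    The training log is long, so it can help to pull out only the per-epoch summary lines and the final test accuracy. The following is a minimal sketch, assuming the log text has the same format as the output shown above; the function name summarize_training_log and the regular expressions are illustrative and not part of the sample.

```python
import re

# Per-epoch summary line printed by Keras when an epoch finishes, for example:
# "60000/60000 [=====] - 14s 232us/step - loss: 0.0398 - acc: 0.9882
#  - val_loss: 0.0262 - val_acc: 0.9904"
# The backreference \1 requires "done/total" to be equal, which skips the
# intermediate progress lines and the data download progress lines.
EPOCH_SUMMARY = re.compile(
    r"^\s*(\d+)/\1\s+\[=+\]\s+-\s+\d+s.*?"
    r"loss:\s*([\d.]+)\s+-\s+acc:\s*([\d.]+)\s+-\s+"
    r"val_loss:\s*([\d.]+)\s+-\s+val_acc:\s*([\d.]+)"
)
TEST_ACCURACY = re.compile(r"^\s*Test accuracy:\s*([\d.]+)")


def summarize_training_log(log_text):
    """Extract epoch summaries and the final test accuracy from log text."""
    epochs = []
    test_accuracy = None
    # splitlines() also splits on carriage returns, which progress bars may emit.
    for line in log_text.splitlines():
        match = EPOCH_SUMMARY.match(line)
        if match:
            epochs.append({
                "samples": int(match.group(1)),
                "loss": float(match.group(2)),
                "acc": float(match.group(3)),
                "val_loss": float(match.group(4)),
                "val_acc": float(match.group(5)),
            })
            continue
        match = TEST_ACCURACY.match(line)
        if match:
            test_accuracy = float(match.group(1))
    return {"epochs": epochs, "test_accuracy": test_accuracy}
```

    To use it, save the output of kubectl logs mxnet-mnist to a file, read the file into a string, and pass that string to summarize_training_log. The result holds one entry per completed epoch plus the value from the "Test accuracy" line, or None if training has not finished yet.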

What happened?

  • Runs the python /tmp/ command (specified in the Dockerfile and available at samples/mnist/training/mxnet/), which:
    • Downloads the MNIST training and test data sets from S3.
      • Each set has images and labels that identify each image
    • Performs supervised learning
      • Runs 12 epochs using the training data with the specified parameters
      • For each epoch:
        • Reads the training data
        • Builds the training model using the specified algorithm
        • Feeds the test data and compares the predictions with the expected output
        • Reports the accuracy, which is expected to improve with each epoch
      • Exports the trained model to the /mnist_model directory on a worker node. The model consists of the mnist_cnn-0000.params and mnist_cnn-symbol.json files. These are needed for inference.
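
The exported files can be loaded back into an MXNet module for inference, and the note in the training log says the batch size can be changed when binding. The following is a minimal sketch of that step, assuming the two files are in the current working directory and that MXNet is installed. The input name and sample shape are copied from the data_names and data_shapes lines in the log; the helper names and the choice of mx.cpu() as the context are assumptions for illustration.

```python
# Input name and per-sample shape, copied from the data_names / data_shapes
# lines in the training log.
INPUT_NAME = "/conv2d_1_input1"
SAMPLE_SHAPE = (28, 28, 1)


def inference_data_shapes(batch_size=1):
    """Return the data_shapes list for binding with an inference batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Only the first dimension (the batch size) differs from training.
    return [(INPUT_NAME, (batch_size,) + SAMPLE_SHAPE)]


def load_mnist_module(prefix="mnist_cnn", epoch=0, batch_size=1):
    """Load mnist_cnn-symbol.json and mnist_cnn-0000.params for inference.

    Assumption: both files are in the current directory. Pass a longer
    prefix (directory plus "mnist_cnn") if they live elsewhere.
    """
    import mxnet as mx  # imported lazily so the helper above works without MXNet

    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    # Assumption: run on CPU. Use mx.gpu(0) to run on the first GPU instead.
    module = mx.mod.Module(
        symbol=sym,
        data_names=[INPUT_NAME],
        label_names=None,
        context=mx.cpu(),
    )
    module.bind(
        for_training=False,
        data_shapes=inference_data_shapes(batch_size),
        label_shapes=None,
    )
    module.set_params(arg_params, aux_params, allow_missing=True)
    return module
```

Calling inference_data_shapes(1) gives the shape for single-image prediction, while the training run used a batch size of 128. Keeping the shape logic in its own helper makes it easy to check without loading the model.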