Training MNIST using MXNet and Keras on Amazon EKS

This document explains how to build an MNIST model using MXNet and Keras on Amazon EKS.

This document assumes that you have a GPU-enabled Amazon EKS cluster available and running.

MNIST Training using MXNet on EKS

In this sample, we'll use the MNIST database of handwritten digits to train a model that recognizes handwritten digits.

You can use a pre-built Docker image, rgaut/deeplearning-mxnet:with_mnist_cnn_gpu. This image uses 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04 as the base image, so it comes bundled with MXNet. It also contains the training code, which downloads the training and test data sets.

Alternatively, you can build a Docker image using the Dockerfile in samples/mnist/training/mxnet/Dockerfile.
```
docker image build samples/mnist/training/mxnet/ -t <tag_for_image>
```
This creates a Docker image with all the utilities needed to run the MNIST training.

Create a pod that uses this Docker image and runs the MNIST training:
```
kubectl create -f samples/mnist/training/mxnet/mxnet.yaml
```
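The contents of mxnet.yaml are not reproduced in this document. As a rough illustration of what such a pod definition might contain, the sketch below builds a manifest as a Python dictionary and prints it as JSON, which kubectl also accepts. The pod name, label, and image come from this document; the restart policy, the GPU resource limit, and the helper name `build_mnist_pod` are assumptions made for the example and may differ from the actual file.

```python
import json


def build_mnist_pod(image="rgaut/deeplearning-mxnet:with_mnist_cnn_gpu", gpus=1):
    """Return a Pod manifest (as a dict) for a one-shot MNIST training run."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # Name and label match the kubectl commands used in this document.
        "metadata": {"name": "mxnet-mnist", "labels": {"app": "mxnet"}},
        "spec": {
            # Assumption: the pod runs once and is not restarted after training ends.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "mxnet-mnist",
                    "image": image,
                    # Assumption: GPUs are exposed through the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest_json = json.dumps(build_mnist_pod(), indent=2)
print(manifest_json)
```

If this were saved as a script, its output could be piped to `kubectl create -f -`, which reads the manifest from standard input.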

Check the status of the pod:

kubectl get pods -l app=mxnet
NAME          READY   STATUS      RESTARTS   AGE
mxnet-mnist   0/1     Completed   0          6m
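The status check can also be scripted instead of being repeated by hand. The following is a minimal sketch that polls the pod phase until training finishes. It assumes kubectl is on the PATH and configured for the cluster; the function names, timeout, and polling interval are hypothetical. A pod shown as Completed in the STATUS column reports the phase Succeeded.

```python
import subprocess
import time


def pod_phase(name, run=subprocess.run):
    """Return the pod phase: Pending, Running, Succeeded, Failed, or Unknown."""
    result = run(
        ["kubectl", "get", "pod", name, "-o", "jsonpath={.status.phase}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_pod(name, timeout_s=1800, interval_s=10,
                 run=subprocess.run, sleep=time.sleep):
    """Poll until the pod reaches a terminal phase, then return that phase."""
    waited = 0
    while waited <= timeout_s:
        phase = pod_phase(name, run=run)
        if phase in ("Succeeded", "Failed"):
            return phase
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError("pod %s did not finish within %d seconds" % (name, timeout_s))
```

Calling `wait_for_pod("mxnet-mnist")` blocks until the training pod finishes. The `run` and `sleep` parameters exist only so the function can be exercised without a cluster.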

Check the training progress:

kubectl logs mxnet-mnist

Using MXNet backend
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz

   16384/11490434 [..............................] - ETA: 0s
   24576/11490434 [..............................] - ETA: 35s
   57344/11490434 [..............................] - ETA: 30s
  122880/11490434 [..............................] - ETA: 21s
  303104/11490434 [..............................] - ETA: 11s
  581632/11490434 [>.............................] - ETA: 7s 
 1187840/11490434 [==>...........................] - ETA: 3s
 2375680/11490434 [=====>........................] - ETA: 2s
 3948544/11490434 [=========>....................] - ETA: 1s
 5521408/11490434 [=============>................] - ETA: 0s
 7094272/11490434 [=================>............] - ETA: 0s
 8683520/11490434 [=====================>........] - ETA: 0s
10256384/11490434 [=========================>....] - ETA: 0s
11493376/11490434 [==============================] - 1s 0us/step

11501568/11490434 [==============================] - 1s 0us/step
/usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py:96: UserWarning: MXNet Backend performs best with `channels_first` format. Using `channels_last` will significantly reduce performance due to the Transpose operations. For performance improvement, please use this API`keras.utils.to_channels_first(x_input)`to transform `channels_last` data to `channels_first` format and also please change the `image_data_format` in `keras.json` to `channels_first`.Note: `x_input` is a Numpy tensor or a list of Numpy tensorRefer to: https://github.com/awslabs/keras-apache-mxnet/tree/master/docs/mxnet_backend/performance_guide.md
  train_symbol = func(*args, **kwargs)

. . .

[23:25:30] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12

  128/60000 [..............................] - ETA: 15:12 - loss: 2.3015 - acc: 0.1094
  384/60000 [..............................] - ETA: 5:15 - loss: 2.2646 - acc: 0.1667 
  640/60000 [..............................] - ETA: 3:14 - loss: 2.2128 - acc: 0.2437
  896/60000 [..............................] - ETA: 2:22 - loss: 2.1461 - acc: 0.2824
 1152/60000 [..............................] - ETA: 1:53 - loss: 2.0702 - acc: 0.3229
 1408/60000 [..............................] - ETA: 1:34 - loss: 1.9679 - acc: 0.3629
 1664/60000 [..............................] - ETA: 1:22 - loss: 1.8818 - acc: 0.3930
 1920/60000 [..............................] - ETA: 1:12 - loss: 1.8086 - acc: 0.4104
 2176/60000 [>.............................] - ETA: 1:05 - loss: 1.7239 - acc: 0.4370
. . .
59776/60000 [============================>.] - ETA: 0s - loss: 0.0398 - acc: 0.9882
60000/60000 [==============================] - 14s 232us/step - loss: 0.0398 - acc: 0.9882 - val_loss: 0.0262 - val_acc: 0.9904
Test loss: 0.026189500172245608
Test accuracy: 0.9904
MXNet Backend: Successfully exported the model as MXNet model!
MXNet symbol file -  mnist_cnn-symbol.json
MXNet params file -  mnist_cnn-0000.params

. . .

Model input data_names and data_shapes are: 
data_names :  ['/conv2d_1_input1']
data_shapes :  [DataDesc[/conv2d_1_input1,(128L, 28L, 28L, 1L),float32,NCHW]]

. . .

Note: In the above data_shapes, the first dimension represent the batch_size used for model training. 
You can change the batch_size for binding the module based on your inference batch_size.

Complete detailed logs.
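For automated runs, the final metrics can be pulled out of the log text rather than read by eye. The sketch below assumes the log contains the `Test loss:` and `Test accuracy:` lines in the format shown above; the function name `parse_test_metrics` is hypothetical.

```python
import re


def parse_test_metrics(log_text):
    """Extract the final 'Test loss' and 'Test accuracy' values from a training log."""
    metrics = {}
    for key, label in (("loss", "Test loss"), ("accuracy", "Test accuracy")):
        # Match a whole line such as "Test accuracy: 0.9904".
        pattern = r"^%s:\s*([0-9.eE+-]+)\s*$" % re.escape(label)
        match = re.search(pattern, log_text, re.MULTILINE)
        if match:
            metrics[key] = float(match.group(1))
    return metrics


# Two lines copied from the log output shown in this document.
sample_log = "Test loss: 0.026189500172245608\nTest accuracy: 0.9904\n"
print(parse_test_metrics(sample_log))
```

In practice, the text passed to the function would be the output of `kubectl logs mxnet-mnist`. Metrics that do not appear in the log are left out of the returned dictionary.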

A copy of the model is also saved at samples/mnist/training/mxnet/saved_model.

What happened?

The pod runs the python /tmp/mnist_cnn.py command (specified in the Dockerfile; the script is available at samples/mnist/training/mxnet/mnist_cnn.py). The script does the following:
- Downloads the MNIST training and test data sets from S3.
  - Each set has images and labels that identify each image
- Performs supervised learning
  - Runs 12 epochs using the training data with the specified parameters
  - For each epoch
    - Reads the training data
    - Builds the training model using the specified algorithm
    - Feeds the test data to the model and compares the predictions with the expected output
    - Reports the accuracy, which is expected to improve with each epoch
  - Exports the trained model to the /mnist_model directory on a worker node. The model consists of the mnist_cnn-0000.params and mnist_cnn-symbol.json files. These are needed for inference.
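As the note in the training log says, the batch size used when binding the module can differ from the training batch size of 128. The following is a minimal sketch of loading the two exported files with the MXNet Module API and binding them for single-image inference. The input name and sample shape are copied from the data_shapes line in the log; the helper names, the default batch size of 1, and the CPU context are assumptions made for the example.

```python
# Input name and per-sample shape, copied from the data_shapes line in the training log.
INPUT_NAME = "/conv2d_1_input1"
SAMPLE_SHAPE = (28, 28, 1)


def inference_data_shapes(batch_size=1):
    """Return the data_shapes list for binding, with batch_size as the first dimension."""
    return [(INPUT_NAME, (batch_size,) + SAMPLE_SHAPE)]


def load_for_inference(prefix="mnist_cnn", epoch=0, batch_size=1):
    """Load <prefix>-symbol.json and <prefix>-0000.params and bind them for prediction."""
    # Imported here so that the helper above can be used without MXNet installed.
    import mxnet as mx

    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    mod = mx.mod.Module(
        symbol=sym,
        data_names=[INPUT_NAME],
        label_names=None,
        context=mx.cpu(),  # assumption: CPU inference; use mx.gpu() on a GPU node
    )
    mod.bind(for_training=False, data_shapes=inference_data_shapes(batch_size))
    mod.set_params(arg_params, aux_params, allow_missing=True)
    return mod
```

`load_for_inference()` expects both model files in the current working directory, since `prefix` is a path prefix shared by the symbol and params files.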
