This document explains how to build an MNIST model using MXNet and Keras on Amazon EKS.
It assumes that you have a GPU-enabled Amazon EKS cluster available and running.
In this sample, we'll use the MNIST database of handwritten digits to train a model that recognizes handwritten digits.
- You can use the pre-built Docker image rgaut/deeplearning-mxnet:with_mnist_cnn_gpu. This image uses 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04 as the base image, which comes bundled with MXNet. It also has the training code and downloads the training and test data sets.

  Alternatively, you can build a Docker image using the Dockerfile in samples/mnist/training/mxnet/Dockerfile:

      docker image build samples/mnist/training/mxnet/ -t <tag_for_image>

  This will create a Docker image that has all the utilities needed to run MNIST.
- Create a pod that will use this Docker image and run the MNIST training:

      kubectl create -f samples/mnist/training/mxnet/mxnet.yaml
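The contents of mxnet.yaml are not reproduced in this document. As a rough illustration of what such a pod definition involves, here is a minimal sketch that builds an equivalent manifest in Python and writes it as JSON, which kubectl accepts alongside YAML. The pod name and the app=mxnet label are taken from the commands in this document; the container name, the restart policy, and the GPU limit are assumptions and may differ from the real file.

```python
import json


def build_pod_manifest(image="rgaut/deeplearning-mxnet:with_mnist_cnn_gpu", gpus=1):
    """Return a Pod manifest as a plain dict.

    Hypothetical sketch: this is NOT the actual content of mxnet.yaml.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "mxnet-mnist",          # pod name used by `kubectl logs` in this document
            "labels": {"app": "mxnet"},     # label used by `kubectl get pods -l app=mxnet`
        },
        "spec": {
            "restartPolicy": "Never",       # assumption: run the training once, then stop
            "containers": [
                {
                    "name": "mxnet",        # assumption: any valid container name works
                    "image": image,
                    # assumption: GPUs are exposed by the NVIDIA device plugin
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def write_manifest(path, manifest):
    """Write the manifest to disk as JSON."""
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2)
```

Calling write_manifest("mxnet-pod.json", build_pod_manifest()) would produce a file that could be passed to kubectl create -f in place of the YAML file; the file name here is only an example.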
- Check the status of the pod:

      kubectl get pods -l app=mxnet
      NAME          READY   STATUS      RESTARTS   AGE
      mxnet-mnist   0/1     Completed   0          6m
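If you want to script this check instead of reading the table by eye, the following sketch parses the text printed by kubectl get pods and reports whether the training pod has finished. It assumes the default column layout shown in this step, with no spaces inside a column value; the helper names are hypothetical.

```python
def parse_pod_table(text):
    """Parse the whitespace-separated table printed by `kubectl get pods`.

    Assumes the default columns and that no value contains spaces.
    """
    lines = [line for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def training_finished(pods, name="mxnet-mnist"):
    """Return True once the named pod reports the Completed status."""
    return any(
        pod.get("NAME") == name and pod.get("STATUS") == "Completed"
        for pod in pods
    )


# Sample input copied from the output shown in this step.
sample = """\
NAME          READY   STATUS      RESTARTS   AGE
mxnet-mnist   0/1     Completed   0          6m
"""
print(training_finished(parse_pod_table(sample)))
```

In practice the text would come from running kubectl get pods -l app=mxnet, for example through Python's subprocess module, rather than from a hard-coded string.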
- Check the progress of the training:

      kubectl logs mxnet-mnist
      Using MXNet backend
      Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz

         16384/11490434 [..............................] - ETA: 0s
         24576/11490434 [..............................] - ETA: 35s
         57344/11490434 [..............................] - ETA: 30s
        122880/11490434 [..............................] - ETA: 21s
        303104/11490434 [..............................] - ETA: 11s
        581632/11490434 [>.............................] - ETA: 7s
       1187840/11490434 [==>...........................] - ETA: 3s
       2375680/11490434 [=====>........................] - ETA: 2s
       3948544/11490434 [=========>....................] - ETA: 1s
       5521408/11490434 [=============>................] - ETA: 0s
       7094272/11490434 [=================>............] - ETA: 0s
       8683520/11490434 [=====================>........] - ETA: 0s
      10256384/11490434 [=========================>....] - ETA: 0s
      11493376/11490434 [==============================] - 1s 0us/step
      11501568/11490434 [==============================] - 1s 0us/step
      /usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py:96: UserWarning: MXNet Backend performs best with `channels_first` format. Using `channels_last` will significantly reduce performance due to the Transpose operations. For performance improvement, please use this API`keras.utils.to_channels_first(x_input)`to transform `channels_last` data to `channels_first` format and also please change the `image_data_format` in `keras.json` to `channels_first`.Note: `x_input` is a Numpy tensor or a list of Numpy tensorRefer to: https://github.com/awslabs/keras-apache-mxnet/tree/master/docs/mxnet_backend/performance_guide.md
        train_symbol = func(*args, **kwargs)

      . . .

      [23:25:30] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
      x_train shape: (60000, 28, 28, 1)
      60000 train samples
      10000 test samples
      Train on 60000 samples, validate on 10000 samples
      Epoch 1/12

        128/60000 [..............................] - ETA: 15:12 - loss: 2.3015 - acc: 0.1094
        384/60000 [..............................] - ETA: 5:15 - loss: 2.2646 - acc: 0.1667
        640/60000 [..............................] - ETA: 3:14 - loss: 2.2128 - acc: 0.2437
        896/60000 [..............................] - ETA: 2:22 - loss: 2.1461 - acc: 0.2824
       1152/60000 [..............................] - ETA: 1:53 - loss: 2.0702 - acc: 0.3229
       1408/60000 [..............................] - ETA: 1:34 - loss: 1.9679 - acc: 0.3629
       1664/60000 [..............................] - ETA: 1:22 - loss: 1.8818 - acc: 0.3930
       1920/60000 [..............................] - ETA: 1:12 - loss: 1.8086 - acc: 0.4104
       2176/60000 [>.............................] - ETA: 1:05 - loss: 1.7239 - acc: 0.4370

      . . .

      59776/60000 [============================>.] - ETA: 0s - loss: 0.0398 - acc: 0.9882
      60000/60000 [==============================] - 14s 232us/step - loss: 0.0398 - acc: 0.9882 - val_loss: 0.0262 - val_acc: 0.9904
      Test loss: 0.026189500172245608
      Test accuracy: 0.9904
      MXNet Backend: Successfully exported the model as MXNet model!
      MXNet symbol file -  mnist_cnn-symbol.json
      MXNet params file -  mnist_cnn-0000.params

      . . .

      Model input data_names and data_shapes are:
      data_names :  ['/conv2d_1_input1']
      data_shapes :  [DataDesc[/conv2d_1_input1,(128L, 28L, 28L, 1L),float32,NCHW]]

      . . .

      Note: In the above data_shapes, the first dimension represent the batch_size used for model training.
      You can change the batch_size for binding the module based on your inference batch_size.
Complete detailed logs.
A copy of the model is also saved at samples/mnist/training/mxnet/saved_model.
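The training log is long, and the figures that matter most are on the line that ends each epoch. As a minimal sketch, the helper below pulls the loss and accuracy values out of the output of kubectl logs mxnet-mnist. It assumes the metric names that appear in this log (loss, acc, val_loss, val_acc); other Keras versions may print different names, and the function name is hypothetical.

```python
import re

# Matches only the end-of-epoch summary, which is the one place where
# training and validation metrics appear together.
EPOCH_END = re.compile(
    r"loss: (?P<loss>\d+\.\d+) - acc: (?P<acc>\d+\.\d+)"
    r" - val_loss: (?P<val_loss>\d+\.\d+) - val_acc: (?P<val_acc>\d+\.\d+)"
)


def epoch_summaries(log_text):
    """Return one dict of float metrics per completed epoch found in the log."""
    return [
        {name: float(value) for name, value in match.groupdict().items()}
        for match in EPOCH_END.finditer(log_text)
    ]


# Two lines copied from the log shown in this document.
sample = (
    "59776/60000 [============================>.] - ETA: 0s - loss: 0.0398 - acc: 0.9882\n"
    "60000/60000 [==============================] - 14s 232us/step - loss: 0.0398"
    " - acc: 0.9882 - val_loss: 0.0262 - val_acc: 0.9904\n"
)
for summary in epoch_summaries(sample):
    print(summary["val_acc"])
```

Intermediate progress lines carry no validation metrics, so they are skipped and only completed epochs are returned.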
- Runs the python /tmp/mnist_cnn.py command (specified in the Dockerfile and available at samples/mnist/training/mxnet/mnist_cnn.py)
  - Downloads the MNIST training and test data sets from S3.
    - Each set has images and labels that identify each image
  - Performs supervised learning
    - Runs 12 epochs using the training data with the specified parameters
    - For each epoch
      - Reads the training data
      - Builds the training model using the specified algorithm
      - Feeds the test data and compares the predictions with the expected output
      - Reports the accuracy, which is expected to improve with each run
  - Exports the trained model to the /mnist_model directory on a worker node. The model consists of the mnist_cnn-0000.params and mnist_cnn-symbol.json files. These are needed for inference.
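The steps above can be sketched in Keras as follows. This is not a copy of mnist_cnn.py: the network layout is an assumption modelled on the widely used Keras MNIST CNN example, and the save_mxnet_model call reflects the keras-mxnet export helper as I understand it, so check both against the real script. The batch size of 128 and the 12 epochs are taken from the log in this document; batches_per_epoch is a hypothetical helper added for illustration.

```python
import math

BATCH_SIZE = 128   # the log advances in 128-sample steps
NUM_CLASSES = 10   # digits 0 through 9
EPOCHS = 12        # "Epoch 1/12" in the log


def batches_per_epoch(num_samples, batch_size=BATCH_SIZE):
    """Number of weight updates in one pass over the training data."""
    return int(math.ceil(num_samples / float(batch_size)))


def train_and_export(prefix="mnist_cnn"):
    """Train a small CNN on MNIST and export it for MXNet inference."""
    # Imported lazily so the helper above works without Keras installed.
    import keras
    from keras.datasets import mnist
    from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
    from keras.models import Sequential

    # Download the data: images plus the labels that identify them.
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255
    x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255
    y_train = keras.utils.to_categorical(y_train, NUM_CLASSES)
    y_test = keras.utils.to_categorical(y_test, NUM_CLASSES)

    # Assumed architecture; the real script may differ.
    model = Sequential([
        Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.25),
        Flatten(),
        Dense(128, activation="relu"),
        Dropout(0.5),
        Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adadelta",
                  metrics=["accuracy"])

    # Supervised learning: the test set is evaluated after every epoch.
    model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=1,
              validation_data=(x_test, y_test))
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss: %s" % loss)
    print("Test accuracy: %s" % accuracy)

    # Assumption: keras-mxnet export helper; writes <prefix>-symbol.json
    # and <prefix>-0000.params.
    return keras.models.save_mxnet_model(model=model, prefix=prefix, epoch=0)
```

With 60000 training samples and a batch size of 128, batches_per_epoch(60000) gives 469 updates per epoch, which is why the progress counter in the log climbs in steps of 128 up to 60000.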