Configuring a Paperspace ML-in-a-Box machine to run a deep learning showcase

Introduction

Paperspace is a cloud provider that specialises in GPU-enabled machines. It offers a set of publicly available images, including the ML-in-a-Box image, and these machines can be provisioned on several types of hardware, for example P4000, P5000, and P6000 machines.
The ML-in-a-Box image comes preconfigured with Keras, TensorFlow, TensorBoard, CUDA, and other standard machine learning frameworks. Unfortunately, that does not mean it can run any deep learning application out of the box.
In fact, while trying to run our own "deeplearning-showcase" project we ran into many problems, some of which required rather involved solutions. In this blog post we walk through how we provisioned the ML-in-a-Box machine and got our "deeplearning-showcase" up and running.

Setting up the Paperspace account

The first step is to sign up for Paperspace and set up a payment method. Head to the sign-up page on Paperspace and enter your details, then go to the payment methods page and click the "Add Card" button. After filling in the details, you can add a promo code if you have one. Some promo codes are easily found online and will add $10 to your account credit.

Request access to ML-in-a-Box Machines

The second step is to request access to the ML-in-a-Box images and the GPU-enabled machines. Do this as early as possible, since it takes some time for the request to be approved. Head to the console page and click the green "NEW MACHINE" button, choose a server location (e.g. CA1, NY2, AMS1), then navigate to "Public Templates". A pop-up window titled "Unavailable" will appear, asking why you want to use this machine. Close it for now, select the ML-in-a-Box Ubuntu 16.04 template, and then choose one of the available hardware specs (P4000, P5000, P6000).

Once you have done that, the same pop-up window will appear again. Read it carefully and fill in the reason you want access to these machines and image templates.

Starting the ML-in-a-Box machine

Some time after requesting access (it can take up to 24 hours) you will receive an email stating that you now have access to the requested service. Go back to the console page and follow these steps to provision your machine:

  1. Go to the console page again and click the "NEW MACHINE" button.
  2. Choose the server location, the OS (Public Templates > ML-in-a-Box Ubuntu 16.04), and the hardware you want (P5000 in our case).
  3. Choose the cost calculation method (hourly vs. monthly).
  4. Follow the Paperspace guide and choose your other preferences, then click to provision the new machine.
  5. Navigate back to the console page and wait for your machine to start, then hover over the machine info box and click "Launch".
  6. Now you are in your machine; a quick sanity check of the preinstalled stack is shown below.
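
Once you are inside the machine, it can be worth taking a quick look at what the image ships with before running anything. Exact versions may differ between image revisions, so treat this as an optional sanity check; nvcc is only found if the CUDA bin directory is already on your PATH:

pip3 list | grep -iE 'tensorflow|keras|dask|pandas'
nvcc --version          # CUDA toolkit version (if nvcc is on your PATH)
nvidia-smi              # driver version and GPU status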

Problems & Error Codes

After starting our ML-in-a-Box instance we tried to run our "deeplearning-showcase" using:

python3 train_DenseNet121.py

But we faced a number of issues, described below. All of them revolved around compatibility between TensorFlow on one side and CUDA and the machine's CPU on the other.

After solving all these problems we were able to run our showcase with 96% GPU utilisation and 97% GPU memory utilisation. The details of the deep learning showcase itself will be covered in another blog post.
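
As a side note, GPU and GPU memory utilisation figures like the ones above can be watched live from a second terminal while training runs, for example with:

watch -n 1 nvidia-smi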

Problem 1: Unable to execute TensorFlow callbacks

Problem Description:

While TensorFlow tries to execute the callbacks, it errors out on a missing attribute in the module 'pandas.core.computation'. The cause is an outdated version of Dask; see this GitHub issue for more details.
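
To check whether you are on an affected combination before changing anything, you can print the installed Dask and pandas versions (purely a diagnostic step):

python3 -c 'import dask, pandas; print("dask", dask.__version__, "pandas", pandas.__version__)'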

Stacktrace:

  File "train_DenseNet121.py", line 50, in <module>
    main()
  File "train_DenseNet121.py", line 30, in main
    validation=dnn.valid_batches)
  File "/home/paperspace/Desktop/deeplearning-showcase/model.py", line 116, in train
    write_graph=False)
  File "/home/paperspace/Desktop/deeplearning-showcase/model.py", line 305, in __init__
    super(TrainValTensorBoard, self).__init__(training_log_dir, **kwargs)
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 709, in __init__
    from tensorflow.contrib.tensorboard.plugins import projector
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/__init__.py", line 22, in <module>
    from tensorflow.contrib import bayesflow
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/bayesflow/__init__.py", line 28, in <module>
    from tensorflow.contrib.bayesflow.python.ops import layers
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/bayesflow/python/ops/layers.py", line 26, in <module>
    from tensorflow.contrib.bayesflow.python.ops.layers_dense_variational_impl import *
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational_impl.py", line 30, in <module>
    from tensorflow.contrib.distributions.python.ops import deterministic as deterministic_lib
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/__init__.py", line 37, in <module>
    from tensorflow.contrib.distributions.python.ops.estimator import *
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/python/ops/estimator.py", line 21, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.head import _compute_weighted_loss
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/__init__.py", line 92, in <module>
    from tensorflow.contrib.learn.python.learn import *
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/__init__.py", line 23, in <module>
    from tensorflow.contrib.learn.python.learn import *
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/__init__.py", line 25, in <module>
    from tensorflow.contrib.learn.python.learn import estimators
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/__init__.py", line 297, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.dnn import DNNClassifier
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 30, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import dnn_linear_combined
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 31, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import estimator
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 48, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io import data_feeder
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/__init__.py", line 21, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io.dask_io import extract_dask_data
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py", line 26, in <module>
    import dask.dataframe as dd
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/dask/dataframe/__init__.py", line 3, in <module>
    from .core import (DataFrame, Series, Index, _Frame, map_partitions,
  File "/home/paperspace/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py", line 40, in <module>
    pd.core.computation.expressions.set_use_numexpr(False)
AttributeError: module 'pandas.core.computation' has no attribute 'expressions'

Solution:

Upgrade to the latest version of Dask:

 pip install dask --upgrade 

Note: At the time of writing this post the latest version of Dask is 0.18.2. After running the above command you should see the following text as confirmation of the Dask installation:

Successfully installed dask-0.18.2
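
As an optional verification, importing dask.dataframe directly should now succeed without the AttributeError from the stacktrace above:

python3 -c 'import dask.dataframe; print("dask.dataframe imports fine")'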

Problem 2: TensorFlow unable to run AVX instructions on the CPU

Problem Description:

The machine has an older CPU that does not support AVX instructions. The following is a quote from the official TensorFlow release notes:

Breaking Changes
Prebuilt binaries will use AVX instructions. This may break TF on older CPUs.

For more details please take a look at this Stack Overflow post and this GitHub issue.
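
You can confirm whether the CPU of your instance supports AVX by looking at the CPU flags in /proc/cpuinfo; if the check reports no AVX, the prebuilt TensorFlow binaries mentioned in the release notes will crash as described below:

grep -q avx /proc/cpuinfo && echo "AVX supported" || echo "AVX not supported"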

Stacktrace:

Running the training with "python3 train_DenseNet121.py" prints the following line:

Illegal instruction (core dumped)

No further details or stacktrace may be visible.
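
If you want to double-check that an illegal instruction really is the cause, the kernel log usually records the crash; the exact wording can vary and dmesg may require sudo on some setups:

dmesg | grep -iE 'trap|illegal' | tail -n 5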

Solution:

Revert TensorFlow and TensorFlow GPU from version 1.8 to version 1.5:

pip3 uninstall tensorflow 
pip3 uninstall tensorflow-gpu 
pip3 install tensorflow==1.5 
pip3 install tensorflow-gpu==1.5 

To test the installation, run the following commands. The tensorflow-gpu package is imported as tensorflow, so there is no separate tensorflow-gpu module to import; use pip to confirm which packages and versions are installed:

python3 -c 'import tensorflow as tf; print(tf.__version__)'
pip3 list | grep tensorflow
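
To additionally confirm that TensorFlow sees the GPU, you can list the local devices with a small one-liner (this uses the device_lib helper that ships with TensorFlow 1.x); note that on this image the GPU will typically only show up once the CUDA mismatch described in Problem 3 below is resolved:

python3 -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'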

Problem 3: Unable to run TensorFlow with CUDA 9.1

Problem Description:

TensorFlow and TensorFlow GPU versions 1.5 and later are not compatible with CUDA 9.1. Looking at the TensorFlow release notes again shows that the prebuilt binaries are built against CUDA 9.0:

Breaking Changes
Prebuilt binaries are now built against CUDA 9.0 and cuDNN 7.
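
To see which CUDA version is currently installed on the machine, you can check the toolkit itself; the version file path assumes the default install location under /usr/local:

nvcc --version
cat /usr/local/cuda/version.txt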

Stacktrace:

ImportError: Could not find 'cudart64_90.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 9.0 from this URL: https://developer.nvidia.com/cuda-toolkit

Solution:

Revert to CUDA 9.0 by following this guide (a quick verification of the result is shown after the steps below).

  • Uninstall the old version of the CUDA Toolkit:
     sudo apt-get purge cuda 
     sudo apt-get purge libcudnn6 
     sudo apt-get purge libcudnn6-dev 
  • Install CUDA Toolkit 9.0 and cuDNN 7.0:
     wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb 
     wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb 
     wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb 
     wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/libnccl2_2.1.4-1+cuda9.0_amd64.deb 
     wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/libnccl-dev_2.1.4-1+cuda9.0_amd64.deb 
     sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb 
     sudo dpkg -i libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb 
     sudo dpkg -i libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb 
     sudo dpkg -i libnccl2_2.1.4-1+cuda9.0_amd64.deb 
     sudo dpkg -i libnccl-dev_2.1.4-1+cuda9.0_amd64.deb 
     sudo apt-get update 
     sudo apt-get install cuda=9.0.176-1 
     sudo apt-get install libcudnn7-dev 
     sudo apt-get install libnccl-dev 
  • Reboot the system to load the NVIDIA drivers:
     reboot 
  • Set up the PATH and LD_LIBRARY_PATH variables. You can also add these lines to the end of your .bashrc file:
     export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}} 
     export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} 
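
After the reboot and with the environment variables in place, a short verification round is worthwhile; assuming all the steps above completed without errors, the toolkit should now report release 9.0 and TensorFlow should import cleanly:

nvcc --version          # should now report release 9.0
nvidia-smi              # driver loads and sees the GPU
python3 -c 'import tensorflow as tf; print(tf.__version__)'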

Summary and Conclusion

Paperspace makes a strong case as a provider of GPU cloud machines, but its ML-in-a-Box instances are not plug-and-play. However, investing some time and effort in these machines can yield a very efficient environment for running deep learning applications. This blog post collects the information and resources needed to reduce the setup time for anyone planning to do deep learning on Paperspace's ML-in-a-Box instances.

