How To Install Distributed TensorFlow on GCP and Perform Chaos Engineering Experiments

Last Updated: May 16, 2018

Introduction

Gremlin is a simple, safe, and secure tool for running Chaos Engineering experiments to improve system resilience.

This tutorial shows how to:

Create a distributed TensorFlow cluster on GCP
Install Gremlin on Ubuntu 16.04
Perform a Chaos Engineering experiment on distributed TensorFlow

Chaos Engineering Hypothesis

You will run TensorFlow on a cluster of 4 nodes. While training is running, you will perform a Chaos Engineering experiment using Gremlin.

The experiment will shut down one of the TensorFlow nodes during the training of the model, simulating unplanned reboot/maintenance windows, health-check failures, and auto-shutoff.

The expected result is that the worker node is removed from the training cluster. Training will continue because you have created a TensorFlow cluster that handles instance shutdown.

Prerequisites

Before you begin this tutorial, you will need:

A Gremlin account (sign up here)
A GCP account
Google Cloud Shell

Step 1 - Create a new project and register for Cloud Machine Learning Engine and Compute Engine API in Google Cloud Platform

First you will create a new project. Log in to the Google Cloud web interface, navigate to the Google Cloud Resource Manager, and create a project called TensorFlow. Enable billing for your project. (At the time of this writing, GCP gives you $300 in credit to use within your first 12 months.)

Next, you will set up Google Cloud Shell for your project, replacing tensorflow-distributed-204203 with your project ID.

Open Google Cloud Shell and run the following:

bash

gcloud config set compute/zone us-west1-c
gcloud config set project tensorflow-distributed-204203
gcloud auth application-default login

Do you want to continue (Y/n)?  Y

Step 2 - Create the first GCP compute instance for your distributed TensorFlow cluster

In this step you will create an instance and install Python, TensorFlow, and Gremlin. This instance will later be used to create a base image for the cluster.

Run the following in Google Cloud Shell:

bash

gcloud compute instances create template-instance \
--image-project ubuntu-os-cloud \
--image-family ubuntu-1604-lts \
--boot-disk-size 10GB \
--machine-type n1-standard-2

When asked if you would like to enable Google APIs, type y for yes.

Wait until the instance status is RUNNING before moving on:

bash

NAME               ZONE        MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP   STATUS
template-instance  us-west1-c  n1-standard-2               10.138.0.2   35.197.52.22  RUNNING
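Rather than re-running the command by hand, you could poll for the status. The following is a minimal sketch, not part of the original tutorial: `wait_for_status` is a hypothetical helper name, and the attempt count and delay are arbitrary assumptions.

```bash
# Hypothetical helper: run a status command repeatedly until it prints the
# expected value, or give up after a fixed number of attempts.
# Usage: wait_for_status EXPECTED MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_status() {
  local expected max_attempts delay attempt status
  expected="$1"
  max_attempts="$2"
  delay="$3"
  shift 3
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Treat a failing status command as "not ready yet".
    status="$("$@" 2>/dev/null)" || status=""
    if [ "$status" = "$expected" ]; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (assumes the default zone set in Step 1):
# wait_for_status RUNNING 30 10 \
#   gcloud compute instances describe template-instance --format='value(status)'
```

The helper returns 0 as soon as the status matches and 1 if it never does, so it can gate the next command with `&&`.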

SSH to the instance using Google Cloud Shell:

bash

gcloud compute ssh template-instance

On the instance, install Python, pip, and TensorFlow:

bash

sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y python-pip python-dev
sudo pip install tensorflow

Step 3 - Installing the Gremlin Daemon and CLI

First, SSH into your server and add the Gremlin Debian repository:

echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list

Import the repo’s GPG key:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C81FC2F43A48B25808F9583BDFF170F324D41134 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6

Then install the Gremlin daemon and CLI:

sudo apt-get update && sudo apt-get install -y gremlind gremlin

After you have created your Gremlin account, you will need to find your Gremlin Daemon credentials. Log in to the Gremlin App using your company name and sign-on credentials. These were emailed to you when you signed up for Gremlin.

Navigate to Team Settings and click on your Team.

Store your Gremlin agent credentials as environment variables, for example:

export GREMLIN_TEAM_ID=3f242793-018a-5ad5-9211-fb958f8dc084
export GREMLIN_TEAM_SECRET=eac3a31b-4a6f-6778-1bdb813a6fdc
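Before going further, it can help to confirm that both variables are actually set in the current shell. This is a small sketch only: `require_env` is a hypothetical helper, not a Gremlin command.

```bash
# Hypothetical helper: report any of the named environment variables that are
# unset or empty. Returns 0 only when all of them have a value.
# Pass trusted variable names only, since they are expanded with eval.
require_env() {
  local name value missing
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# require_env GREMLIN_TEAM_ID GREMLIN_TEAM_SECRET && echo "Gremlin credentials are set"
```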

Step 5 - Create a GCP Cloud Storage Bucket

The Hello World of machine learning is MNIST. MNIST is a computer vision dataset with images of handwritten digits. You can train a model to look at these images and predict what digits they are.

Start a new Google Cloud Shell session by clicking the + button.

Create a Google Cloud Storage bucket to store your MNIST files ($RANDOM will generate a random number):

bash

MNIST_BUCKET="mnist-$RANDOM"
gsutil mb -c regional -l us-west1 gs://${MNIST_BUCKET}

Next you will clone the following repo created by the Google Cloud Platform team as a demo for model training:

bash

git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example
cd cloudml-dist-mnist-example

Use the following script to download the MNIST data files and copy them to your Cloud Storage bucket:

bash

sudo ./scripts/create_records.py
gsutil cp /tmp/data/train.tfrecords gs://${MNIST_BUCKET}/data/
gsutil cp /tmp/data/test.tfrecords gs://${MNIST_BUCKET}/data/

Step 6 - Create the TensorFlow base image that will be used to create additional instances

First, turn off auto-delete for the template-instance VM to preserve its disk before you delete it:

bash

gcloud compute instances set-disk-auto-delete template-instance \
--disk template-instance --no-auto-delete

Then delete template-instance:

bash

gcloud compute instances delete template-instance

Next, create an image called template-image from the template-instance disk:

bash

gcloud compute images create template-image \
--source-disk template-instance

Wait until the status is READY before progressing to the next step:

bash

NAME            PROJECT                         FAMILY  DEPRECATED  STATUS
template-image  tensorflow-distributed-204203                       READY

Step 7 - Create additional TensorFlow cluster nodes

Now you will create a TensorFlow distributed cluster using template-image. Create four instances named master-0, worker-0, worker-1, and ps-0.

bash

gcloud compute instances create \
master-0 worker-0 worker-1 ps-0 \
--image template-image \
--machine-type n1-standard-2 \
--scopes=default,storage-rw

You are using 4 n1-standard-2 instances so you don’t exceed GCP’s free trial quota. This will create machines with 2 virtual CPUs and 7.5GB of memory.

You are now ready to run distributed TensorFlow.

Now, view your running instances:

bash

gcloud compute instances list

You will see the following result:

bash

NAME        ZONE        MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
master-0    us-west1-c  n1-standard-2               10.138.0.3   35.233.232.66   RUNNING
worker-0    us-west1-c  n1-standard-2               10.138.0.2   35.199.187.171  RUNNING
worker-1    us-west1-c  n1-standard-2               10.138.0.5   35.230.98.162   RUNNING
ps-0        us-west1-c  n1-standard-2               10.138.0.4   35.230.3.243    RUNNING
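If you prefer a scripted check over reading the table, the sketch below reports which of the expected nodes are not in the RUNNING state. `not_running` is a hypothetical helper; it assumes its input is one "name status" pair per line, which is what `gcloud compute instances list --format='value(name,status)'` should produce.

```bash
# Hypothetical helper: read "NAME STATUS" lines on stdin and print every
# expected instance name (given as arguments) that is missing or not RUNNING.
# Exits non-zero when at least one expected instance is not running.
not_running() {
  awk -v expected="$*" '
    BEGIN { n = split(expected, names, " ") }
    { status[$1] = $2 }
    END {
      bad = 0
      for (i = 1; i <= n; i++) {
        if (status[names[i]] != "RUNNING") {
          print names[i]
          bad = 1
        }
      }
      exit bad
    }'
}

# Example:
# gcloud compute instances list --format='value(name,status)' \
#   | not_running master-0 worker-0 worker-1 ps-0
```

The same check can be rerun during the shutdown experiment to see which node, if any, has left the RUNNING state.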

Step 8 - Run a Chaos Engineering experiment using Gremlin before your TensorFlow model training is running

You will run a shutdown Chaos Engineering experiment on your distributed TensorFlow cluster using the Gremlin Control Panel.

Select Create Attack in the Gremlin Control Panel. Select “State” and then “Shutdown” in the dropdown menu.

The Shutdown attack shuts down or reboots the target host, depending on the settings you select. Before you can run the attack, choose the hosts to target: click Exact to pick specific hosts, or click Random to let Gremlin choose.

Click Exact and select a TensorFlow client in the list.

Your attack will begin to run. You can view its progress under Attacks in the Gremlin Control Panel.

Chaos Engineering Results

If you were running TensorFlow on only one node, you would not be able to continue your model training until you resolved this experiment. The distributed TensorFlow MNIST training script you will run in Step 10 shuts down and restarts each instance in the distributed TensorFlow cluster before training starts. It also saves the training state at checkpoints.

This demonstrates the need for a distributed TensorFlow cluster. Continuous Chaos Engineering will give you confidence that your training will continue when a machine shuts down. You can schedule a Gremlin Shutdown attack to run on a regular basis in the Gremlin Control Panel.

Step 9 - Set up Auto Scaling to Improve Cluster Reliability

Google Cloud lets you create instance groups for auto scaling.

When you set up instance groups and enable auto scaling, GCP automatically replaces downed instances for you.
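As a rough sketch of what this could look like with the gcloud CLI, the function below creates an instance template from template-image, a managed instance group, and an autoscaling policy. The names worker-template and worker-group, the group size, and the CPU target are all assumptions chosen for illustration, and the `run` wrapper is a hypothetical helper that only prints the commands when DRY_RUN is set. Instances in a managed group get generated names, so the training script may need adjusting if it expects worker-0 and worker-1.

```bash
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Illustrative only: worker-template and worker-group are made-up names, and
# the sizes and CPU target are assumptions, not values from this tutorial.
create_worker_group() {
  run gcloud compute instance-templates create worker-template \
    --image template-image \
    --machine-type n1-standard-2 \
    --scopes=default,storage-rw
  run gcloud compute instance-groups managed create worker-group \
    --template worker-template \
    --base-instance-name worker \
    --size 2
  run gcloud compute instance-groups managed set-autoscaling worker-group \
    --min-num-replicas 2 \
    --max-num-replicas 4 \
    --target-cpu-utilization 0.75
}

# Preview the commands without touching your project:
# DRY_RUN=1 create_worker_group
```

Running with DRY_RUN=1 first lets you review the three commands before they create billable resources.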

Step 10 - Run Distributed TensorFlow code and start training your model

Next you will run a script that trains your MNIST model across the distributed TensorFlow cluster. This will take a few minutes to run. When it is finished, you will be able to use your model for predictions.

Run the following command from the cloudml-dist-mnist-example directory:

bash

./scripts/start-training.sh gs://${MNIST_BUCKET}

This script pushes the code to each instance and sends the necessary parameters to start the TensorFlow process on each machine to join the distributed cluster.

When the training is done, the script prints the location of the newly generated model files:

bash

INFO:tensorflow:Finished evaluation at 2018-05-15-00:20:38
INFO:tensorflow:Saving dict for global step 10005: accuracy = 0.9929, global_step = 10005, loss = 0.033285327
Trained model is stored in gs://mnist-25648/job_180514_233407/export/Servo/1526343638/

Copy the Cloud Storage path of your trained model for use in later steps.
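If you saved the script output to a file, the path can also be pulled out automatically instead of copied by hand. A minimal sketch, assuming the final line keeps the "Trained model is stored in" prefix shown above; `extract_model_path` and the training.log file name are hypothetical.

```bash
# Hypothetical helper: read training output on stdin and print the path from
# the last "Trained model is stored in ..." line, if there is one.
extract_model_path() {
  sed -n 's/^Trained model is stored in //p' | tail -n 1
}

# Example (training.log is an assumed file name):
# ./scripts/start-training.sh gs://${MNIST_BUCKET} 2>&1 | tee training.log
# MODEL_BUCKET="$(extract_model_path < training.log)"
```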

Step 11 - Publish your model for predictions

For this step, you will use the Cloud Storage path from the previous step. Replace gs://${MNIST_BUCKET}/job_[TIMESTAMP]/export/Servo/[JOB_ID] with your own path.

For example:

bash

MODEL="MNIST"
MODEL_BUCKET=gs://mnist-25648/job_180514_233407/export/Servo/1526343638/

Next you will create a new v1 version of your model and point it to the model files in your Google Cloud bucket.

bash

gcloud ml-engine models create ${MODEL} --regions us-central1
gcloud ml-engine versions create \
 --origin=${MODEL_BUCKET} --model=${MODEL} v1

Set the default version of your model to v1:

bash

gcloud ml-engine versions set-default --model=${MODEL} v1

Your model is now running with Google Cloud ML and is able to produce predictions.

Step 12 - Execute predictions with Cloud Datalab

First you will create a Cloud Datalab instance to test your MNIST model predictions.

Create a Cloud Datalab instance in us-east1. You are using us-east1 so that you do not exceed the CPU quota of the Google Cloud free trial:

bash

gcloud config set compute/zone us-east1-c
datalab create mnist-datalab --no-create-repository

Next, launch the Cloud Datalab notebook by clicking the Google Cloud Shell button.

Then click Change Port, enter 8081, and click Change and Preview.

In the Cloud Datalab application, create a new notebook by clicking the +Notebook icon in the upper right. Paste the following text into the first cell of the new notebook:

bash

%%bash
wget https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-dist-mnist-example/master/notebooks/Online%20prediction%20example.ipynb
cat Online\ prediction\ example.ipynb > Untitled\ Notebook.ipynb

Click Run at the top of the page to download the online prediction example.ipynb notebook.

Refresh the page to load the new notebook content. Then select the first cell containing the JavaScript code and click Run to execute it.

Scroll down the page until you see the number drawing panel, and draw a number with your cursor.

Click in the next cell to activate it and then click the down arrow next to the Run button at the top and select Run from this Cell.

The prediction outputs a length-10 array: each index is a number (0–9) you might have drawn, and each value is the probability (0.0–1.0) that you drew that number. For example, if you drew a 6—and drew it well!—then output[6] should dwarf output[5] (unless your 6 looks a lot like a 5). It should really dwarf all the other values in the array, since, for example, 1 and 7 look much less like 6 than 5 does.
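To turn that array into a single answer, pick the index with the largest probability. The sketch below does this for ten values passed as arguments; `predicted_digit` is a hypothetical helper and the sample probabilities are invented for illustration, not real model output.

```bash
# Hypothetical helper: given the ten probabilities as arguments (index 0 first),
# print the index of the largest one, i.e. the digit the model considers most likely.
predicted_digit() {
  printf '%s\n' "$@" | awk '
    NR == 1 || $1 + 0 > max { max = $1 + 0; digit = NR - 1 }
    END { print digit }'
}

# Example with made-up probabilities for a well-drawn 6:
# predicted_digit 0.01 0.0 0.02 0.0 0.0 0.15 0.8 0.0 0.02 0.0
```

On ties, the first (lowest) index wins because only a strictly larger value replaces the current maximum.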

Step 13 - Create a backup Google Cloud Coldline Bucket for your model

First you will create a backup Google Cloud bucket for your model:

bash

gsutil mb -c coldline -l us-west1 gs://tensorflow-backup-bucket

Next you will use gsutil to replicate the bucket contents for your TensorFlow model to your backup bucket:

bash

gsutil rsync -d -r gs://mnist-25648/job_180514_233407/export/Servo/1526343638/ gs://tensorflow-backup-bucket

Building synchronization state

Conclusion

You have installed a distributed TensorFlow environment running the MNIST model and run a Chaos Engineering experiment on that cluster with Gremlin. Now you’re ready to explore other Gremlin Attacks.

Explore the Gremlin Community for more information on how to use Chaos Engineering with your application infrastructure. Meet engineers practicing Chaos Engineering in the Chaos Engineering Slack.
