Train and predict with Cloud ML
- Install the census package
- Train the model locally
- View training results
- Label your run
- Train the model in Cloud ML
- Compare runs
- Improve model accuracy
- Scale up
- Hyperparameter tuning
- Deploy a model
- Use a deployed model to make predictions
- Next steps
In this tutorial, we’ll train a model with Google’s Cloud Machine Learning Engine (Cloud ML) and use the trained model to make prediction.
This tutorial is based on Cloud ML Engine - Getting Started and follows similar steps.
This tutorial uses Cloud ML to train models, which will incur
changes to your Google Cloud account. While these charges will be
nominal (under $5) if you’re concerned about the cost of training
in Cloud ML, feel free to skip any of the
This tutorial assumes the following:
- Guild AI is installed and verified
- Google Cloud SDK is installed (see verification step below)
- Your virtual environment is activated (if applicable)
- You have a working Internet connection
While not required, we recommend using a dedicated virtual environment for this tutorial. To setup your environment, see Tutorial virtual environments.
To verify you have the Google Cloud SDK
gcloud program installed and
configured, run the following:
gcloud ml-engine jobs list
If the command reports an error, resolve the issue before proceeding.
Install the census package
We’ll be training a model in the
cloudml.census package. From a
command line, run:
guild install cloudml.census
You can list the model operations available from the package by running:
guild ops cloudml.census
We’ll run each of these operations over the course of this tutorial.
Train the model locally
Before we jump into training in Cloud ML, let’s train our model locally. This will confirm that we can get a reasonable result with the data we’re using.
Train the model locally using 1,000 steps:
guild train census train-steps=1000
Guild lets you review the flag values for the operation before
ENTER to accept the values.
With only 1,000 training steps, the model should be trained on most systems in under a minute.
After the run has finished, list it by running:
The runs command is shorthand for runs list,
which shows your runs. If you trained other models you’ll see those
runs in the list as well. If you’d like to limit your listing to runs
associated with the
cloudml-census model, use:
guild runs -o census
This command shows run that have an operation containing the string
“census”. Feel free to experiment with other filters to show only the
runs you’re interested in. Try
guild runs --help for a
list of filter options.
View training results
The training run generates a number of files. Let’s view them by running:
guild runs info --files
runs info is used to view information for a run. By
default the command shows information for the latest run. The
--files option tells Guild to include file information in the
Here’s a breakdown of the files associated with our training run:
||Census data used for training and test|
||Latest TensorFlow checkpoint marker|
||TensorFlow event log generated during evaluation|
||TensorFlow event log generated during training|
||TensorFlow checkpoint at training step 1|
||TensorFlow checkpoint at training step 1000|
||Link to training scripts|
Run files are stored locally on your system. To view the full path to
run files, use the
guild runs info --files --full-path
This can be useful if you want to refer to a file from the command line (e.g. to copy it, open it, etc.)
Next we’ll use Guild View to view our training results. In a separate console, run:
This opens a browser window showing you the list of runs for the census classifier. You can see the list of files for a run by clicking the FILES tab.
If you have runs from other models, you can filter the runs that Guild
View displays using the
-o option (short for
limit runs for operations containing
guild view -o census
The view command will not terminate until you stop it by
Ctrl-C. To continue running Guild commands, run Guild
View from a separate console.
From Guild View, you can open TensorBoard by clicking in the upper left of the window.
In TensorBoard, note the model accuracy, which can be viewed under the SCALARS tab. The value for our first run should be approximately 80%.
You may keep the Guild View and TensorBoard windows open for the
remainder of this tutorial — they’ll automatically refresh as you
generate runs. When you no longer need them, close the browser windows
and stop Guild View by typing
Label your run
Over time you’ll generate a lot of runs and it’s helpful to label them for easy identification. Guild provides runs label, which associates a run with a simple string.
Let’s label our run to indicate that it was local and used 1,000 training steps:
guild runs label local-1000
Guild will confirm that you want to label the latest run — press
ENTER to apply the label.
You can view the new label when you list runs:
You can filter runs using their label by running
guild runs -l
LABEL is a label or part of a label to filter
on. For example, assuming we consistently label our local runs
using the convention
local-NNNN, we could filter them using
guild runs -l local.
Train the model in Cloud ML
Now that we’ve trained the model locally, let’s train it in Cloud ML!
When training in Cloud ML, we need a Cloud Storage bucket that you have write access to. The bucket is used to store the Cloud ML jobs.
For instructions on creating and configuring a bucket, see Set up your Cloud Storage bucket in the Cloud ML Getting Started guide.
We’ll use the name of your Cloud Storage bucket throughout this tutorial so let’s create a variable for it. Run this command to set the name of your bucket:
echo -n "Google Cloud Storage bucket name: " && read BUCKET
Verify that you can use
gsutil to write to the bucket by running:
echo 'It works!' | gsutil cp - gs://$BUCKET/guild_test \ && gsutil cat gs://$BUCKET/guild_test
If you have write access to
$BUCKET you should see:
When you’ve confirmed that you can write to the Cloud Storage bucket, train the census classifier in Cloud ML by running:
guild run census:cloudml-train \ bucket=$BUCKET \ train-steps=1000 \ --label cloud-1000
Review the operation flag values and press
ENTER to begin training.
cloudml-train operation is otherwise identical to the
operation, but it’s run remotely on Cloud ML rather than locally. The
bucket to know where to create the Cloud ML job.
Note that we included a label for our run using the
option. We’ll use this convention throughout the tutorial so we can
easily identity our runs. In this case, we’re indicating that the run
is cloud based and uses 1,000 steps.
Earlier we used
guild train census to train our model. The
train command is an alias for the
train operation, so our previous command was equivalent to running
guild run census:train. The
cloudml-train operation doesn’t
have an alias, so we have to use the run command in this
The remote operation will take longer to run because Cloud ML first provisions a job and waits for TensorFlow to start. This can take up to several minutes. You can view the run status as it progresses in the console.
You can also view the run status from Guild View and TensorBoard, which will both reflect the latest information.
As the Cloud ML job progresses, Guild automatically synchronizes with it. This includes downloading generated files such as TensorFlow event logs and saved models as well as updating run status.
If Guild becomes disconnected from a remote job — for example,
you terminate the command by typing
Ctrl-C or lose network
connectivity — the job will continue to run in Cloud ML. You can
synchronize with the job by running
guild sync. If you’d like
to reconnect to a running job, run
guild sync --watch. For
more information, see the sync command.
When the run has finished, view the list of runs:
You should see that the run
cloud-1000 has completed.
Each time you train a new model, you’ll want to compare the results to those of previous runs. Guild’s compare command can be used to quickly sort and compare training accuracies and loss.
Compare your runs by running:
This command starts Guild Compare, which is a spreadsheet-like application that displays run details. You can view run status, time, accuracy, loss, and hyperparameters. We’ll learn more about hyperparameters later in this tutorial.
You can use the arrow keys to navigate to cells that you can’t see.
In Guild Compare, note the accuracy for the two runs — they should be similar. We’d expect this because we trained the same model with the same flags and data. The only difference between the runs is where the training occurred!
To exit Guild Compare, press
As with the view command, compare will not exit
until you explicitly stop it. If you want to keep it running,
start the command in a separate console. Press
(refresh) whenever you want to update the display with the latest
For a complete list of commands supported by Guild Compare, press
? while running Guild Compare.
If you’d like to export the comparison data in CSV format run:
guild compare --csv
To save the comparison to a file, use your shell’s output redirection:
guild compare --csv > census-runs.csv
Use TensorBoard to compare runs
Let’s now compare run results in TensorBoard. If you have TensorBoard open from the previous run, it will automatically refresh to display the current training results. If you need to restart Guild View, run:
In Guild View, click VIEW IN TENSORBOARD to open TensorBoard.
You may alternatively open TensorBoard directly using the tensorboard command:
In TensorBoard, it helps to make two changes when viewing results for census training in this tutorial:
- In the left sidebar, uncheck Ignore outliers in chart scaling
- In the left sidebar, use the slider to set Smoothing to
You can also maximize the accuracy scalar by clicking under the chart.
The values for accuracy in TensorBoard correspond to the values displayed in Guild Compare. TensorBoard shows all available scalars for selected runs, including their values at various stages of training.
Improve model accuracy
Let’s try to improve the accuracy of the model with more training.
To increase the training steps, set the
train-steps flag to a higher
level. We’ll use 10,000 steps this time:
guild run census:cloudml-train \ bucket=$BUCKET \ train-steps=10000 \ --label cloud-10000
While this is a 10x increase, the training time Cloud ML should not be much longer than our previous cloud run.
The census model trains quickly even with 10,000 steps. The time
spent in Cloud ML for fast training models like
primarily the setup and teardown of the Cloud ML job.
You can monitor the training progress using Guild View, Guild Compare or TensorBoard.
As the training proceeds, the model accuracy should surpass that of the previous two runs.
Let’s use Guild Compare to confirm. If Guild Compare is already
r to refresh the display. If it’s not running,
Note the accuracy of the newly trained model — it should be a few
percentage points higher than the other two! When you’re done
q to exit Guild Compare.
Let’s take advantage of Cloud ML’s ability to scale! We can train our
model using distributed TensorFlow and multiple workers by setting the
scale-tier flag to a multi-worker tier.
See Scale Tier for more information on Cloud ML’s supported environments.
By default, Cloud ML operations in Guild AI use the
tier, which is suitable for small models and experimentation. We can
alternatively use the
STANDARD_1 scale tier, which uses multiple
workers to train the model in parallel.
Train the model again using 10,000 steps, but this time on the
STANDARD_1 scale tier:
guild run census:cloudml-train \ bucket=$BUCKET \ scale-tier=STANDARD_1 \ train-steps=10000 \ --label scaled-10000
This operation will take a similar amount of time as the previous operation, despite being run on a higher scale tier. This is because the job setup and teardown overhead dominates the overall operation time.
We can use TensorBoard to view the time spent in training. If TensorBoard isn’t already running, you can start it directly by running the following in a separate console:
In TensorBoard, in the left sidebar under Horizontal Axis, click RELATIVE. This will plot scalars using their relative times along the horizontal axis. This is useful for viewing time spent actually training as it does not include job setup and teardown time.
Note the relative times between the last two runs. Our latest run, which used a higher scale tier, should be noticeably faster for the same number of steps.
Cloud ML provides a feature called hyperparameter tuning, which can optimize training results by automatically modifying model hyperparameters. A hyperparameter is a model setting that can be configured with flags. The census model we’re training has three tunable hyperparameters:
||Number of nodes in the first layer of the DNN (default is 100)|
||Number of layers in the DNN (default is 4)|
||How quickly the size of the layers in the DNN decay (default is 0.7)|
Up to this point we’ve training using the default values for each flag. But what if we could improve our model’s predictive accuracy by using different values?
Cloud ML hyperparameter tuning tries different values for us using an algorithm that attempts to optimize the training result. In this case we want to maximize predictive accuracy.
For more information see Overview of Hyperparameter Tuning in the Cloud ML Engine documentation.
Let’s see if we can improve model accuracy accuracy using Cloud ML
hyperparameter tuning. We can use our model’s
operation for this.
The command below trains the census model 6 times, each time using 10,000 steps. While this is still a relatively low cost training run, if you’re concerned about the cost of this operation, feel free to skip this command.
Start a hyperparameter tuning run with 6 trials of 10,000 training steps each by running:
guild run census:cloudml-hptune \ bucket=$BUCKET \ max-trials=6 \ train-steps=10000 \ --label tune-6x-10000
This uses the default tuning configuration
provided by the
cloudml.census package and changes the default
max-trials count to 6 from 4. The
max-trials flag determines how many
trials are run during the hyperparameter tuning operation.
Each trial in Cloud ML is synchronized as a new Guild run, which you can treat like any other run.
As hyperparameter tuning progresses, use Guild Compare to view the trial results. As each trial is completed, it will appear as a new run with its own accuracy and loss. Guild automatically labels trial runs so you can identify the tuning run that generated them.
To view trial run results while the tuning operation is running, start Guild Compare in a separate console by running:
As the trials are completed, press
r in Guild Compare to refresh
the display and view the trial results. You can compare the
differences in accuracy and loss to see what hyperparameter values
provide the best result.
cloudml-hptune itself will not have accuracy or loss. It’s
job is limited to starting training run trials, which do have
accuracy and loss.
In Guild Compare, you can sort any column in numeric descending
order by moving the cursor to the column and pressing
is useful for sorting runs by accuracy — the most accuracy model
will appear at the top of the list. You can sort in ascending
order by pressing
!. This is useful for sorting runs by loss.
Note that you must re-order a column after you refresh the display.
Wait for the hyperparameter tuning operation to finish. In the next section we’ll deploy the trial that has the highest accuracy.
Deploy a model
Now that we’ve tuned our hyperparameters to find an optimal model (at least within the six trials that we ran) it’s time to deploy a model!
Using Guild Compare, select the run with the highest accuracy. You may
sort runs by accuracy by moving the cursor to the accuracy column
1. This will reorder the runs, displaying the most
accurate run first. Note its run ID (from the run column on the
far left) – we’ll use this ID for deployment.
Exit Guild Compare by pressing
Define a variable
DEPLOY_RUN with the run ID you selected:
echo -n "Run ID to deploy: " && read DEPLOY_RUN
When prompted, enter the run ID you’d like to deploy (e.g. the run with the highest accuracy).
When specifying run IDs in Guild, you don’t have to provide the entire run ID — you may use a run ID prefix as long as it’s unique. Usually the first few characters is enough to identify a run.
Verify that the specified run is the one you’d like to deploy. You can
confirm the accuracy for
DEPLOY_RUN by running:
guild compare $DEPLOY_RUN --table
The table should contain a single run. If it contains a run other than the one you want to deploy, re-enter the run ID using the step above.
When you’re ready, deploy the model for your selected run using the
guild run census:cloudml-deploy run=$DEPLOY_RUN bucket=$BUCKET
This will create a model in Cloud ML named
census_dnn if one
doesn’t already exist. This name is generated using the deployed model
census-dnn. Cloud ML doesn’t allow certain characters in
model names (e.g.
-) and Guild replaces these with an underscore
_). If you want to specify a different model name, use the
model flag when running the
After ensuring that a Cloud ML model exists, Guild creates a model
version. This deploys the trained model to Cloud ML. By default,
Guild uses a version containing the deployment run ID. If you want to
specify a different version, use the
version flag when running the
For more information on model deployment in Cloud ML, see:
Verify the deployed model and its version by running:
gcloud ml-engine versions list --model census_dnn
You should see the newly deployed
census_dnn model version.
Guild does not provide a complete interface for managing deployed Cloud ML models and versions. For a details on Cloud ML models and versions, see gcloud ml-engine command line reference.
List your current runs:
Note the latest run — it will be for the
and have a label that contains the deployed model run ID. Deployments
are like any other operation in Guild.
Let’s examine the
guild runs info
Note the following attributes in the output:
||Name of the deployed Cloud ML model|
||Version of the deployed Cloud ML model|
||Google Cloud Storage location of the deployed model binaries|
||Run ID of the deployed trained model|
This information can be used to verify a deployment and is retained as an artifact of the deploy operation. In the next section, we’ll see how this information is used to run predictions.
Use a deployed model to make predictions
In our final tutorial segment, we’ll use our recently deployed model to make predictions!
Cloud ML makes predictions by running the inference operation of deployed model with new data. Data are provided as one or more instances, each corresponding to a prediction that will be made using the trained model. You may provide instances as JSON or plain text format.
For details on how Cloud ML makes predictions, see Prediction Basics in Cloud ML concepts.
Cloud ML can make predictions in two ways: online and batch. Online predictions are made immediately and returned in the operation result. Batch predictions are submitted to Cloud ML as a job that runs in the background, similar to training. When a batched prediction job is completed, predictions are written to a file in the job directory.
In general, online prediction is faster if you have a small number of predictions to make while batch processing is faster when you need to make a large number of predictions. The specific performance characteristics will vary across models. Refer to Online prediction versus batch prediction in Cloud ML Prediction Basics for more information.
We’ll use both techniques to see how each works.
First, let’s run an online prediction:
guild run census:cloudml-predict
Guild will prompt you before running the operation. Press
run the operation.
The prediction is made using the latest deployed version of the
census-dnn model (
census_dnn in Cloud ML). The model package
provides sample inputs to make the prediction. You can specify your
own using the
To experiment with your own inputs, download the census prediction
and make changes accordingly. Assuming your samples is named
census-inputs.json and is located in the current directory, you can
cloudml-predict operation using:
guild run census:cloudml-predict instances=census-inputs.json
For online prediction, the results are printed to the console. However, they are also saved as a file in the run directory. Let’s take a look!
If Guild View isn’t already running, start it in a separate console by running:
In Guild View, select the latest run — it should be a
cloudml-predict operation — and click the FILES tab. The run
will have two files:
||The inputs used in the prediction|
||The prediction results|
prediction.results to see how the instances were
classified by the deployed model. The values correspond to the inputs
prediction.inputs. Click the NEXT button to view the
Guild View can be used to view the contents of some files: text files, images, and audio. Look for a grey button background on the file name, which can be clicked to open the file.
As you can see, the
cloudml-predict operation generates a run like
other operations. In this case the run contains information about the
model version used to make the prediction along with the prediction
inputs and outputs.
In the previous section we performed an online prediction that returned the result immediately. In this section we’ll use batch prediction. Batch prediction runs on Cloud ML as a job.
We use the
cloudml-batch-predict operation to start a prediction
job. Start a batch prediction job by running:
guild run census:cloudml-batch-predict bucket=$BUCKET
This will use the sample prediction inputs as before. We need to
bucket to tell Guild where to create the job.
The job will take some time to complete. When finished, it will save
generated a file named
prediction.results-00000-of-00001, which you
can view in Guild View, or inspect from the command line.
When the operation has finished, view the generated files by running:
guild runs info --files --full-path
--full-path option is used to show the full path to
prediction.results-00000-of-00001, which you can use to copy or view
You may use online or batch predictions as needed to use your deployed models in Cloud ML!
In this tutorial we worked with Google’s Cloud Machine Leaning Engine (Cloud ML) to train and deploy a classifier.
We covered a number of topics:
Train a model locally to sanity-check our results and then train in Cloud ML to scale up
Use hyperparameter tuning in Cloud ML to optimize our predictive accuracy with a set of hyperparameter values
Evaluate runs using Guild View, Guild Compare, and TensorBoard
Select an optimized model and deploy it to Cloud ML
Use a model deployed in Cloud ML to make predictions using new data
Guild AI models and their operations encapsulate the complexities of this process, letting you focus on the high level workflow and get your work done faster!
Over the course of this tutorial we generated a number of runs and Cloud ML jobs. If you no longer need these, you may delete them to free up resources.
Delete unneeded Cloud ML jobs
gsutil program to delete any jobs from your Cloud Storage
bucket that you don’t need. To delete all Guild related files, run:
gsutil rm -r gs://$BUCKET/guild_*
Delete unneeded Cloud ML model versions
gcloud program to delete any model versions you don’t need.
First, list the model versions:
gcloud ml-engine versions list --model census_dnn
For each version that you want to delete, run:
gcloud ml-engine versions delete --model census_dnn VERSION
VERSION is the model version you want to delete.
If you no longer need the model, you may delete it provided there are no versions associated with it:
gcloud ml-engine models delete census_dnn
Delete Guild runs
If you ran the tutorials from a virtual environment, you can simply delete the virtual environment directory, which will free up all disk space used by steps from this tutorial.
Only delete virtual environments when you’re certain you no longer need them.
If you want to only delete runs associated with the
model, which was used in this tutorial, run:
guild runs delete -o cloudml-census -p
-p option indicates that the delete should be
permanent. This ensures that the runs no longer consume disk
space. Omit this option if you want to retain the ability to restore
them at a later time. Note that disk space will not be freed up for
these jobs until you permanently delete them (see the purge
Finally, if you deleted any runs without using the
-p option and
you want to free up disk space consumed by them, you can can
permanently delete them by running:
guild runs purge
Review the list of runs carefully before permanently deleting them!
- Read Cloud ML Engine - Getting Started to view the same workflow using lower-level Cloud ML commands