Start training an AI job on GCP AI Platform from scratch
Google Cloud Platform (GCP) has two important platforms for AI jobs. One is AI Platform, the traditional service; the other is the newer Vertex AI, a managed SaaS platform where AI jobs can be finished automatically with AutoML.
If you want more control over the training process, such as the algorithm and the cost, you would choose AI Platform. Vertex AI is smart, but it is expensive.
First, let's start up a GCP VM instance with Terraform. The training work needs a decent amount of memory, so I chose the "e2-medium" machine type. Here is my main.tf for Terraform.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {
  credentials = file("/home/xxx/terraform/gcp/hostcache.json")
  project     = "blissful-canyon-xxx"
  region      = "us-west1"
  zone        = "us-west1-b"
}

resource "google_compute_instance" "vm_instance" {
  name         = "terraform-instance"
  machine_type = "e2-medium"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2004-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {
    }
  }
}
Next, run the following commands to initialize the provider plugins, validate the configuration, start up the instance, and check the results.
$ terraform init
$ terraform validate
$ terraform apply
$ terraform show
After that, you can log into the instance via the gcloud CLI installed on your own PC.
$ gcloud compute ssh terraform-instance
Welcome to Ubuntu 20.04.5 LTS (GNU/Linux 5.15.0-1025-gcp x86_64)
I got an instance with 2 AMD cores and 4 GB of memory, as shown in the following info.
Basic System Information:
---------------------------------
Uptime : 0 days, 0 hours, 20 minutes
Processor : AMD EPYC 7B12
CPU cores : 2 @ 2249.998 MHz
AES-NI : ✔ Enabled
VM-x/AMD-V : ❌ Disabled
RAM : 3.8 GiB
Swap : 0.0 KiB
Disk : 9.6 GiB
Distro : Ubuntu 20.04.5 LTS
Kernel : 5.15.0-1025-gcp
After login, run the following commands to update the system.
$ sudo apt update
$ sudo apt upgrade
We are using Ubuntu 20.04, which has Python 3 installed by default, and we want to install pip3 as well.
$ sudo apt install python3-pip
Now, let's install tensorflow and other packages for machine learning.
$ sudo pip3 install tensorflow
$ sudo pip3 install pandas
$ sudo pip3 install scikit-learn
$ sudo pip3 install google-cloud-storage
$ sudo apt install graphviz
$ sudo pip3 install pydot
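To confirm that the installs went well, a quick sanity check like the following (it just prints the package versions) should run without errors.

import tensorflow as tf
import pandas as pd
import sklearn

# Print the versions to confirm the packages import cleanly
print('tensorflow:', tf.__version__)
print('pandas:', pd.__version__)
print('sklearn:', sklearn.__version__)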
In order to submit training jobs to AI Platform, we need to install the gcloud CLI on this newly created instance. See Google's installation guide (https://cloud.google.com/sdk/docs/install) for how to install it.
After the gcloud CLI is installed, run the following command to authorize it with GCP.
$ gcloud init
Then run the following command to authorize the client libraries with GCP as well.
$ gcloud auth application-default login
You will see output like the following:
Credentials saved to file: [/home/xxx/.config/gcloud/application_default_credentials.json]
These credentials will be used by any library that requests Application Default Credentials (ADC).
Quota project "blissful-canyon-xxx" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.
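To confirm that ADC is working, a minimal sketch like this should list your buckets via the google-cloud-storage client (it assumes the credentials set up above).

from google.cloud import storage

# The client picks up the Application Default Credentials automatically
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)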
You also have to create a storage bucket in GCP Cloud Storage. I created one named "mljobs" in the web console.
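If you prefer code over the console, a sketch like the following creates the bucket with the google-cloud-storage client; note that bucket names are globally unique, so you may need a name other than "mljobs".

from google.cloud import storage

client = storage.Client()
# Create the bucket in the same region as the training jobs
bucket = client.create_bucket('mljobs', location='us-west1')
print('Created bucket:', bucket.name)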
Next, download the dataset and upload it to Cloud Storage, as follows.
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
$ mv iris.data iris.csv
$ gsutil cp iris.csv gs://mljobs
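The raw iris.data file has no header row, so it is worth a quick look before training. This small check (using the pandas we installed earlier) prints the first rows and the row count.

import pandas as pd

# The file has no header row, so supply the column names ourselves
iris = pd.read_csv('iris.csv', sep=',',
                   names=["sepal_length", "sepal_width", "petal_length",
                          "petal_width", "species"])
print(iris.head())
print('rows:', len(iris))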
Create the project dir "iris" and put a source file in the subdir "src".
$ mkdir -p iris/src
$ cd iris/src
$ vi train.py
What's the code in train.py? It is just a simple SVM (support vector machine) implementation.
import pandas as pd
from sklearn import svm
import joblib
from google.cloud import storage
import sklearn

print('sklearn: {}'.format(sklearn.__version__))

# Create a Cloud Storage client to download the data and upload the model
storage_client = storage.Client()
# Download the data
public_bucket = storage_client.bucket('mljobs')
blob = public_bucket.blob('iris.csv')
blob.download_to_filename('iris.csv')
# Read the training data from the file
iris_data = pd.read_csv('./iris.csv', sep=',', names=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"])

# Separate the target labels from the feature columns
iris_label = iris_data.pop('species')

# We're using SVC (support vector classifier) from the SVM family
classifier = svm.SVC(gamma='auto')

# Train the model
classifier.fit(iris_data, iris_label)

# Save the model locally
model_filename = 'model.joblib'
joblib.dump(classifier, model_filename)

# Upload the model to the Cloud Storage bucket
bucket = storage_client.bucket('mljobs')
blob = bucket.blob(model_filename)
blob.upload_from_filename(model_filename)
Now run the script train.py locally to check whether it has any errors.
$ python3 train.py
sklearn: 1.2.0
This script generates a model file named "model.joblib" and uploads it to the Cloud Storage bucket.
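Before cleaning up, you can sanity-check the saved model by loading it back and predicting on one sample; a minimal sketch, run from the same directory while model.joblib still exists locally:

import joblib

# Load the model saved by train.py and classify one sample
classifier = joblib.load('model.joblib')
sample = [[5.1, 3.5, 1.4, 0.2]]  # sepal length/width, petal length/width
print(classifier.predict(sample))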
In the source dir, remove the newly generated files and create an empty file "__init__.py", which marks src as a Python package so that AI Platform can run the trainer as a module.
$ rm -f iris.csv model.joblib
$ touch __init__.py
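If your trainer ever needs pip packages that the AI Platform runtime doesn't already provide, you can also place a setup.py in the project dir; AI Platform packages the trainer with setuptools and installs the listed dependencies on the training machine. A minimal sketch (the install_requires list is only an illustration; the runtime used below already bundles these):

from setuptools import find_packages, setup

# AI Platform builds the trainer package with setuptools, so anything
# in install_requires is installed on the training machine first.
setup(
    name='iris-trainer',
    version='0.1',
    packages=find_packages(),
    install_requires=['pandas', 'scikit-learn', 'google-cloud-storage'],
)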
Next, go back to the project dir "iris" and create the bash script "submit.sh" with the following content.
#!/bin/bash
gcloud ai-platform jobs submit training iris_0001 \
--module-name=src.train \
--package-path=./src \
--staging-bucket=gs://mljobs \
--region=$(gcloud config get compute/region) \
--scale-tier=CUSTOM \
--master-machine-type=n1-standard-8 \
--python-version=3.7 \
--runtime-version=2.8
Please note that you have to use runtime version 2.8 (TensorFlow) and Python 3.7 here, because AI Platform doesn't support higher versions of them yet. Also note that job IDs like iris_0001 must be unique within the project, so bump the suffix for each new run.
Make this script executable and run it, and you will see output like the following.
$ chmod +x submit.sh
$ ./submit.sh
Job [iris_0001] submitted successfully.
Your job is still active. You may view the status of your job with the command
$ gcloud ai-platform jobs describe iris_0001
or continue streaming the logs with the command
$ gcloud ai-platform jobs stream-logs iris_0001
jobId: iris_0001
state: QUEUED
It says that the training job was submitted successfully. You can run the commands from the output to check the job's status and logs. For instance, streaming the logs shows:
$ gcloud ai-platform jobs stream-logs iris_0001
INFO 2022-12-19 03:01:44 +0000 service Validating job requirements...
INFO 2022-12-19 03:01:44 +0000 service Job creation request has been successfully validated.
INFO 2022-12-19 03:01:44 +0000 service Waiting for job to be provisioned.
INFO 2022-12-19 03:01:44 +0000 service Job iris_0001 is queued.
INFO 2022-12-19 03:01:48 +0000 service Waiting for training program to start.
NOTICE 2022-12-19 03:02:19 +0000 master-replica-0.gcsfuse Opening GCS connection...
NOTICE 2022-12-19 03:02:19 +0000 master-replica-0.gcsfuse Mounting file system "gcsfuse"...
NOTICE 2022-12-19 03:02:19 +0000 master-replica-0.gcsfuse File system has been successfully mounted.
INFO 2022-12-19 03:03:56 +0000 master-replica-0 Running task with arguments: --cluster={"chief": ["127.0.0.1:2222"]} --task={"type": "chief", "index": 0} --job={ "scale_tier": "CUSTOM", "master_type": "n1-standard-8", "package_uris": ["gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz"], "python_module": "src.train", "region": "us-west1", "runtime_version": "2.8", "run_on_raw_vm": true, "python_version": "3.7"}
INFO 2022-12-19 03:04:05 +0000 master-replica-0 Running module src.train.
INFO 2022-12-19 03:04:05 +0000 master-replica-0 Downloading the package: gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz
INFO 2022-12-19 03:04:05 +0000 master-replica-0 Running command: gsutil -q cp gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz src-0.0.0.tar.gz
...
INFO 2022-12-19 03:04:12 +0000 master-replica-0 Running command: python3 -m src.train
INFO 2022-12-19 03:04:14 +0000 master-replica-0 sklearn: 1.0.2
INFO 2022-12-19 03:04:14 +0000 master-replica-0 Module completed; cleaning up.
INFO 2022-12-19 03:04:14 +0000 master-replica-0 Clean up finished.
INFO 2022-12-19 03:04:14 +0000 master-replica-0 Task completed successfully.
Note from the logs that the runtime ships its own scikit-learn (1.0.2) rather than the 1.2.0 we installed locally. Now check the job's status:
$ gcloud ai-platform jobs describe iris_0001
createTime: '2022-12-19T03:01:44Z'
endTime: '2022-12-19T03:10:03Z'
etag: 4gDNLPs20qk=
jobId: iris_0001
jobPosition: '0'
startTime: '2022-12-19T03:04:31Z'
state: SUCCEEDED
trainingInput:
  masterType: n1-standard-8
  packageUris:
  - gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz
  pythonModule: src.train
  pythonVersion: '3.7'
  region: us-west1
  runtimeVersion: '2.8'
  scaleTier: CUSTOM
trainingOutput:
  consumedMLUnits: 0.13
And you can check the job's run results in the web console.
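As a final check, you can pull the trained model back out of the bucket and use it for predictions anywhere the client libraries are authorized; a minimal sketch:

from google.cloud import storage
import joblib

# Download the model artifact that the training job uploaded
client = storage.Client()
blob = client.bucket('mljobs').blob('model.joblib')
blob.download_to_filename('model.joblib')

# Load it and classify a new sample (scikit-learn may warn if your local
# version differs from the runtime's 1.0.2 that produced the file)
classifier = joblib.load('model.joblib')
print(classifier.predict([[6.7, 3.1, 4.7, 1.5]]))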
You have now run a training job end to end on GCP AI Platform.
Alternatively, you can use GCP Vertex AI, which makes training an AI job much simpler; with its AutoML capability you don't even need to write a single line of code. But Vertex AI is more expensive than AI Platform, since you can't customize and optimize your code, and you can't control the training time, hardware specs, etc.