Cloud HPC¶
UM-Bridge provides a kubernetes-based solution for running any UM-Bridge model container on cloud platforms at HPC scale.
These instructions show how to use the UM-Bridge kubernetes configuration. They assume that you have a kubernetes cluster available, for example by following our instructions for setting up Google Kubernetes Engine (GKE).
Step 1: Clone UM-Bridge¶
First, clone the UM-Bridge repository by running:
git clone https://github.com/UM-Bridge/umbridge.git
In the cloned repository, navigate to the folder kubernetes. It contains everything needed in the following steps.
Step 2: Set up load balancer¶
First, retrieve HAProxy:
helm repo add haproxytech https://haproxytech.github.io/helm-charts
helm repo update
helm install kubernetes-ingress haproxytech/kubernetes-ingress \
--create-namespace \
--namespace haproxy-controller \
--set controller.service.type=LoadBalancer \
--set controller.replicaCount=1 \
--set defaultBackend.replicaCount=1 \
--set controller.logging.level=debug \
--set controller.ingressClass=haproxy
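To verify that the ingress controller has started before proceeding, you can list its pods (the namespace matches the one passed to helm above); they should reach the Running state:
kubectl get pods --namespace haproxy-controller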
Then, start HAProxy with the configuration provided by UM-Bridge:
kubectl apply -f setup/svc-ingress.yml
Step 3: Run model instances¶
To start model instances, run:
kubectl create -f model.yaml
You can check the status of the model instances by running:
kubectl get pods
They can be deleted again via:
kubectl delete -f model.yaml
The default model.yaml may be adjusted to run your own model by changing the following options (see the sketch after this list):
image: The Docker image containing a model with UM-Bridge support. Any image provided by the UM-Bridge project will work right away.
replicas: The number of model instances to run.
limits/requests: The CPU and memory resources your model should receive. Keep limits equal to requests to ensure that your model has exclusive access to a fixed amount of resources.
env: Environment variables that should be passed to the model.
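For orientation, a heavily abbreviated sketch of how these fields typically appear in a Kubernetes deployment spec is shown below; the image name, resource values, and environment variable are placeholders, and the actual model.yaml in the repository may differ in structure and naming:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model
spec:
  replicas: 4                                  # number of model instances
  selector:
    matchLabels:
      app: model
  template:
    metadata:
      labels:
        app: model
    spec:
      containers:
      - name: model
        image: example/my-umbridge-model:latest   # placeholder UM-Bridge model image
        resources:
          requests:                            # keep requests equal to limits
            cpu: "1"
            memory: 1Gi
          limits:
            cpu: "1"
            memory: 1Gi
        env:
        - name: EXAMPLE_VARIABLE               # placeholder environment variable
          value: "1"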
Step 4: Calling the model¶
The model instances are now available through your load balancer’s IP address, which you can determine from:
kubectl get services --namespace=haproxy-controller
The model instances may be accessed from any UM-Bridge client, and up to replicas requests will be handled in parallel.
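For example, a Python client could connect as sketched below. This assumes the umbridge Python package is installed (pip install umbridge); the model name and input values are placeholders that depend on the model container you deployed:
import umbridge

# Replace with your load balancer's external IP address
url = "http://<LOAD-BALANCER-IP>"

# The model name is a placeholder; consult your model container's documentation
# (or umbridge.supported_models(url)) for the names it actually serves.
model = umbridge.HTTPModel(url, "forward")

print(model.get_input_sizes())
print(model.get_output_sizes())

# Evaluate the model on a single (placeholder) parameter vector
print(model([[0.0, 10.0]]))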
Multinode MPI on Cloud HPC¶
The instructions above work for any UM-Bridge model container, even ones that are MPI parallel. However, a single container is naturally limited to a single physical node. In order to parallelize across nodes (and therefore across containers) via MPI, the additional steps below are needed.
Step 1: mpi-operator base image¶
The multinode MPI configuration makes use of the mpi-operator from kubeflow. This implies that the model base image has to be built from one of the following base images, depending on the MPI implementation:
mpioperator/openmpi-builder
mpioperator/intel-builder
When separating the build stage from the final image (e.g. in a multi-stage Docker build), the corresponding runtime base images may be used for the latter:
mpioperator/openmpi
mpioperator/intel
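As a rough sketch, assuming Open MPI and a multi-stage Docker build, a model image could then be structured as follows; the source path, build commands, and binary locations are placeholders:
FROM mpioperator/openmpi-builder AS builder
# Build the MPI-parallel model and its UM-Bridge server (placeholder commands)
COPY . /src
RUN cd /src && make

FROM mpioperator/openmpi
# Copy the resulting binaries into the matching runtime image; the actual
# launcher/worker commands are set later in multinode-mpi-model.yaml (Step 4)
COPY --from=builder /src/server /opt/model/server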
Step 2: Deploy mpi-operator¶
In addition to choosing a suitable base image for the model, the mpi-operator needs to be deployed on the cluster:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
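To check that the operator came up (the manifest above typically deploys it into the mpi-operator namespace), you can run:
kubectl get pods --namespace mpi-operator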
Step 3: Setting up NFS¶
The multinode MPI setup mounts a shared (NFS) file system at the /shared directory of your model container, replicating a traditional HPC setup. The NFS server is set up via:
kubectl apply -f setup/nfs.yaml
Note: This assumes that a disk named ‘gce-nfs-disk’ has already been set up in GCE!
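If that disk does not exist yet, it can be created with gcloud; the size and zone are placeholders and should match your cluster:
gcloud compute disks create gce-nfs-disk --size=100GB --zone=<your-cluster-zone>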
In order to finish the setup, we need the IP address of the NFS service. We can get it from:
kubectl describe service
Enter the IP address you just retrieved in setup/nfs-pv-pvc.yaml, and (if needed) adjust the storage capacity to match the attached disk.
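For orientation, the relevant part of an NFS PersistentVolume definition typically looks like the sketch below; the IP address, capacity, and path are placeholders, and the actual file in the repository may differ:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 100Gi          # adjust to the attached disk
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2        # replace with the NFS service IP you retrieved
    path: "/"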
Then run:
kubectl apply -f setup/nfs-pv-pvc.yaml
Step 4: Running a job on the new cluster¶
The job configuration is located in multinode-mpi-model.yaml. It is largely analogous to model.yaml, except that both launcher and worker containers are configured. The relevant additional config options are (see the sketch after this list):
slotsPerWorker: The number of MPI ranks per worker container.
mpiImplementation: By default set to OpenMPI, but can be changed to Intel.
command: The launcher is expected to run the UM-Bridge server, which then should call mpirun for model evaluations. Workers should only execute the (pre-defined) sshd command in order to listen for requests from the launcher.
replicas: Expected to be 1 for the launcher, and the number of model instances you want to run for the workers.
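A heavily abbreviated sketch of how these options fit into a kubeflow MPIJob spec is shown below; the image name and commands are placeholders, and the actual multinode-mpi-model.yaml contains additional settings such as the /shared NFS volume mounts:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: model-JOB_INDEX               # JOB_INDEX is substituted by the launch script below
spec:
  slotsPerWorker: 4                   # MPI ranks per worker container
  mpiImplementation: OpenMPI          # or Intel
  mpiReplicaSpecs:
    Launcher:
      replicas: 1                     # exactly one launcher per job
      template:
        spec:
          containers:
          - name: launcher
            image: example/my-mpi-model:latest     # placeholder image
            command: ["/opt/model/server"]         # UM-Bridge server, calls mpirun internally
    Worker:
      replicas: 2                     # number of worker containers per job
      template:
        spec:
          containers:
          - name: worker
            image: example/my-mpi-model:latest     # placeholder image
            command: ["/usr/sbin/sshd", "-De"]     # pre-defined sshd command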
The multinode-mpi-model.yaml file describes a single launcher with a number of workers assigned to it. In order to run multiple jobs (each consisting of a launcher and multiple MPI-parallel workers), run the following script:
bash launch-multinode-mpi-model.sh
It creates a number of jobs from the multinode-mpi-model.yaml file, each time substituting JOB_INDEX for a unique index.
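The exact script is part of the repository; conceptually, it amounts to a substitution loop along these lines (the number of jobs is a placeholder):
#!/bin/bash
# Create several MPIJobs from the same template, each with a unique index
for JOB_INDEX in $(seq 1 4); do
  sed "s/JOB_INDEX/${JOB_INDEX}/g" multinode-mpi-model.yaml | kubectl create -f -
done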
Just as before, access to the model instances is now available via the load balancer’s IP address, as described in Step 4 of the single-node setup.
All jobs can be shut down via:
kubectl delete MPIJob --all