by Nicolas Guinoiseau | 3.1.2024
Machine learning models are used in all sorts of applications and services to solve problems that are too difficult to tackle with traditional computing techniques (image classification, pattern recognition, etc.). In the early stages of a project, using pre-trained machine learning models is a quick way to get started. While this practice can help you figure out the problem to be solved, you often discover that your problem goes beyond the basics and therefore requires more attention. Cloud services such as Amazon SageMaker and Microsoft Azure Machine Learning offer platforms that include end-to-end machine learning pipelines, data transformation, and model training with related APIs. However, these services can prove costly and inflexible in the long run.
For the past two years, I have worked on a service that visualizes and predicts different outcomes from companies' financial data and runs on a public cloud, and we have actively used Argo Workflows to solve various problems. I will demonstrate how we use it in our local machine learning experiments and dwell on the whys and whats.
Argo Workflows is one of the Argo projects, a collection of open source tools for Kubernetes. It is an "open source container-native workflow engine for orchestrating parallel jobs on Kubernetes", as they say themselves, and they provide a good bullet-point overview here.
One of our favourites is "Argo Workflows puts a cloud-scale supercomputer at your fingertips!". Let that sink in for a bit.
A typical machine learning training pipeline includes stages such as fetching data, transforming it, training a model, and evaluating the result. You might also want to try several hyperparameter sets, so let's add them! At a high level, the training pipeline that we'll create looks like this: get the raw data, transform it, generate all hyper-parameter combinations, and then train and evaluate a model for each combination in parallel.
Argo Workflows offers several ways to design multi-stage workflows: DAG, steps, or a combination of both. You can find a multitude of examples in the Argo Workflows repository. For our use case, I found steps more intuitive, and using them made the logic easier to read.
There are two ways to pass data between stages in a workflow: parameters (small values passed inline between steps) and artifacts (files stored in the artifact repository).
The artifact repository is used for persistent storage (training data, evaluation metrics, models, etc.) and supports a variety of solutions.
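To illustrate the difference, here is a sketch of a single template that produces both kinds of outputs; the image, paths, and names are placeholders of my own:

```yaml
# A single template producing both kinds of outputs
- name: produce
  container:
    image: alpine:3.19                  # placeholder image
    command: [sh, -c]
    args: ["echo 42 > /tmp/value.txt && echo 'col1,col2' > /tmp/data.csv"]
  outputs:
    parameters:                         # small values, passed inline between steps
      - name: a-value
        valueFrom:
          path: /tmp/value.txt
    artifacts:                          # files, stored in the artifact repository
      - name: a-file
        path: /tmp/data.csv
```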
We used the excellent Argo Workflows Quick Start tutorial to get started. If you want to follow along, you will need kubectl, argo, and a Kubernetes cluster of your choice. Argo Workflows requires a database and an artifact repository; we used the Quick Start defaults, Postgres and MinIO (left out here for brevity).
Going further, some familiarity with Kubernetes manifests is a plus!
The workflow-controller-configmap file offers a variety of configuration options; the one most relevant here is the artifact repository configuration.
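As a rough sketch, the relevant part of the ConfigMap could look like the following; the bucket and secret names are assumptions based on the Quick Start MinIO setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # default artifact repository used for all output artifacts
  artifactRepository: |
    s3:
      bucket: my-bucket                  # assumption: bucket created in MinIO
      endpoint: minio:9000               # the Quick Start MinIO service
      insecure: true
      accessKeySecret:
        name: my-minio-cred
        key: accesskey
      secretKeySecret:
        name: my-minio-cred
        key: secretkey
      # directory layout of stored artifacts, see the note on keyFormat further down
      keyFormat: "{{workflow.name}}/{{pod.name}}"
```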
In our example, each stage is defined as a single-step WorkflowTemplate. WorkflowTemplates are an easy way to reuse logic in different contexts, for instance for future hyper-parameter tuning or when continuously retraining models.
The training pipeline Workflow uses WorkflowTemplates via templateRefs. From here, there are several ways of linking it all together.
In the following sections, we define the WorkflowTemplates and the Workflow that will run, and look at how they reference each other. We'll also glance at a real hunky-dory way of using parameters.
In our first step, there are no input arguments. We output raw training data as an artifact. In a real-world example, you would most likely fetch this data from object storage or an API and then pass it on.
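A sketch of what this first WorkflowTemplate could look like; the name get-data-template, the image, and the inline data are placeholders of my own:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: get-data-template        # this name is referenced by the Workflow's templateRef
spec:
  templates:
    - name: get-data
      container:
        image: alpine:3.19       # placeholder; in practice an image that fetches real data
        command: [sh, -c]
        # placeholder: write some raw training data to a file
        args: ["echo 'feature1,feature2,label' > /tmp/raw-data.csv"]
      outputs:
        artifacts:
          - name: raw-data       # picked up by the next step as an input artifact
            path: /tmp/raw-data.csv
```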
The metadata.name will be referenced in the Workflow under the field spec.templates.steps[].templateRef. Please watch out for typos. ;)
Here is an example of how the hyper-params WorkflowTemplate can be written; we pass the different hyper-parameter values that we want to go through as environment variables.
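The following is only a sketch; the environment variable names, the image, and the inline script are placeholders of my own:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyper-params-template
spec:
  templates:
    - name: hyper-params
      container:
        image: python:3.11-slim          # placeholder image
        env:                             # the hyper-parameter values we want to go through
          - name: LEARNING_RATES
            value: "0.01,0.1"
          - name: MAX_DEPTHS
            value: "3,5"
        command: [python, -c]
        args:
          - |
            # build every combination and write them out as a JSON list
            import itertools, json, os
            lrs = os.environ["LEARNING_RATES"].split(",")
            depths = os.environ["MAX_DEPTHS"].split(",")
            combos = [{"learning_rate": lr, "max_depth": d}
                      for lr, d in itertools.product(lrs, depths)]
            with open("/tmp/hyper-params.json", "w") as f:
                json.dump(combos, f)
      outputs:
        parameters:
          - name: hyper-parameter-sets   # a JSON list, consumed via withParam later on
            valueFrom:
              path: /tmp/hyper-params.json
```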
In this case, we want the output parameter to be a list of JSON objects covering all the possible hyper-parameter combinations.
The evaluate WorkflowTemplate takes as input a parameter called params, which is one of the JSON objects previously output by the hyper-params step.
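A sketch of how this could look; the image and script are placeholders, and I have also assumed that the template receives the transformed training data as an input artifact and writes its evaluation metrics as an output artifact:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: evaluate-template
spec:
  templates:
    - name: evaluate
      inputs:
        parameters:
          - name: params                 # one JSON object of hyper-parameters
        artifacts:
          - name: training-data          # transformed data from the previous step
            path: /tmp/training-data.csv
      container:
        image: registry.example.com/train-evaluate:latest   # placeholder image
        command: [python, /app/train_evaluate.py]            # placeholder script
        args: ["{{inputs.parameters.params}}"]
      outputs:
        artifacts:
          - name: metrics
            path: /tmp/metrics.json
```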
The syntax for passing parameters from one step to the other is shown in the Workflow section.
Let's add the WorkflowTemplate resources to Kubernetes so that the Workflow we define next can use them: kubectl apply -k /path/to/, where the directory contains a kustomization.yaml listing our templates.
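A minimal kustomization.yaml for this could simply list the template manifests; the file names here are my own assumptions:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo                      # install the templates into the argo namespace
resources:
  - get-data-template.yaml
  - transform-template.yaml
  - hyper-params-template.yaml
  - evaluate-template.yaml
```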
All the WorkflowTemplates previously created are used in the Workflow's steps; you can see them referenced in spec.templates[].steps[].templateRef.name in the sketch below. Note that the name of a step and the name of the WorkflowTemplate used in that step do not have to be the same. As mentioned earlier, a WorkflowTemplate can be reused in any Workflow.
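Here is a sketch of how the full Workflow could be wired together; all names are my own assumptions, kept consistent with the template sketches above (a transform-template is assumed but not shown):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: transform-train-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        # first collection: runs on its own
        - - name: get-data
            templateRef:
              name: get-data-template
              template: get-data
        # second collection: transform and hyper-params run in parallel
        - - name: transform
            templateRef:
              name: transform-template
              template: transform
            arguments:
              artifacts:
                - name: raw-data
                  from: "{{steps.get-data.outputs.artifacts.raw-data}}"
          - name: hyper-params
            templateRef:
              name: hyper-params-template
              template: hyper-params
        # third collection: one evaluate pod per hyper-parameter set
        - - name: evaluate
            templateRef:
              name: evaluate-template
              template: evaluate
            arguments:
              parameters:
                - name: params
                  value: "{{item}}"
              artifacts:
                - name: training-data
                  from: "{{steps.transform.outputs.artifacts.training-data}}"
            withParam: "{{steps.hyper-params.outputs.parameters.hyper-parameter-sets}}"
```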
Three things about this Workflow:
First, if you are familiar with Kubernetes manifests, you will notice that spec.templates[].steps contains collections of collections. In this case, the step get-data will start running first, and then the steps transform and hyper-params will run in parallel once get-data has completed successfully.
Second, the arguments of the steps here all refer to a previous step's output parameter or artifact.
Third, for the step evaluate you will notice that its arguments.parameters[] is linked with a special field, withParam. It takes the value of the output parameter of the hyper-params step, item by item, which means that this step will create as many containers in parallel as there are hyper-parameter sets! More detailed example here.
So that's it for the workflow! You can now run it with argo submit -n argo path/to/transform-train.yaml and resubmit it whenever you want to run it again.
Docker images can be built quickly, but some external dependencies (Python packages) can take quite a long time to install. To be able to test things out as fast as possible, I recommend using multi-stage builds for your bigger images, for example the ones doing training and evaluation.
All the artifacts stored during a workflow run will be under a directory that (by default) is named after the workflow name plus pod name. You can modify this with the data.artifactRepository.*.keyFormat field in your workflow-controller-configmap.
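For example, a sketch of a custom layout using workflow variables, placed under the s3 section of the artifact repository configuration; the prefix and date fields are just one possible choice:

```yaml
# workflow-controller-configmap, inside the s3 artifact repository block
keyFormat: "experiments/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.name}}/{{pod.name}}"
```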
Now all you have to do is tweak your code or hyper-parameters, resubmit the workflow, and inspect the results. Repeat this over and over again until you succumb to madness or you get cool ROC values, whichever comes first!
This is the first blog post of a little series in which I will discuss Argo Workflows, parameter tuning, and regular machine learning model training sessions (on a schedule or event-based). The series will also tackle another project from the Argo team, Argo Events.