by Nicolas Guinoiseau | 3.1.2024
Machine learning models are used in all sorts of applications and services to solve problems that are too difficult to tackle with traditional computing techniques (image classification, pattern recognition, etc.). In the early stages of a project, using pre-trained machine learning models is a quick way to get started. While this practice can help you figure out the problem to be solved, you often discover that your problem goes beyond the basics and therefore requires more attention. Cloud services such as Amazon SageMaker and Microsoft Azure Machine Learning offer platforms that include end-to-end machine learning pipelines, data transformation, and model training with related APIs. However, these services can prove costly and inflexible in the long run.
For the past two years, I have worked on a service that visualizes and predicts different outcomes from companies' financial data and runs on a public cloud, and we have actively used Argo Workflows to solve various problems. I will demonstrate how we use it in our local machine learning experiments and dwell on the whys and whats.
Argo Workflows is one of the Argo projects, a collection of open source tools for Kubernetes. It is an "open source container-native workflow engine for orchestrating parallel jobs on Kubernetes", as they say themselves, and they provide a good bullet-point overview here.
One of our favourites is "Argo Workflows puts a cloud-scale supercomputer at your fingertips!". Let that sink in for a bit.
A typical machine learning training pipeline includes stages such as fetching data, transforming it, training a model, and evaluating the result. You might also want to try several hyperparameter sets, so let's add them! At a high level, the training pipeline that we'll create looks like this: get the raw data, transform it, generate all hyper-parameter combinations, and then train and evaluate a model for each combination in parallel.
Argo Workflows offers several ways to design multi-stage workflows: DAG, steps, or a combination of both. You can find a multitude of examples in the Argo Workflows repository. For our use case, I found steps more intuitive, and using them made the logic easier to read.
There are two ways to pass data between stages in a workflow: parameters (small values passed inline between steps) and artifacts (files stored in the artifact repository).
The artifact repository is used for persistent storage (training data, evaluation metrics, models, etc.) and supports a variety of solutions.
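To illustrate the difference, here is a sketch of a single template that produces both kinds of outputs; the image, paths, and names are placeholders of my own:

```yaml
# A single template producing both kinds of outputs
- name: produce
  container:
    image: alpine:3.19                  # placeholder image
    command: [sh, -c]
    args: ["echo 42 > /tmp/value.txt && echo 'col1,col2' > /tmp/data.csv"]
  outputs:
    parameters:                         # small values, passed inline between steps
      - name: a-value
        valueFrom:
          path: /tmp/value.txt
    artifacts:                          # files, stored in the artifact repository
      - name: a-file
        path: /tmp/data.csv
```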
We used the excellent Argo Workflows Quick Start tutorial to get started. If you want to follow along, you will need kubectl, argo, and a Kubernetes cluster of your choice. Argo Workflows requires a database and an artifact repository; we used the Quick Start defaults, Postgres and MinIO (left out here for brevity).
Going further, some familiarity with Kubernetes manifests is a plus!
The workflow-controller-configmap file offers a variety of configuration options; the one most relevant here is the artifact repository configuration.
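As a rough sketch, the relevant part of the ConfigMap could look like the following; the bucket and secret names are assumptions based on the Quick Start MinIO setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # default artifact repository used for all output artifacts
  artifactRepository: |
    s3:
      bucket: my-bucket                  # assumption: bucket created in MinIO
      endpoint: minio:9000               # the Quick Start MinIO service
      insecure: true
      accessKeySecret:
        name: my-minio-cred
        key: accesskey
      secretKeySecret:
        name: my-minio-cred
        key: secretkey
      # directory layout of stored artifacts, see the note on keyFormat further down
      keyFormat: "{{workflow.name}}/{{pod.name}}"
```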
In our example, each stage is defined as a single-step WorkflowTemplate. WorkflowTemplates are an easy way to reuse logic in different contexts, for instance for future hyper-parameter tuning or when continuously retraining models.
The training pipeline Workflow uses WorkflowTemplates via templateRefs. From here, there are several ways of linking it all together.
In the following sections, we define the WorkflowTemplates and the Workflow that will run, and look at how they reference each other. We'll also glance at a real hunky-dory way of using parameters.
In our first step, there are no input arguments. We output raw training data as an artifact. In a real-world example, you would most likely fetch this data from object storage or an API and then pass it on.
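A sketch of what this first WorkflowTemplate could look like; the name get-data-template, the image, and the inline data are placeholders of my own:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: get-data-template        # this name is referenced by the Workflow's templateRef
spec:
  templates:
    - name: get-data
      container:
        image: alpine:3.19       # placeholder; in practice an image that fetches real data
        command: [sh, -c]
        # placeholder: write some raw training data to a file
        args: ["echo 'feature1,feature2,label' > /tmp/raw-data.csv"]
      outputs:
        artifacts:
          - name: raw-data       # picked up by the next step as an input artifact
            path: /tmp/raw-data.csv
```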
The metadata.name will be referenced in the Workflow under the field spec.templates.steps[].templateRef. Please watch out for typos. ;)
Here is an example of how the hyper-params WorkflowTemplate can be written; we pass the different hyper-parameter values that we want to go through as environment variables.
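The following is only a sketch; the environment variable names, the image, and the inline script are placeholders of my own:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyper-params-template
spec:
  templates:
    - name: hyper-params
      container:
        image: python:3.11-slim          # placeholder image
        env:                             # the hyper-parameter values we want to go through
          - name: LEARNING_RATES
            value: "0.01,0.1"
          - name: MAX_DEPTHS
            value: "3,5"
        command: [python, -c]
        args:
          - |
            # build every combination and write them out as a JSON list
            import itertools, json, os
            lrs = os.environ["LEARNING_RATES"].split(",")
            depths = os.environ["MAX_DEPTHS"].split(",")
            combos = [{"learning_rate": lr, "max_depth": d}
                      for lr, d in itertools.product(lrs, depths)]
            with open("/tmp/hyper-params.json", "w") as f:
                json.dump(combos, f)
      outputs:
        parameters:
          - name: hyper-parameter-sets   # a JSON list, consumed via withParam later on
            valueFrom:
              path: /tmp/hyper-params.json
```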
In this case, we want the output parameter to be a list of JSON objects covering all the possible hyper-parameter combinations.
The evaluate WorkflowTemplate takes as input a parameter called params, which is one of the JSON objects previously output by the hyper-params step.
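A sketch of how this could look; the image and script are placeholders, and I have also assumed that the template receives the transformed training data as an input artifact and writes its evaluation metrics as an output artifact:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: evaluate-template
spec:
  templates:
    - name: evaluate
      inputs:
        parameters:
          - name: params                 # one JSON object of hyper-parameters
        artifacts:
          - name: training-data          # transformed data from the previous step
            path: /tmp/training-data.csv
      container:
        image: registry.example.com/train-evaluate:latest   # placeholder image
        command: [python, /app/train_evaluate.py]            # placeholder script
        args: ["{{inputs.parameters.params}}"]
      outputs:
        artifacts:
          - name: metrics
            path: /tmp/metrics.json
```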
The syntax for passing parameters from one step to the other is shown in the Workflow section.
Let's add the WorkflowTemplate resources to Kubernetes so that the Workflow we define next can use them: kubectl apply -k /path/to/, where the directory contains a kustomization.yaml listing our templates.
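A minimal kustomization.yaml for this could simply list the template manifests; the file names here are my own assumptions:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo                      # install the templates into the argo namespace
resources:
  - get-data-template.yaml
  - transform-template.yaml
  - hyper-params-template.yaml
  - evaluate-template.yaml
```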
All the WorkflowTemplates previously created are used in the Workflow's steps; you can see them referenced in spec.templates[].steps[].templateRef.name in the sketch below. Note that the name of a step and the name of the WorkflowTemplate used in that step do not have to be the same. As mentioned earlier, a WorkflowTemplate can be reused in any Workflow.
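Here is a sketch of how the full Workflow could be wired together; all names are my own assumptions, kept consistent with the template sketches above (a transform-template is assumed but not shown):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: transform-train-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        # first collection: runs on its own
        - - name: get-data
            templateRef:
              name: get-data-template
              template: get-data
        # second collection: transform and hyper-params run in parallel
        - - name: transform
            templateRef:
              name: transform-template
              template: transform
            arguments:
              artifacts:
                - name: raw-data
                  from: "{{steps.get-data.outputs.artifacts.raw-data}}"
          - name: hyper-params
            templateRef:
              name: hyper-params-template
              template: hyper-params
        # third collection: one evaluate pod per hyper-parameter set
        - - name: evaluate
            templateRef:
              name: evaluate-template
              template: evaluate
            arguments:
              parameters:
                - name: params
                  value: "{{item}}"
              artifacts:
                - name: training-data
                  from: "{{steps.transform.outputs.artifacts.training-data}}"
            withParam: "{{steps.hyper-params.outputs.parameters.hyper-parameter-sets}}"
```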
Three things about this Workflow:
First, if you are familiar with Kubernetes manifests, you will notice that spec.templates[].steps contains collections of collections. In this case, the step get-data will start running first, and then the steps transform and hyper-params will run in parallel once get-data has completed successfully.
Second, the arguments of the steps here all refer to a previous step's output parameter or artifact.
Third, for the step evaluate you will notice that its arguments.parameters[] is linked with a special field, withParam. It takes the value of the output parameter of the hyper-params step, item by item, which means that this step will create as many containers in parallel as there are hyper-parameter sets! More detailed example here.
So that's it for the workflow! You can now run it with argo submit -n argo path/to/transform-train.yaml and resubmit it whenever you want to run it again.
Docker images can be built quickly, but some external dependencies (Python packages) can take quite a long time to install. To be able to test things out as fast as possible, I recommend using multi-stage builds for your bigger images, for example the ones doing training and evaluation.
All the artifacts stored during a workflow run will be under a directory that (by default) is named after the workflow name plus pod name. You can modify this with the data.artifactRepository.*.keyFormat field in your workflow-controller-configmap.
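For example, a sketch of a custom layout using workflow variables, placed under the s3 section of the artifact repository configuration; the prefix and date fields are just one possible choice:

```yaml
# workflow-controller-configmap, inside the s3 artifact repository block
keyFormat: "experiments/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.name}}/{{pod.name}}"
```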
Now all you have to do is tweak your code or hyper-parameters, resubmit the workflow, and inspect the results. Repeat this over and over again until you succumb to madness or you get cool ROC values, whichever comes first!
This is the first blog post of a little series in which I will discuss Argo Workflows, parameter tuning, and regular machine learning model training sessions (on a schedule or event-based). The series will also tackle another project from the Argo team, Argo Events.