Argo Rollouts v1.1
Our biggest release ever!
The Argo team is proud to announce the availability of Argo Rollouts v1.1. Argo Rollouts is a Kubernetes progressive delivery operator providing advanced blue-green and canary deployment strategies with automated promotions, rollbacks, and metric analysis. Despite being just a “minor” release, v1.1 turned out to be Argo Rollouts’ biggest release ever, containing over a dozen significant features! Read on to see what’s new.
Rollout Notifications
Nearly two years ago, the Argo CD notifications project was started by Alexander Matyushentsev as a pet project to send notifications upon interesting Application events (e.g., when Applications degrade). Notification support had been a commonly requested feature in Argo CD, but we were reluctant to build native support for it in the core project since it required heavy integrations with many different notification providers along with complex configurations. So instead, we decided to spin notifications off as its own project and controller to let it evolve on its own.
That decision turned out to be the correct one, as the Argo CD notifications controller has evolved to become a powerful, standalone notification engine and library, able to be incorporated into any project (not only Argo-related ones). Rollouts is the first project to benefit from this extensible design and now has full support for notification services, including Slack, GitHub, Grafana, Microsoft Teams, e-mail, webhooks, and more!

For any Kubernetes Event that is emitted about a Rollout (examples include RolloutDegraded, RolloutStepCompleted, RolloutPaused, and AnalysisRunFailed), you can configure the Rollout to send a notification about that event using either our pre-built templates or your own templates with customizable metadata about the Rollout.
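As an illustration, a subscription might look something like the sketch below, which follows the annotation-based conventions of the Argo CD notifications engine; the trigger name, Slack channel, and token reference here are examples rather than a definitive configuration:
# Notification services are configured in a ConfigMap read by the rollouts controller
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  service.slack: |
    token: $slack-token   # resolved from the notifications secret
---
# Subscribe an individual Rollout to a trigger via an annotation:
# notifications.argoproj.io/subscribe.<trigger>.<service>: <recipient>
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: my-channel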
Dynamic Scaling of Stable ReplicaSet
One of the limitations of Rollout’s approach to ingress/mesh integrated canaries was that it left the stable stack fully scaled during the entire course of the update. This was intentional: it was a design goal to be able to abort and roll back to the stable immediately, at any point during the update, without delays related to pod scheduling or start time.
However, this meant that a canary Rollout would, at some point, double the number of replicas that were running (similar to a blue-green update). This behavior was a limiting factor for many Rollout users, and the ability to scale down the stable ReplicaSet as traffic shifted to the canary was one of our most requested features.
A new dynamicStableScale flag has been introduced to dynamically scale the stable ReplicaSet according to traffic weight:
spec:
  replicas: 10
  strategy:
    canary:
      dynamicStableScale: true
      trafficRouting:
        ...
      steps:
      - setWeight: 20
      - pause: {}
Using this flag, a Rollout will resize the stable ReplicaSet as traffic shifts away from it to the canary. In the above example, following the setWeight: 20 step, the stable ReplicaSet would be reduced to 8 replicas, since only 80% of traffic is directed to the stable. This feature makes Rollouts much more flexible to use in any environment.
Automated Rollbacks Without Analysis
Until now, it was only possible to automatically abort and roll back by coupling a Rollout with analysis. The AnalysisRun would need to fail, resulting in an automated rollback to stable. However, not all users (or workloads) are ready for this level of progressive delivery. Luckily, as it turns out, users already have a basic form of “analysis” in the form of pod readiness and availability. In fact, using these heuristics to perform automated rollbacks was even considered for Deployments as far back as Kubernetes 1.2!
It is now possible to automatically abort a Rollout simply by exceeding spec.progressDeadlineSeconds. No analysis needed! To use this feature, simply set progressDeadlineAbort to true:
spec:
  progressDeadlineAbort: true
  progressDeadlineSeconds: 600
Using this flag, if the Rollout ever fails to make progress within the specified deadline, it will automatically abort and shift 100% of traffic back to the stable.
Kustomize Open API Schema
For years, Kustomize lacked proper support for CRDs, making it difficult to use with Rollouts. Users who tried to apply a Kustomize patch to a Rollout in the same way they would a Deployment would see surprising differences in behavior, since Kustomize did not have the same understanding of a Rollout object as it did of Deployments.
Kustomize 4 has now addressed this by allowing users to specify an Open API schema in their kustomization.yaml. Rollouts now generates an Open API schema to “teach” Kustomize about the Rollout so that strategic merge patches can be applied the same way you expect with a Deployment.
To use this, simply reference the Open API schema, as well as the rollout transformer configuration, in your kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configurations:
- <path-to-directory>/rollout-transform.yaml
openapi:
  path: <path-to-directory>/rollout_cr_schema.json
With these two files, you can finally have the same experience working with Rollouts and Kustomize that you enjoy with Deployments.
Rollout Dashboard as a Service
By popular demand, the Rollout dashboard, first introduced in v1.0 as part of the kubectl-argo-rollouts plugin, can now be run as a standalone Kubernetes deployment/service. Note that we DO NOT recommend running the dashboard as a service in any production environment since it provides no authentication. The kubectl-argo-rollouts plugin is now published as a container image so that it can be deployed to a cluster (e.g., for demo purposes) or be invoked in CI/CD pipelines to perform actions such as promoting or pausing rollouts.
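For example, a bare-bones Deployment running the dashboard might look like the sketch below; the image tag, ServiceAccount, binary path, and port are illustrative, and the pod still needs RBAC permissions to read (and, if you want to promote or abort from the UI, modify) Rollouts:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts-dashboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: argo-rollouts-dashboard
  template:
    metadata:
      labels:
        app: argo-rollouts-dashboard
    spec:
      serviceAccountName: argo-rollouts-dashboard   # needs RBAC access to Rollouts
      containers:
      - name: dashboard
        image: quay.io/argoproj/kubectl-argo-rollouts:v1.1.0  # illustrative tag
        command: [/bin/kubectl-argo-rollouts]  # binary path assumed; adjust to the image entrypoint
        args: [dashboard]
        ports:
        - containerPort: 3100  # port the dashboard listens on by default
Expose it behind a (cluster-internal) Service or a port-forward to reach the UI.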

Controlling Scaledown Behavior During Aborts
One annoyance users had faced with Rollouts is that when a Rollout aborted, they could not control when, or whether, the canary/preview stack would scale down. Even worse, there were unintentional differences in behavior across the update strategies. This inconsistency has been rectified, and a new abortScaleDownDelaySeconds field has been introduced to control this behavior.
spec:
  strategy:
    canary:
      abortScaleDownDelaySeconds: 600
The field allows you to decide how long to keep the canary or preview stack scaled up in the event of an abort.
Analysis: AWS CloudWatch Metric Provider
AWS users, especially those using the AWS LoadBalancer Controller integration, will be pleased to know that Argo Rollouts now supports CloudWatch as a metric provider. It is possible to perform CloudWatch queries using the same syntax you are used to in the AWS console:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: 'all(result[0].Values, {# <= 0.01})'
    failureLimit: 3
    provider:
      cloudWatch:
        interval: 30m
        metricDataQueries:
        - id: rate
          expression: errors / requests
        - id: errors
          metricStat:
            metric:
              namespace: app
              metricName: errors
            period: 300
            stat: Sum
            unit: Count
          returnData: false
        - id: requests
          metricStat:
            metric:
              namespace: app
              metricName: requests
            period: 300
            stat: Sum
            unit: Count
          returnData: false
AWS TargetGroup IP Verification
One little-known fact about EKS and the AWS LoadBalancer Controller is that it is considered unsafe to modify the selectors of a Service after an Ingress has already been targeting that Service, specifically when the AWS Ingress is running in IP traffic mode. Read more to understand why. Since Argo Rollouts works by modifying the selectors of Services (e.g., as in a blue-green update), there is a risk that the Rollout might complete an update and scale down the old stack before the AWS LoadBalancer Controller has reconciled the changes to the Service selectors. This risk is considered low but may happen if the AWS LoadBalancer Controller is down or cannot complete its updates in time due to rate limiting.
To mitigate this risk, Argo Rollouts improves upon its existing weight verification feature (introduced in v1.0) with a new IP verification feature. Using IP verification (enabled with the same --aws-verify-target-group controller flag), the rollout controller performs additional AWS API calls to verify that all of the Service’s Endpoint IPs are correctly registered with the AWS TargetGroup. Only after it has confirmed the targets are registered properly will the Rollout controller deem it safe to scale down the old stack, thereby preventing the possibility of downtime.
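The flag is passed to the rollouts controller itself. The fragment below is a sketch of the relevant part of the controller Deployment, assuming the default install manifests; the controller also needs AWS credentials (e.g., via IRSA) that permit it to describe target group health:
# Fragment of the argo-rollouts controller Deployment (not a complete manifest)
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        image: quay.io/argoproj/argo-rollouts:v1.1.0  # illustrative tag
        args:
        - --aws-verify-target-group  # enables weight and TargetGroup IP verification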
Weighted Experiment Canary Steps
If you are an advanced canary analysis user, or are coming from Spinnaker, you may already be aware of the benefits and best practices of a baseline vs. canary comparison (as opposed to a production vs. canary comparison). In short, launching a new set of “baseline” pods at the same time as the canary pods, for the purposes of analysis, allows for an apples-to-apples comparison of metrics. It eliminates variables attributed to long-lived pods that would otherwise skew the analysis, such as hot caches, heap sizes, etc.
It has always been possible to perform an Experiment step to start N variations of your workload as part of an update. But now, in Rollouts v1.1, you can additionally leverage your Ingress or Service Mesh to split traffic N ways during the update, simply by specifying a weight for the Experiment template:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  replicas: 10
  strategy:
    canary:
      trafficRouting:
        smi: {}
      steps:
      - experiment:
          duration: 1h
          templates:
          - name: experiment-baseline
            specRef: stable
            weight: 5
          - name: experiment-canary
            specRef: canary
            weight: 5
In the above example, the Rollout runs a one-hour Experiment, splitting traffic three ways:
- 5% to the canary pod
- 5% to the newly created baseline pod
- 90% to the existing production pods
This enables fine-grained (i.e. lower risk) traffic splitting, while also allowing the canary to be compared equally against a baseline pod.
Istio: Multicluster Support
Istio supports a multicluster setup where the “primary” Istio cluster might be different from the cluster where the Rollout workloads are running. For Istio users running in this configuration, Argo Rollouts now supports managing VirtualServices that are deployed in a different cluster.
Istio: TLS Route Support
Istio supports VirtualService route rules for non-terminated TLS and HTTPS traffic. In Rollouts v1.1, you can now perform canary traffic splitting against these TLS routes. To do so, the Rollout should reference either the port number or the SNI hostname in the VirtualService.
canary:
  trafficRouting:
    istio:
      virtualServices:
      - name: rollouts-demo-vsvc
        tlsRoutes:
        # Match the port number and/or SNI hosts in the VirtualService
        - port: 3000
          sniHosts:
          - reviews.bookinfo.com
          - localhost
Istio: Multiple VirtualServices
A third improvement made to the Istio provider is that it is now possible for a Rollout to update multiple VirtualServices in lockstep. This is necessary for Istio setups where a service’s routes are managed in multiple VirtualService objects rather than just a single one. For example, a common use case is to deploy two VirtualServices to reach the service: the first for external ingress traffic and the second for internal mesh requests.
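A sketch of what this could look like under the canary strategy; the VirtualService and route names below are illustrative:
strategy:
  canary:
    trafficRouting:
      istio:
        virtualServices:
        # Both VirtualServices are updated in lockstep as the canary weight changes
        - name: rollouts-demo-ingress-vsvc  # external ingress traffic
          routes:
          - primary
        - name: rollouts-demo-mesh-vsvc     # internal mesh traffic
          routes:
          - primary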
AnalysisRun GC
A common annoyance reported by users was that a Rollout did not clean up old AnalysisRuns and Experiments aggressively enough, resulting in a lot of clutter from obsolete objects in the namespace. To address this, Rollouts now provides two knobs to garbage-collect old AnalysisRuns and Experiments:
spec:
  analysis:
    successfulRunHistoryLimit: 10
    unsuccessfulRunHistoryLimit: 10
Separate controls are provided to retain a different number of successful runs vs. unsuccessful ones (i.e., Failed, Error, Inconclusive). If omitted, five successful runs and five unsuccessful runs will be retained by default.
Analysis: Graphite Metric Provider
Finally, Rollouts continues to improve its integration library with native support for the popular Graphite monitoring tool.
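A minimal sketch of an AnalysisTemplate using the Graphite provider; the address, query, and success threshold here are illustrative and would need to be adapted to your own Graphite metrics:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: graphite-success-rate
spec:
  metrics:
  - name: graphite-success-rate
    interval: 1m
    # The condition is evaluated against the datapoint values returned by the query
    successCondition: result[0] >= 95.0
    failureLimit: 3
    provider:
      graphite:
        address: http://graphite.example.com  # your Graphite render endpoint
        query: |
          target=summarize(app.http.success_rate,'5min','avg')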
What’s Next
Rollouts v1.1 was an action-packed release that addressed many popular enhancement requests and usability suggestions from the community. You can download the release at our GitHub page. A special thanks to all of the maintainers and community contributors for making this release happen! As we look to the roadmap for 1.2, there’s a lot of work to be done to keep up with the ever-evolving Kubernetes networking space and API developments, and Rollouts is a bit overdue for some code cleanup and hygiene.