Doing GitOps at Scale with Argo CD

Alexander Matyushentsev
Argo Project
Jan 28, 2019

Argo CD is a GitOps continuous delivery tool for Kubernetes. The core component of Argo CD is the Application Controller, which continuously monitors running applications and compares the live application state against the desired target state defined in the Git repository. This powers the following use cases:

  • Automated deployment: the controller pushes the desired application state into the cluster automatically, either in response to a Git commit, a trigger from a CI pipeline, or a manual user request.
  • Observability: developers can quickly see whether the application state is in sync with the desired state. Argo CD comes with a UI and CLI that make it easy to inspect an application and find differences between the desired and the current live state.
  • Operation: the Argo CD UI visualizes the entire application resource hierarchy, not just the top-level resources defined in the Git repo. For example, developers can see the ReplicaSets and Pods produced by a Deployment defined in Git. From the UI, you can quickly see Pod logs and the corresponding Kubernetes events. This turns Argo CD into a very powerful multi-cluster dashboard.

To perform its duties, the Application Controller needs to retrieve the desired resource manifests from Git, load the matching live resources from the actual Kubernetes cluster, compare the two, and update the corresponding Application CRD instance. Sounds easy, right? As it turns out, our initial straightforward implementation only worked well for a couple of clusters and a few dozen applications.

At Intuit, we now use Argo CD to manage hundreds of applications deployed across dozens of Kubernetes clusters. As our Kubernetes adoption increased, we made several design changes to improve performance at scale and reduce application reconciliation time from minutes to seconds.

First version

During the implementation of the first Argo CD version, we quickly found that some performance optimizations were needed even with a very small number of applications. We identified two obvious performance issues: cloning the Git repository could take time, and we could not afford to execute tens of Kubernetes API queries every time a user wanted to see an updated application state.

The cloning problem was solved by introducing the Repository Server. The Repository Server stores a copy of the Git repository, keeps it updated with new commits, and quickly serves application manifests to the Application Controller on request.

To avoid frequent Kubernetes API queries, we decided to store application resource state in the corresponding Application CRD instance. The Application Controller periodically reconciles the application state and saves the reconciliation results along with desired and live Kubernetes resource state in the Application CRD status field.
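
For illustration, the hypothetical Go sketch below shows the kind of data this puts on the status field; the type and field names are ours, not the actual Argo CD API:

```go
// Hypothetical types, for illustration only (not the real Argo CD API):
// the controller persists the comparison outcome plus the desired and live
// manifests for every resource directly on the Application CRD status.
package v1alpha1

// ApplicationStatus is what the controller writes after each reconciliation.
type ApplicationStatus struct {
	ComparisonResult ComparisonResult `json:"comparisonResult"`
}

// ComparisonResult summarizes the diff between Git and the cluster.
type ComparisonResult struct {
	Status    string          `json:"status"` // e.g. "Synced" or "OutOfSync"
	Resources []ResourceState `json:"resources"`
}

// ResourceState carries per-resource results, including full manifests.
type ResourceState struct {
	Group       string `json:"group"`
	Kind        string `json:"kind"`
	Namespace   string `json:"namespace"`
	Name        string `json:"name"`
	TargetState string `json:"targetState"` // manifest rendered from Git
	LiveState   string `json:"liveState"`   // manifest observed in the cluster
}
```

Keeping full manifests in the CRD makes reads cheap, but it also ties the size of an application to the etcd object size limit, which becomes relevant later in this story.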

More caching

The above design worked well for a few releases with slight modifications:

Argo CD supports various templating tools such as Kustomize, Helm, Ksonnet, etc. We learned that even when a clone of the Git repository is readily available, simply generating the YAML (e.g. running kustomize build) can still take time. To improve performance, the Repository Service started caching generated manifests in memory. This helped enormously, since the target state in Git does not change very often.
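
The sketch below shows the shape of such a cache; the structure and names are ours for illustration, not the actual Repository Service code. The important part is the key: manifests are looked up by resolved commit SHA and application path, so identical requests skip the templating step entirely.

```go
// A minimal, illustrative in-memory manifest cache (not the actual Repository
// Service code). Entries are keyed by the resolved commit SHA plus the
// application path, so identical requests reuse previously generated YAML.
package repocache

import (
	"fmt"
	"sync"
)

// ManifestCache maps "commit SHA | app path" to generated manifests.
type ManifestCache struct {
	mu    sync.RWMutex
	items map[string][]string
}

func NewManifestCache() *ManifestCache {
	return &ManifestCache{items: map[string][]string{}}
}

func cacheKey(commitSHA, appPath string) string {
	return fmt.Sprintf("%s|%s", commitSHA, appPath)
}

// Get returns the cached manifests for a revision and app path, if present.
func (c *ManifestCache) Get(commitSHA, appPath string) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	manifests, ok := c.items[cacheKey(commitSHA, appPath)]
	return manifests, ok
}

// Set stores freshly generated manifests, e.g. the output of `kustomize build`.
func (c *ManifestCache) Set(commitSHA, appPath string, manifests []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[cacheKey(commitSHA, appPath)] = manifests
}
```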

The next improvements were again related to the Repository Service. In order to check whether a given application's manifests were cached, the Repository Service needed to resolve ambiguous revisions (e.g. HEAD, master, v1.0) into concrete Git commit SHAs, which were incorporated into the cache key. To resolve a revision we executed something like:

git fetch upstream <branch-name> && git rev-parse <branch-name>

While the Repository Service was resolving a revision, it held a mutex on the repository and could not generate manifests for any other application from the same Git repository. To avoid this problem, we started using the go-git library to resolve the exact revision for a given Git branch or tag without cloning or checking out the repository. With this improvement, the Repository Service eliminated all lock contention when manifests were already in the cache.
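
The snippet below sketches how a branch or tag can be resolved to a commit SHA with go-git by listing the remote's references over the wire (roughly what git ls-remote does), so no clone or checkout is needed. It uses the current go-git import path and is an illustration rather than Argo CD's actual code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/config"
	"github.com/go-git/go-git/v5/plumbing"
	"github.com/go-git/go-git/v5/storage/memory"
)

// resolveRevision turns a branch or tag name into a commit SHA by asking the
// remote for its references, without cloning or checking out the repository.
func resolveRevision(repoURL, revision string) (string, error) {
	remote := git.NewRemote(memory.NewStorage(), &config.RemoteConfig{
		Name: "origin",
		URLs: []string{repoURL},
	})
	refs, err := remote.List(&git.ListOptions{})
	if err != nil {
		return "", err
	}
	for _, ref := range refs {
		if ref.Name() == plumbing.NewBranchReferenceName(revision) ||
			ref.Name() == plumbing.NewTagReferenceName(revision) {
			return ref.Hash().String(), nil
		}
	}
	return "", fmt.Errorf("revision %q not found in %s", revision, repoURL)
}

func main() {
	sha, err := resolveRevision("https://github.com/argoproj/argo-cd", "master")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(sha) // the commit SHA that master currently points to
}
```

(For annotated tags, the listed hash is the tag object rather than the commit it points to, so a production implementation would also need to peel the tag.)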

Controller Redesign

The first real scalability issue was actually reported by the community. It was common for a single application to become quite big and include hundreds of Kubernetes resources.

For example, the combined resource YAML produced by the Istio Helm chart is greater than one megabyte, the etcd object size limit. This issue was partially addressed by moving the cached resource state out of the Application CRD and into an in-memory cache. This made it possible to manage even giant applications using Argo CD, but it increased memory usage and still did not improve reconciliation performance: a single Istio reconciliation could take almost a minute, while our goal was to provide a real-time application view. At the same time, Argo CD adoption at Intuit kept growing, moving from dozens of applications to several hundred, with the number increasing on a daily basis.

Sample Knative application deployed with Argo CD

After multiple discussions, we decided to take a risk and redesign the Application Controller. The new design used the Kubernetes watch API and cached only essential, lightweight resource state in memory. While the Application Controller cannot store the whole cluster state in memory, it can store resource references, relationships between resources, and high-level sync and health statuses.
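
The sketch below illustrates the idea, not the actual controller code: watch a resource type through the Kubernetes watch machinery (here via a dynamic informer) and keep only identity and ownership information in memory instead of full manifests. Watching a single resource type and the exact fields kept are simplifications for the example:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeInfo is the lightweight state kept per resource: identity and ownership,
// enough to build the resource tree and compute statuses, but no manifests.
type nodeInfo struct {
	Group, Kind, Namespace, Name string
	OwnerUIDs                    []string
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch Deployments cluster-wide; a real controller watches every API resource.
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	informer := factory.ForResource(gvr).Informer()

	nodes := map[string]nodeInfo{} // keyed by object UID

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			un := obj.(*unstructured.Unstructured)
			var owners []string
			for _, ref := range un.GetOwnerReferences() {
				owners = append(owners, string(ref.UID))
			}
			// Store only lightweight fields, never the full object.
			nodes[string(un.GetUID())] = nodeInfo{
				Group: gvr.Group, Kind: un.GetKind(),
				Namespace: un.GetNamespace(), Name: un.GetName(),
				OwnerUIDs: owners,
			}
			fmt.Printf("cached %s/%s (%d nodes tracked)\n", un.GetNamespace(), un.GetName(), len(nodes))
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep watching
}
```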

The redesign took some time and we encountered a lot of surprises! Kubernetes resources have multiple API versions; there is a deprecated extensions resource group whose resources also appear in the apps, networking.k8s.io, and policy groups; and so on and so forth.

The performance improvement was very impressive! Istio reconciliation took ~0.5 seconds, compared to around one minute before the changes. Now, Argo CD is able to present real-time application status and can be used as a very responsive Kubernetes troubleshooting tool.

In addition to performance improvements, the Application Controller itself uses less memory.

Removing bottlenecks

After some initial dogfooding, it was time to upgrade the first production Argo CD instance with ~150 applications on it. As you might guess, something went wrong! Even after the upgrade, there was almost no change in reconciliation speed. It still took around 40 seconds from Git commit to application refresh. After a short brainstorming session, we found a performance bottleneck where we never expected it.

During each app reconciliation, the Application Controller was performing several very lightweight Kubernetes API queries to retrieve settings from Argo CD's ConfigMap. Although the queries were lightweight, we were making a lot of them, which triggered the Kubernetes Go client's built-in throttling and limited our request rate to the default of 5 QPS (queries per second). We migrated the queries to the informers framework and eliminated the bottleneck. The patch release was pushed to the production Argo CD instances and we finally got the performance we expected: ~150 applications were reconciled in less than 5 seconds! Time to celebrate?
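
For illustration, a settings read served from an informer cache looks roughly like the sketch below; the namespace and ConfigMap name are assumptions and this is not the actual Argo CD code. Once the informer has synced, repeated reads come from the local cache and never hit the client-side rate limiter:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory scoped to the namespace that holds the settings.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 10*time.Minute, informers.WithNamespace("argocd"))
	cmLister := factory.Core().V1().ConfigMaps().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served from the informer's local cache; no API request, no throttling.
	cm, err := cmLister.ConfigMaps("argocd").Get("argocd-cm")
	if err != nil {
		panic(err)
	}
	fmt.Println(cm.Data)
}
```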

The next surprise happened a few days after upgrading several Argo CD production instances: the Repository Service pod suddenly started using significantly more memory and got killed after reaching its limits.

The memory spike was caused by an existing issue that had never surfaced before. The Repository Service forks a new process (e.g. helm template, kustomize build, or ks show) to render Kubernetes manifests, and each such execution causes a spike in memory usage. With the improved Application Controller performance, the Repository Service became the new bottleneck: it started receiving significantly more concurrent requests and could not handle the load.

The solution had two parts: 1) introduce the ability for the Repository Service to use an optional, shared Redis cache; 2) add concurrent request throttling to the Repository Service. With the throttling and the shared cache, it now makes sense to run multiple instances of the Repository Service: each instance uses a limited amount of memory and contributes manifests to the shared cache. As a side benefit, we also got high availability!
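
A rough sketch of both parts is shown below, assuming a go-redis client and a buffered-channel semaphore for throttling; the names, limits, and Redis address are illustrative rather than the actual implementation:

```go
// An illustrative sketch of the Repository Service fixes, not the actual code:
// a shared Redis cache lets every replica reuse generated manifests, and a
// buffered-channel semaphore caps how many templating processes run at once.
package reposerver

import (
	"context"
	"os/exec"
	"time"

	"github.com/redis/go-redis/v9"
)

var (
	// Assumed Redis address; in practice this would come from configuration.
	rdb = redis.NewClient(&redis.Options{Addr: "argocd-redis:6379"})
	// Allow at most 5 concurrent manifest generations per replica (assumed limit).
	sem = make(chan struct{}, 5)
)

// getManifests serves manifests from the shared cache when possible; otherwise
// it forks the templating tool under the semaphore and publishes the result.
func getManifests(ctx context.Context, cacheKey, appPath string) (string, error) {
	if cached, err := rdb.Get(ctx, cacheKey).Result(); err == nil {
		return cached, nil // another replica already generated these manifests
	}

	sem <- struct{}{}        // acquire: blocks while 5 generations are in flight
	defer func() { <-sem }() // release

	out, err := exec.CommandContext(ctx, "kustomize", "build", appPath).Output()
	if err != nil {
		return "", err
	}
	manifests := string(out)
	// Best-effort write so other replicas can serve this key from Redis.
	_ = rdb.Set(ctx, cacheKey, manifests, time.Hour).Err()
	return manifests, nil
}
```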

Huge thanks to the community!

All of the changes described are available in Argo CD v0.11, the most scalable and fastest release yet, which has been happily running at Intuit since early January. We hope that you enjoy the benefits of all the work and attention to detail that have gone into it.

Finally, a huge thanks to our community of users and contributors who steered us in the right direction. A lot of ideas were suggested in GitHub issues and helped us arrive at the right design. We also got a lot of very valuable feedback on the release candidates and a lot of help with pre-release testing!
