Taking complex R applications into production using Argo Workflows

Pedro Szloma Herr Zaterka · Published in Argo Project · 6 min read · Jan 18, 2022

Hello there! My name is Pedro Zaterka and I’m a Research Data Scientist and Machine Learning Engineer at 4intelligence. We are a Brazilian startup focused on economics and AI, offering both consulting services and an AutoML platform we call FaaS (Forecast as a Service). The main bottleneck, and what Argo helped us solve, lies in the latter, but the two areas are intertwined, as our internal teams also rely heavily on FaaS.

Forecast as a Service — How does it work?

Our application consists of a no-code front-end interface where the user can upload a .csv, .xlsx or .rds file, get additional curated variables and send one time series at a time to be forecasted over the chosen horizon. Another input option is using one of our client libraries (in R or Python) to send multiple time series at once, and that is where one of our main concerns starts to appear: scalability.

Forecasts as shown inside our Forecast as a Service platform
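
To illustrate that second path, a batch submission through the R client looks roughly like the sketch below. The faas_submit() function and its arguments are hypothetical names used only to show the shape of the call (many series travelling in a single request), not the actual client interface.

    # Hypothetical sketch of sending several time series in one request via the R client.
    # faas_submit() and its arguments are illustrative, not the real package API.
    library(readxl)

    sales <- readxl::read_excel("sales_by_region.xlsx")  # first column: date, one column per series
    series_list <- as.list(sales[-1])                    # drop the date column, one element per series

    faas_submit(
      series       = series_list,               # all series go in a single request
      dates        = sales$date,
      horizon      = 12,                        # forecast 12 periods ahead
      access_token = Sys.getenv("FAAS_TOKEN")   # hypothetical auth handling
    )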

The application itself was initially built entirely in R (from the APIs, using Plumber, to the front end, using Shiny) and ran as a monolith, using Virtual Machines on Google Cloud, each running one project at a time. This was not a big issue at first, while most users came through the front end. But with package access allowing users to send large amounts of data, and an ever-growing number of users, our ability to run as many jobs as possible in parallel, and cost-efficiently, became a major priority.
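
To give a sense of that original stack: Plumber turns annotated R functions into HTTP endpoints, so the API layer of the monolith was essentially a set of functions like the minimal sketch below. The route and the forecast_project() helper are illustrative, not our production code.

    # plumber.R -- minimal sketch of a Plumber endpoint in the spirit of the old monolith.
    # The route and the forecast_project() helper are illustrative only.

    #* Run a forecast for a submitted project
    #* @post /forecast
    function(req) {
      body <- jsonlite::fromJSON(req$postBody)   # parse the submitted payload
      forecast_project(body)                     # hypothetical modelling entry point
    }

    # Served with: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)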

Rethinking the architecture

First, we needed to take a wider look at the application from an architectural point of view. It had grown organically, from a common pipeline that ran on local machines for consultancy tasks to being lifted and shifted to the cloud as it was, using a third-party package to interface with GCP. We knew that the orchestration itself would move away from R, but our first crossroads came when we had to decide whether it was worth keeping R at all or refactoring the pipeline in a more cloud-friendly language, such as Python.

The ever-heated and, personally, long overdue debate between data science languages was in front of us, and the answer became clearer as we started to put the pros and cons side by side:

  • Most of our data scientists currently come from an R background;
  • We had a battle-tested, reliable pipeline built on some R-specific libraries that would need major refactoring;
  • The refactoring process alone could take even longer than the architectural change itself.

Therefore, we settled on keeping R for the time being, but the new architecture would have to handle a mixed pipeline, since we planned to add Python in the near future.

Cleaning the house

The first problem we tackled was optimizing the Docker images we were using. At times, the existing image took up to 3 hours to build, which was not suitable for the CI/CD pipeline we had in mind.

To ensure only the necessary packages were included, we started building it from scratch on the alpine-based image from r-minimal. Alas, creating the image was not so trivial with R after all, as many package versions in our environment conflicted. We had to install, remove and reinstall some packages, and we used multi-stage builds, keeping some intermediate images for other applications.
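
As a rough sketch of the pattern (the R version, package names and paths below are illustrative examples, not our actual dependency list), a build following the r-minimal approach uses its installr helper in a builder stage, with -d pulling in and later discarding the system build dependencies, and a slim final stage that copies only the resulting R library:

    # Illustrative multi-stage build in the spirit of our image; packages and paths are examples.
    FROM rhub/r-minimal:4.1.2 AS builder
    # installr ships with r-minimal: -d adds (and later removes) build-time compilers,
    # -t adds extra build-time system packages
    RUN installr -d -t "libsodium-dev curl-dev linux-headers" plumber jsonlite

    FROM rhub/r-minimal:4.1.2
    # runtime shared libraries that the packages above link against
    RUN apk add --no-cache libsodium libcurl
    # copy only the installed R packages from the builder stage
    COPY --from=builder /usr/local/lib/R/library /usr/local/lib/R/library
    COPY pipeline/ /app/
    CMD ["Rscript", "/app/run_models.R"]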

The final result: we brought the build time down from 3 hours to 30 seconds.

Architecture 1.5

Now that we had the right image, we started tackling the architecture itself. To break the monolith and gain decoupling and flexibility in retry strategies, we moved everything outside of modelling out of R and created what we call the decoder step.

The decoder is a preliminary step in which the input is split up by series, each series being sent to a different workflow to run in parallel. It made it easier to know how many time series we would have to forecast, so we could scale our infrastructure accordingly, and it kept larger requests from taking as long as before.

It is written in Python: it receives the encoded request body, decodes it (hence the name) and outputs multiple JSON files that move forward through the pipeline as input to the R modelling step.
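
The contract between the two steps is therefore simple: each R modelling run reads one decoded JSON file, fits its models and writes the results back out. A minimal sketch of that shape is shown below; the field names, file arguments and the auto.arima model are illustrative stand-ins, not our actual modelling code.

    # Sketch of the R modelling step consuming one decoder output.
    # File layout, field names and the model used are illustrative only.
    library(jsonlite)
    library(forecast)

    args  <- commandArgs(trailingOnly = TRUE)
    input <- jsonlite::fromJSON(args[1])          # one JSON file = one time series

    y   <- ts(input$values, frequency = input$frequency)
    fit <- forecast::auto.arima(y)                # stand-in for the real model selection
    fc  <- forecast::forecast(fit, h = input$horizon)

    jsonlite::write_json(
      list(series_id = input$series_id,
           mean      = as.numeric(fc$mean),
           lower     = as.numeric(fc$lower[, "95%"]),
           upper     = as.numeric(fc$upper[, "95%"])),
      path = args[2]
    )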

With the first concept of the pipeline ready, we went on to implement it as an API structure on Google Cloud Run, which greatly improved performance and was quick to implement and deploy. However, it proved unreliable in terms of cost predictability from the get-go, so we started looking for a different approach and decided to pursue a more stable, state-of-the-art solution on Kubernetes.

Moving to Argo

Envisioning a microservices view of the pipeline, and having in mind further breakdowns of it and the inclusion of multiple programming languages, we needed an agnostic orchestrator that could handle our initial R pipeline and keep being improved and developed. We also needed resiliency and monitoring, along with cost efficiency and scalability.

Taking these factors into consideration, we chose to go with Argo Workflows (and Argo Events with it). It checked all the boxes we needed, and then some:

  • Language agnostic (which means it can run R);
  • Cloud agnostic (in case we need to run processes elsewhere);
  • Has an active and helpful community;
  • Being a CNCF-hosted project under the Linux Foundation gave us further confidence and reassured our stakeholders that Argo was the right solution;
  • And, of course, Argo can run all of our tailor-made Docker images in an isolated ecosystem.

Since we had an eager team ready to try and learn how to configure and deploy our clusters inside Google Kubernetes Engine, we started working on the implementation.

We got a lot of help from the Argo community during the process, which allowed us to roll out the initial version of the new infrastructure in a short amount of time. Both the Slack channel and the GitHub repo provided plenty of information and answers, and the community was eager to help, which was another great upside of choosing Argo.

How it looks nowadays

The following picture should give a better idea of how we are currently running our application inside GCP.

KubeFaaS Architecture

We have multiple customers using the application, both external users and our own team of applied data scientists, who use the results in consultancy projects. They send their jobs through an API request; the API lives in a separate project and sends the content to a bucket.

A Pub/Sub topic then triggers a Cloud Function that forwards the event to Argo Events, starting the pipeline. The project has no external access at any point and is only reached through service accounts.

Once a run is completed, the output files go to an output bucket, where our applications can consume them or the user can download the results directly through our API.

What’s coming next

As it stands today, the R block is still monolithic, a smaller monolith, but still far from what we deem ideal. Our near-term roadmap consists of breaking this part of the pipeline into smaller steps, using Argo’s modularity and language agnosticism to include different languages (the first planned addition is a set of models implemented in Python), and putting it all together at the end. We are also in a constant improvement process, tweaking and testing new configurations to further optimize the cost-efficiency of the application.
