Running Big Data Pipelines on Kubernetes without Crashing the Cluster

Kubernetes Community Days Munich 2022
2022-10-12, Munich

Slides

Abstract

Loading data into a central Data Lake or Data Warehouse typically involves computationally demanding batch jobs. As Kubernetes becomes the primary infrastructure platform for many companies, there is increasing demand to migrate these workloads to Kubernetes from, for example, Hadoop clusters. However, support for batch processing in Kubernetes is still relatively immature. In particular, heavy batch jobs can threaten the stability of the whole cluster.

In this talk, I will give a high-level introduction to data pipelines for infrastructure-focused engineers, share lessons learned from running heavy batch jobs on Kubernetes, and discuss how to build cloud-native data pipelines using Argo Workflows and GCP’s Spark on Kubernetes Operator.