72 changes: 72 additions & 0 deletions src/contents/blogs/enterprise-three-layer-architecture/index.md
@@ -0,0 +1,72 @@
---
title: "From 37 million monthly executions: how enterprise teams structure Kestra at scale"
description: "How enterprise teams structure Kestra for scale, isolation, and data residency. Covers the tenant vs. namespace decision framework and worker group patterns for regulated and hybrid deployments."
date: 2026-06-10T09:00:00
category: Solutions
author:
  name: Parham Parvizi
  image: pparvizi
  linkedin: https://www.linkedin.com/in/xdatanomad/
  role: Solutions Engineer
image: ./main.png
---

[Leroy Merlin](../2022-02-22-leroy-merlin-usage-kestra/index.md) runs 37 million workflow executions a month on Kestra. They started on Airflow. Flows ran 20x slower than expected, and one bad task could take down the entire cluster. Teams couldn't ship independently without risking the platform everyone else relied on.

What they built wasn't just a migration. It was a different architecture, one designed from the start to let teams operate independently at scale. Three layers, each solving a distinct problem: how do you scale execution, how do you isolate teams, and where does the work actually run?

AI workloads are adding GPU routing and data residency requirements that most orchestration tools weren't designed for. The architecture choices made at deployment time determine whether those requirements get met through configuration or through rebuilding.

I work with enterprise teams on Kestra architecture daily. The mental model I keep coming back to has three layers.

## Layer 1: scaling execution

Kestra is four components: a webserver, a scheduler, an executor, and workers.

The webserver is the front door: all UI traffic and API access flow through it. The scheduler is the trigger watchdog: it monitors everything that needs to be kicked off, whether that's a cron schedule, a webhook call, or a Kafka message arriving in a queue. The executor is the brain: it coordinates workflow execution across workers. Workers are where tasks actually run, and they are the only components that touch external systems like databases, APIs, or cloud services.
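To make the division of labor concrete, here is a minimal flow sketch. The flow id, namespace, and cron expression are placeholders made up for illustration. The scheduler evaluates the `triggers` block, the executor decides which task is ready, and a worker runs it.

```yaml
id: nightly_report            # hypothetical flow id
namespace: company.data       # hypothetical namespace

tasks:
  # Dispatched by the executor, run by a worker
  - id: announce
    type: io.kestra.plugin.core.log.Log
    message: "Execution {{ execution.id }} started"

triggers:
  # Evaluated by the scheduler, which creates an execution when the cron matches
  - id: every_night
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 2 * * *"
```

The webserver is involved only when someone creates or inspects this flow through the UI or API.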

All four components are active-active. You can run multiple instances of every component simultaneously. Two executors coordinate through the queue backend. Two workers each pick up different tasks. If any instance goes down, the others continue without interruption.

Most production clusters deploy on [Kubernetes](../../docs/02.installation/03.kubernetes/index.md), where each component is a pod. Kubernetes handles the failover and recovery mechanics; the component design gives you the scaling levers. Executors and workers are the ones you scale aggressively with load. Schedulers and webservers typically run at one or two replicas each. For very large deployments, the PostgreSQL backend can be upgraded to Kafka plus Elasticsearch, which handles higher sustained throughput without changing the component model.

## Layer 2: isolating teams

Horizontal scaling answers the performance question. It doesn't answer the organizational one: how do 200 engineers share a platform without stepping on each other?

Kestra handles this through two constructs: [namespaces](../../docs/07.enterprise/02.governance/07.namespace-management/index.md) and [tenants](../../docs/07.enterprise/02.governance/tenants/index.md).

**Namespaces** are project spaces within a cluster. Each one has its own workflows, scripts, secrets, and KV store for temporary values. [RBAC](../../docs/07.enterprise/03.auth/rbac/index.md) is enforced at the namespace level, so you can give one team full access to their namespace without exposing anything else. Namespaces are hierarchical: a root namespace like `company` can have children like `company.data` or `company.infra`, and child namespaces inherit configuration from their parents while being able to override it locally.
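As an illustration, a flow declares its namespace in a single line. The sketch below assumes a hypothetical flow in the child namespace `company.data` and a hypothetical KV key, `last_export_date`, that already exists in that namespace.

```yaml
id: daily_export              # hypothetical flow id
namespace: company.data       # child of the root namespace `company`

tasks:
  - id: show_marker
    type: io.kestra.plugin.core.log.Log
    # kv() looks the key up in this flow's namespace unless told otherwise
    message: "Last export: {{ kv('last_export_date') }}"
```

A team with access only to `company.infra` would not see this flow or its KV entries.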

**Tenants** go further. A tenant is a completely isolated environment within the cluster: separate user management, separate roles, separate admin users. Someone logged into Tenant A has no visibility into Tenant B, even on shared infrastructure. Each tenant can have its own tenant admins who manage users, groups, and custom RBAC within that tenant independently.

The question that comes up in almost every enterprise deployment: when do you use tenants instead of namespaces?

It comes down to whether groups need separate user management. If different business units need their own user bases, their own admin control, and zero visibility into each other's work, tenants are the answer. If they're within the same organization and the requirement is project-level separation (different teams, different workflows, different access levels), namespaces are sufficient.

Think of it like a building: tenants are floors, namespaces are apartments on those floors. A floor has its own access system and its own admin, with no visibility into what's happening on other floors. Apartments divide the space within a floor, each with its own locks and boundaries, but all still under the same building management. Tenants create that floor-level separation. Namespaces organize the space within it.

Here's how tenants and namespaces relate to each other within a single Kestra cluster.

![Tenants vs. Namespaces in Kestra](./kestra-tenant-namespace-diagram.png)

## Layer 3: controlling where work runs

For many teams, "on Kubernetes" is sufficient. For regulated industries, global deployments, and hybrid architectures, location is a real constraint. [Worker groups](../../docs/07.enterprise/04.scalability/worker-group/index.md) are how Kestra handles it.

A worker group is a named logical grouping of workers. You install Kestra workers wherever execution needs to happen (a different cloud region, a different cloud VPC, an on-premises cluster), assign them to a named group, and Kestra routes tasks to that group by name. Workers within the group can be scaled independently. If one worker goes down, another in the same group picks up the task. The workflow doesn't know or care which worker ran it.

In practice this unlocks three patterns: routing tasks to workers in specific cloud regions to satisfy data residency requirements, running workers on-premises while keeping the control plane in the cloud, and routing GPU-intensive ML tasks to workers on specialized hardware while standard tasks run on standard compute. For a detailed breakdown of how routing, failover, and replica counts work within worker groups, see [The executor/worker split: the design decision behind Kestra's distributed architecture](../kestra-executor-worker-architecture/index.md).

Worker groups can be pinned to tenants or namespaces. A tenant can have a dedicated worker group so its workflows only run on its own workers. A namespace within a tenant can have its own group if it has different resource requirements. For regulated industries, the minimum viable pattern is one worker group per tenant: each tenant gets its own dedicated execution environment.
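A minimal sketch of task-level routing, assuming a worker group with the key `gpu` has already been created and has workers attached. The flow id, namespace, and group key are hypothetical; `workerGroup.key` is the task property that names the target group.

```yaml
id: train_model               # hypothetical flow id
namespace: company.ml         # hypothetical namespace

tasks:
  # No workerGroup set: runs on the default workers
  - id: prepare
    type: io.kestra.plugin.core.log.Log
    message: "Preparing training data"

  # Routed by name to the workers in the `gpu` group
  - id: train
    type: io.kestra.plugin.scripts.python.Script
    workerGroup:
      key: gpu                # hypothetical group key
    script: |
      print("training on GPU workers")
```

Only the `train` task is pinned, so standard compute handles everything else in the flow.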

Kestra licenses worker groups, not individual workers. Scaling workers within an existing group to handle more throughput doesn't affect your licensing. Adding a new group does, because each group represents a distinct operational boundary.

## How the layers compose

The three layers answer independent questions, but they combine. A regulated industry deployment might look like this: a distributed Kubernetes cluster divided into two tenants for two business units, each tenant with its own dedicated worker group, workers in each group deployed in separate cloud regions to satisfy data residency requirements.

Leroy Merlin's 37 million monthly executions needed the infrastructure layer to hold. But raw scale wasn't what made the migration work. Namespace isolation is what let 200 engineers ship independently without any one team putting the platform at risk. The physical layer is what would let them extend to new regions or satisfy data residency requirements without rebuilding.

The layers are designed to be independent. A single-tenant, single-region deployment can still use dozens of namespaces. A multi-tenant, multi-region deployment can have a single namespace per tenant. When a compliance requirement adds a new region, you add a worker group. When a new business unit needs its own environment, you add a tenant. The work is configuration, not rebuilding.

Kestra is open source. You can [get started in minutes](../../docs/02.installation/index.md) with Docker: one command, and your first workflow is running in under five minutes. If you're scoping a deployment across regions, business units, or regulated infrastructure, [book a demo](/demo) and we can work through the architecture together.
96 changes: 96 additions & 0 deletions src/contents/blogs/kestra-executor-worker-architecture/index.md
@@ -0,0 +1,96 @@
---
title: "The executor/worker split: the design decision behind Kestra's distributed architecture"
description: "The architectural reasoning behind Kestra's executor/worker separation, and what it means for replica counts, backend selection, and extending execution across regions, on-premises clusters, and GPU hardware."
date: 2026-06-15T09:00:00
category: Engineering
author:
  name: Parham Parvizi
  image: pparvizi
  linkedin: https://www.linkedin.com/in/xdatanomad/
  role: Solutions Engineer
image: ./main.png
---

When I deploy Kestra with enterprise customers, the first thing I walk them through is the component model. Understanding what each piece does, and what it doesn't do, is what makes every subsequent decision clear: how many replicas you need, when to upgrade your backend, how to design worker groups for a multi-region or hybrid deployment. As deployments grow to span regulated zones, multiple clouds, and GPU hardware for AI workloads, those decisions get harder to undo.

## Why the executor and worker are separate

Kestra has four components, and the most important design decision is the split between the executor and the worker.

**The webserver** is the entry point for the cluster. It handles all UI traffic and all API requests. It's stateless: you can run multiple replicas behind a load balancer and they don't coordinate with each other. Scale it when API traffic is high; most deployments run one or two.

**The scheduler** watches triggers. Cron schedules, webhooks, Kafka messages, file arrivals: the scheduler monitors all of them and fires an execution event when a trigger condition is met. It uses distributed locking to prevent double-firing, so multiple scheduler replicas give you failover without conflicting. One or two replicas is standard.

**The executor** coordinates workflow execution. It receives an execution event, walks the workflow DAG, decides which tasks are ready to run, and dispatches them to the queue. When a task completes, the executor processes the result and determines what runs next. It handles retries, error paths, and parallel branches. The executor never makes external network calls; it only reads and writes to the internal queue and repository.

**Workers** run tasks. A worker picks a task off the queue, executes it (querying a database, calling an API, running a Python script), and puts the result back. Workers are the only components with access to external systems.

Because the executor has no external dependencies, its failures are always infrastructure failures, never application failures. Because workers are isolated to execution, you can scale them, replace them, and move them to different infrastructure without touching anything else. More throughput means more workers.
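The split shows up directly in a flow definition. In the sketch below, the ids, namespace, and URL are placeholders. Following the description above, the `Parallel` branching, the ordering of `done` after both branches, and the retry policy are coordination concerns, while the HTTP call itself is the only step that reaches an external system.

```yaml
id: fan_out_example           # hypothetical flow id
namespace: company.data       # hypothetical namespace

tasks:
  - id: fan_out
    type: io.kestra.plugin.core.flow.Parallel
    tasks:
      # External call: executed on a worker
      - id: call_api
        type: io.kestra.plugin.core.http.Request
        uri: https://example.com/health   # placeholder URL
        retry:
          type: constant
          interval: PT10S
          maxAttempt: 3
      - id: note
        type: io.kestra.plugin.core.log.Log
        message: "Runs alongside call_api"

  # Becomes ready only after both branches finish
  - id: done
    type: io.kestra.plugin.core.log.Log
    message: "Both branches finished"
```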

All four components are active-active. Multiple replicas of any component run simultaneously without conflict. Kubernetes handles pod lifecycle; the queue backend handles coordination.

The diagram below shows how the four components relate, what sits between them, and where remote worker groups fit in.

![Kestra core architectural components: webserver, scheduler, executor, and worker, with essential external services and optional remote worker groups](./kestra-architecture-diagram.png)

## Deploying on Kubernetes

Most production Kestra deployments run on [Kubernetes](../../docs/02.installation/03.kubernetes/index.md). Kestra provides Helm charts for the full installation, including a `values.yaml` where you configure the deployment mode and replica counts.

A distributed deployment looks like this in `values.yaml`:

```yaml
deploymentMode: distributed

webserver:
  replicaCount: 1

scheduler:
  replicaCount: 1

executor:
  replicaCount: 2

worker:
  replicaCount: 2
```

Each component deploys as a separate pod. The executor and worker counts are what you tune as load grows. The `distributed` deployment mode is distinct from standalone mode, which runs all components in a single process. Standalone works for development and evaluation. Production uses distributed so components can be scaled and replaced independently.

## What to scale and when

Crédit Agricole's CAGIP team runs Kestra as a shared orchestration backbone across 100+ managed clusters and 7 data teams, self-hosted on their private cloud. Getting the component sizing right is what made that possible at that scale.

The executor and worker are the components you scale under load. If workflows are slow to complete even though individual tasks finish quickly, you need more executors. If tasks themselves are slow to start or finish, you need more workers.

Very large deployments run tens or hundreds of workers. The executor handles coordination logic, which is CPU- and memory-intensive at high parallelism, but it doesn't need to grow in step with the worker count: a few executor replicas can handle a large worker fleet.
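As a rough illustration of that asymmetry, here is the `values.yaml` from the previous section with only the two scaling levers changed. The replica counts are arbitrary examples, not sizing recommendations.

```yaml
deploymentMode: distributed

webserver:
  replicaCount: 1

scheduler:
  replicaCount: 1

# Coordination: grows slowly
executor:
  replicaCount: 3

# Task execution: grows with load
worker:
  replicaCount: 20
```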

For higher sustained throughput, the PostgreSQL backend can be upgraded to Kafka plus Elasticsearch. Kafka takes over as the queue; Elasticsearch replaces PostgreSQL as the repository for execution history. The component model stays identical: nothing in how you configure workers, executors, or schedulers changes. Only the backend plumbing changes.
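A sketch of what that backend switch can look like in the Kestra configuration. Treat the key names as assumptions drawn from my reading of the configuration reference, and check them against the documentation for your version; the hostnames are placeholders.

```yaml
kestra:
  queue:
    type: kafka                 # Kafka takes over as the queue
  repository:
    type: elasticsearch         # Elasticsearch stores execution history

  kafka:
    client:
      properties:
        bootstrap.servers: "kafka.example.internal:9092"        # placeholder host

  elasticsearch:
    client:
      http-hosts: "http://elasticsearch.example.internal:9200"  # placeholder host
```

The replica settings for the webserver, scheduler, executor, and worker are untouched by this change.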

## Extending execution beyond the cluster

A [worker group](../../docs/07.enterprise/04.scalability/worker-group/index.md) is a named logical grouping of workers. You install Kestra workers on whatever infrastructure you need (another cloud region, an on-premises data center, a cluster with specific hardware), assign them to a named group, and Kestra routes tasks to that group. Tasks in your workflow definition specify which group they should run on. Workers in that group pick them up; workers in other groups don't see them.

Within a worker group, scaling and failover work the same way they do in the main cluster. If a worker goes down while running a task, the executor detects the failure and retries the task on another available worker in the same group. Running multiple workers per group is standard practice: one worker gives you routing but no redundancy.

Three patterns come up most often in production:

**Multi-region execution.** Tasks that need to run in an EU region route to workers installed there. Tasks that need to run in the US route to US workers. A single workflow definition spans both. Data residency requirements are enforced at the routing level, not the workflow level.

**Hybrid and on-premises.** The main Kestra cluster runs in the cloud. Workers run on-premises, connecting out to the cluster. Customers get a unified control plane and a unified workflow definition; execution happens on infrastructure they control. No inbound network access to the on-premises environment is required.

**Heterogeneous hardware.** Tasks that require GPU are routed to workers installed on GPU-enabled machines. Standard tasks go to standard workers. You provision specialized hardware only where workflows need it.
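The multi-region pattern can be sketched as a single flow whose tasks name different groups. The group keys `eu-west` and `us-east`, the flow id, and the namespace are hypothetical; the sketch assumes both groups exist and have workers running in the corresponding regions.

```yaml
id: regional_processing       # hypothetical flow id
namespace: company.data       # hypothetical namespace

tasks:
  # Picked up only by workers in the EU group
  - id: process_eu_records
    type: io.kestra.plugin.core.log.Log
    workerGroup:
      key: eu-west            # hypothetical group key
    message: "Processing EU records inside the EU"

  # Picked up only by workers in the US group
  - id: process_us_records
    type: io.kestra.plugin.core.log.Log
    workerGroup:
      key: us-east            # hypothetical group key
    message: "Processing US records inside the US"
```

The same shape covers the other two patterns: swap the group key for one backed by on-premises workers or by GPU machines.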

Worker groups can be pinned to [tenants](../../docs/07.enterprise/02.governance/tenants/index.md) or [namespaces](../../docs/07.enterprise/02.governance/07.namespace-management/index.md). A tenant can have a dedicated worker group so its workflows only run on its own workers. A namespace within a tenant can have its own dedicated group if it needs different hardware or network placement. The typical pattern for regulated industries: one worker group per tenant at minimum, additional groups as individual namespaces require them.

Licensing is per worker group, not per worker. Adding workers within an existing group to handle more load doesn't affect licensing. Adding a new group (a new region, a new on-premises site, a new isolated environment) does.

## Where to start

Most teams start with a single worker group, PostgreSQL backend, and two or three workers. That handles a wide range of production workloads. Add workers when task throughput is the bottleneck. Add executor replicas when coordination is. Upgrade to Kafka plus Elasticsearch when sustained volume makes PostgreSQL the constraint.

Worker groups come in when the deployment needs to span locations: a second cloud region, an on-premises cluster, hardware with specific requirements.

The component model stays the same at any scale. What changes is how many of each component you run, where you run the workers, and which backend you've wired in underneath.

Kestra is open source. You can [get started in minutes](../../docs/02.installation/index.md) with Docker: one command, and your first workflow is running in under five minutes. If you're scoping a deployment across regions, business units, or regulated infrastructure, [book a demo](/demo) and we can work through the architecture together.
2 changes: 1 addition & 1 deletion src/contents/blogs/yaml-vs-python-workflow/index.md
@@ -3,7 +3,7 @@ title: "YAML vs Python Workflows: Which Is Better for Orchestration?"
description: "Python excels at execution logic. YAML excels at defining workflows. Here's a practical breakdown of when to reach for each, and how modern orchestrators let you use both."
metaTitle: "YAML vs Python Workflows: Orchestration Comparison"
metaDescription: "Compare YAML vs Python workflows for orchestration. Understand their strengths, weaknesses, and when to choose each for defining and executing modern, scalable workflows."
-date: 2026-06-15T13:00:00
+date: 2026-06-11T13:00:00
category: Tutorials
tag: "orchestration"
author: