Building Self-Healing Kubernetes Systems with Operators: A Complete Guide
Self-healing systems in Kubernetes rely on the Operator pattern, where Custom Resources (CRs) define desired state and Controllers reconcile actual state through continuous observation and remediation loops.
Architecture of Self-Healing
The Operator pattern extends Kubernetes by introducing domain-specific knowledge into the control plane. Custom Resource Definitions (CRDs) define the API schema, while the Controller implements the reconciliation logic that maintains system health.
Core Components
- Custom Resource Definition (CRD): Schema defining the desired state structure
- Custom Resource (CR): Instance of the CRD containing specific configuration
- Controller: Infinite loop observing current state and taking corrective actions
- Reconciler: Function implementing the Observe-Analyze-Act pattern
Defining the Custom Resource
The CRD schema must separate spec (user-defined desired state) from status (controller-observed actual state). The status subresource must be explicitly declared to enable status updates.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: healingservices.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicaCount:
                  type: integer
                  minimum: 1
                healthThreshold:
                  type: integer
                  default: 3
                failureThreshold:
                  type: integer
                  default: 5
            status:
              type: object
              properties:
                observedGeneration:
                  type: integer
                healthyReplicas:
                  type: integer
                lastHealTime:
                  type: string
                  format: date-time
      additionalPrinterColumns:
        - name: Replicas
          type: integer
          jsonPath: .spec.replicaCount
        - name: Healthy
          type: integer
          jsonPath: .status.healthyReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: healingservices
    singular: healingservice
    kind: HealingService
The status subresource lets the controller report observed state without touching the user-owned spec, which remains the source of truth for desired state. Without the subresources: status: {} declaration, Status().Update() calls fail because the /status endpoint does not exist (the API server returns 404 Not Found). The additionalPrinterColumns field enriches kubectl get output with meaningful status information.
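The controller and webhook code below assume Go API types that mirror this schema. Here is a minimal sketch of api/v1/healingservice_types.go, with field names inferred from the CRD and the reconciler; integer fields follow the kubebuilder int32/int64 convention, which is why the reconciler later converts with int() when comparing against pod counts:

package v1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HealingServiceSpec is the user-defined desired state.
type HealingServiceSpec struct {
    // +kubebuilder:validation:Minimum=1
    ReplicaCount int32 `json:"replicaCount"`

    // +kubebuilder:default=3
    HealthThreshold int32 `json:"healthThreshold,omitempty"`

    // +kubebuilder:default=5
    FailureThreshold int32 `json:"failureThreshold,omitempty"`
}

// HealingServiceStatus is written by the controller through the status subresource.
type HealingServiceStatus struct {
    ObservedGeneration int64       `json:"observedGeneration,omitempty"`
    HealthyReplicas    int32       `json:"healthyReplicas,omitempty"`
    LastHealTime       metav1.Time `json:"lastHealTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// HealingService is the Schema for the healingservices API.
type HealingService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   HealingServiceSpec   `json:"spec,omitempty"`
    Status HealingServiceStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// HealingServiceList contains a list of HealingService.
type HealingServiceList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []HealingService `json:"items"`
}

func init() {
    // SchemeBuilder comes from the kubebuilder-generated groupversion_info.go.
    SchemeBuilder.Register(&HealingService{}, &HealingServiceList{})
}

In a kubebuilder project, controller-gen derives the CRD validation and defaults from these markers via make manifests; the printer columns shown above would additionally need +kubebuilder:printcolumn markers on the HealingService type.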
Controller Manager Setup
Before implementing the reconciler, initialize the controller manager with production-ready configuration including leader election and metrics exposure.
package main

import (
    "flag"
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    utilruntime "k8s.io/apimachinery/pkg/util/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    // Load all client auth plugins (GCP, OIDC, etc.) for clusters that need them.
    _ "k8s.io/client-go/plugin/pkg/client/auth"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
    "sigs.k8s.io/controller-runtime/pkg/metrics/server"

    examplev1 "example.com/healing-operator/api/v1"
    "example.com/healing-operator/internal/controller"
)

var (
    scheme   = runtime.NewScheme()
    setupLog = ctrl.Log.WithName("setup")
)

func init() {
    utilruntime.Must(clientgoscheme.AddToScheme(scheme))
    utilruntime.Must(examplev1.AddToScheme(scheme))
}

func main() {
    var metricsAddr string
    var enableLeaderElection bool
    var probeAddr string
    flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
    flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
    flag.BoolVar(&enableLeaderElection, "leader-elect", false,
        "Enable leader election for controller manager. "+
            "Enabling this will ensure there is only one active controller manager.")
    opts := zap.Options{
        Development: true,
    }
    opts.BindFlags(flag.CommandLine)
    flag.Parse()

    ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:                 scheme,
        Metrics:                server.Options{BindAddress: metricsAddr},
        HealthProbeBindAddress: probeAddr,
        LeaderElection:         enableLeaderElection,
        LeaderElectionID:       "healing-operator.example.com",
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }

    if err = (&controller.HealingServiceReconciler{
        Client:   mgr.GetClient(),
        Scheme:   mgr.GetScheme(),
        Recorder: mgr.GetEventRecorderFor("healingservice-controller"),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "HealingService")
        os.Exit(1)
    }

    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up health check")
        os.Exit(1)
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up ready check")
        os.Exit(1)
    }

    setupLog.Info("starting manager")
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}
Leader election ensures only one controller instance runs in HA deployments, preventing duplicate reconciliation actions. The metrics server exposes Prometheus-compatible metrics at /metrics for observability.
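If the default lease timings are too aggressive for your cluster, they can be tuned on the same ctrl.Options. A hedged sketch of a variant of the NewManager call above (values are illustrative, not recommendations; controller-runtime defaults to a 15s lease, 10s renew deadline, and 2s retry period, and this variant requires "time" in the import block):

    // Longer lease timings reduce API-server churn at the cost of slower failover.
    leaseDuration := 30 * time.Second
    renewDeadline := 20 * time.Second
    retryPeriod := 5 * time.Second

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:                 scheme,
        Metrics:                server.Options{BindAddress: metricsAddr},
        HealthProbeBindAddress: probeAddr,
        LeaderElection:         enableLeaderElection,
        LeaderElectionID:       "healing-operator.example.com",
        LeaseDuration:          &leaseDuration,
        RenewDeadline:          &renewDeadline,
        RetryPeriod:            &retryPeriod,
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }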
The Reconcile Loop Logic
The reconciliation function implements the core self-healing mechanism: observe current state, compare with desired state, and take corrective actions.
package controller

import (
    "context"
    "fmt"
    "sort"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/log"

    examplev1 "example.com/healing-operator/api/v1"
)

// HealingServiceReconciler reconciles a HealingService object.
type HealingServiceReconciler struct {
    client.Client
    Scheme   *runtime.Scheme
    Recorder record.EventRecorder
}

//+kubebuilder:rbac:groups=example.com,resources=healingservices,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=example.com,resources=healingservices/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=example.com,resources=healingservices/finalizers,verbs=update
//+kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=core,resources=events,verbs=create;patch

const (
    requeueDelay = 5 * time.Second
    healDelay    = 30 * time.Second
)
func (r *HealingServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // 1. Fetch the HealingService instance
    service := &examplev1.HealingService{}
    if err := r.Get(ctx, req.NamespacedName, service); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Observe current state - fetch managed pods
    podList := &corev1.PodList{}
    if err := r.List(ctx, podList, client.InNamespace(req.Namespace),
        client.MatchingLabels{"app": service.Name}); err != nil {
        return ctrl.Result{}, err
    }

    // 3. Analyze health status
    healthyCount := 0
    for _, pod := range podList.Items {
        if isPodHealthy(&pod) {
            healthyCount++
        }
    }

    // 4. Scale to desired replica count (both up and down)
    desiredReplicas := int(service.Spec.ReplicaCount) // spec integers are int32; convert once for comparisons with len()
    currentReplicas := len(podList.Items)

    if currentReplicas < desiredReplicas {
        podsNeeded := desiredReplicas - currentReplicas
        log.Info("Scaling up", "current", currentReplicas, "desired", desiredReplicas)
        for i := 0; i < podsNeeded; i++ {
            if err := r.createNewPod(ctx, service); err != nil {
                if !errors.IsAlreadyExists(err) {
                    log.Error(err, "Failed to create pod")
                    return ctrl.Result{RequeueAfter: requeueDelay}, err
                }
            }
        }
        return ctrl.Result{RequeueAfter: requeueDelay}, nil
    }

    if currentReplicas > desiredReplicas {
        podsToDelete := currentReplicas - desiredReplicas
        log.Info("Scaling down", "current", currentReplicas, "desired", desiredReplicas)

        // Separate unhealthy and healthy pods
        var unhealthyPods, healthyPods []corev1.Pod
        for _, pod := range podList.Items {
            if isPodHealthy(&pod) {
                healthyPods = append(healthyPods, pod)
            } else {
                unhealthyPods = append(unhealthyPods, pod)
            }
        }

        // Sort healthy pods by creation time (oldest first for stable scale-down)
        sort.Slice(healthyPods, func(i, j int) bool {
            return healthyPods[i].CreationTimestamp.Before(&healthyPods[j].CreationTimestamp)
        })

        // Delete unhealthy pods first, then healthy pods if needed
        deletedCount := 0
        for _, pod := range unhealthyPods {
            if deletedCount >= podsToDelete {
                break
            }
            if err := r.Delete(ctx, &pod); err != nil && !errors.IsNotFound(err) {
                log.Error(err, "Failed to delete pod", "pod", pod.Name)
                return ctrl.Result{RequeueAfter: requeueDelay}, err
            }
            deletedCount++
        }
        for _, pod := range healthyPods {
            if deletedCount >= podsToDelete {
                break
            }
            if err := r.Delete(ctx, &pod); err != nil && !errors.IsNotFound(err) {
                log.Error(err, "Failed to delete pod", "pod", pod.Name)
                return ctrl.Result{RequeueAfter: requeueDelay}, err
            }
            deletedCount++
        }
        return ctrl.Result{RequeueAfter: requeueDelay}, nil
    }

    // 5. Update status with conflict handling
    service.Status.HealthyReplicas = int32(healthyCount)
    service.Status.ObservedGeneration = service.Generation
    if err := r.Status().Update(ctx, service); err != nil {
        if errors.IsConflict(err) {
            log.Info("Conflict updating status, requeueing")
            return ctrl.Result{Requeue: true}, nil
        }
        return ctrl.Result{}, err
    }

    // 6. Act - trigger healing if below threshold
    if healthyCount < int(service.Spec.HealthThreshold) {
        log.Info("Triggering self-healing", "healthy", healthyCount, "required", service.Spec.HealthThreshold)

        // Delete unhealthy pods for recreation
        deletedCount := 0
        for _, pod := range podList.Items {
            if !isPodHealthy(&pod) {
                if err := r.Delete(ctx, &pod); err != nil {
                    if !errors.IsNotFound(err) {
                        log.Error(err, "Failed to delete unhealthy pod", "pod", pod.Name)
                        return ctrl.Result{RequeueAfter: requeueDelay}, err
                    }
                } else {
                    deletedCount++
                }
            }
        }

        if deletedCount > 0 {
            r.Recorder.Event(service, corev1.EventTypeNormal, "HealingTriggered",
                fmt.Sprintf("Deleted %d unhealthy pods for recreation", deletedCount))
            service.Status.LastHealTime = metav1.Now()
            if err := r.Status().Update(ctx, service); err != nil {
                if errors.IsConflict(err) {
                    return ctrl.Result{Requeue: true}, nil
                }
                return ctrl.Result{}, err
            }
            return ctrl.Result{RequeueAfter: healDelay}, nil
        }
    }

    return ctrl.Result{}, nil
}
func isPodHealthy(pod *corev1.Pod) bool {
    if pod.Status.Phase != corev1.PodRunning {
        return false
    }
    // Check PodReady condition
    for _, condition := range pod.Status.Conditions {
        if condition.Type == corev1.PodReady && condition.Status != corev1.ConditionTrue {
            return false
        }
    }
    // Check container states for various failure conditions
    for _, containerStatus := range pod.Status.ContainerStatuses {
        state := containerStatus.State
        if state.Waiting != nil {
            reason := state.Waiting.Reason
            if reason == "CrashLoopBackOff" ||
                reason == "ImagePullBackOff" ||
                reason == "ErrImagePull" {
                return false
            }
        }
        if state.Terminated != nil && state.Terminated.ExitCode != 0 {
            return false
        }
    }
    return true
}
func (r *HealingServiceReconciler) createNewPod(ctx context.Context, service *examplev1.HealingService) error {
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: fmt.Sprintf("%s-", service.Name),
            Namespace:    service.Namespace,
            Labels: map[string]string{
                "app": service.Name,
            },
        },
        Spec: corev1.PodSpec{
            Containers: []corev1.Container{
                {
                    Name:  "app",
                    Image: "nginx:latest",
                },
            },
        },
    }
    // Mark the HealingService as the controlling owner so Owns() watches fire for
    // pod changes and garbage collection removes pods when the CR is deleted.
    if err := controllerutil.SetControllerReference(service, pod, r.Scheme); err != nil {
        return err
    }
    return r.Create(ctx, pod)
}
func (r *HealingServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.HealingService{}).
        Owns(&corev1.Pod{}).
        Complete(r)
}
Idempotency is the key requirement: given the same observed state, repeated reconcile runs must converge on the same result without accumulating side effects. The RBAC markers grant the controller the permissions it needs to manage pods, publish events, and update the CR status. The isPodHealthy() function inspects container states to detect CrashLoopBackOff, ImagePullBackOff, ErrImagePull, and non-zero terminations. Scale-down deletes unhealthy pods first, then falls back to a stable selection of healthy pods (oldest first) to minimize disruption. Conflict handling keeps concurrent status updates from failing the reconcile.
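Because isPodHealthy() is a pure function of pod status, the healing conditions are easy to pin down with a table-driven unit test. A minimal sketch (test cases are illustrative):

package controller

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
)

func TestIsPodHealthy(t *testing.T) {
    cases := []struct {
        name string
        pod  corev1.Pod
        want bool
    }{
        {
            name: "running and ready",
            pod: corev1.Pod{Status: corev1.PodStatus{
                Phase: corev1.PodRunning,
                Conditions: []corev1.PodCondition{
                    {Type: corev1.PodReady, Status: corev1.ConditionTrue},
                },
            }},
            want: true,
        },
        {
            name: "pending pod is unhealthy",
            pod:  corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}},
            want: false,
        },
        {
            name: "crash-looping container is unhealthy",
            pod: corev1.Pod{Status: corev1.PodStatus{
                Phase: corev1.PodRunning,
                ContainerStatuses: []corev1.ContainerStatus{
                    {State: corev1.ContainerState{
                        Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
                    }},
                },
            }},
            want: false,
        },
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            if got := isPodHealthy(&tc.pod); got != tc.want {
                t.Errorf("isPodHealthy() = %v, want %v", got, tc.want)
            }
        })
    }
}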
Validation Webhook
Validation webhooks enforce constraints on CR creation and updates, preventing invalid configurations like negative replica counts. The webhook must return errors to reject invalid specs.
package webhook

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/util/validation/field"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"

    examplev1 "example.com/healing-operator/api/v1"
)

// HealingServiceCustomValidator implements admission.CustomValidator for the HealingService kind.
type HealingServiceCustomValidator struct{}

var _ admission.CustomValidator = &HealingServiceCustomValidator{}

func (v *HealingServiceCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    service, ok := obj.(*examplev1.HealingService)
    if !ok {
        return nil, fmt.Errorf("expected a HealingService object but got %T", obj)
    }

    var allWarnings admission.Warnings
    var allErrs field.ErrorList
    specPath := field.NewPath("spec")

    if service.Spec.ReplicaCount <= 0 {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("replicaCount"),
            service.Spec.ReplicaCount,
            "replicaCount must be greater than 0",
        ))
    }
    if service.Spec.HealthThreshold < 1 {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("healthThreshold"),
            service.Spec.HealthThreshold,
            "healthThreshold must be at least 1",
        ))
    }
    if service.Spec.HealthThreshold > service.Spec.ReplicaCount {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("healthThreshold"),
            service.Spec.HealthThreshold,
            "healthThreshold cannot exceed replicaCount",
        ))
    }

    if len(allErrs) > 0 {
        return allWarnings, apierrors.NewInvalid(
            examplev1.GroupVersion.WithKind("HealingService").GroupKind(),
            service.Name, allErrs)
    }
    return allWarnings, nil
}

func (v *HealingServiceCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
    return v.ValidateCreate(ctx, newObj)
}

func (v *HealingServiceCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    return nil, nil
}
//+kubebuilder:webhook:path=/validate-example-com-v1-healingservice,mutating=false,failurePolicy=fail,sideEffects=None,groups=example.com,resources=healingservices,verbs=create;update,versions=v1,name=vhealingservice.kb.io,admissionReviewVersions=v1
The webhook returns admission.Warnings and an error. A non-nil error causes the admission request to be denied, so invalid configurations are never persisted; the field.ErrorList yields structured messages that kubectl surfaces when an apply is rejected.
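For the webhook to actually be served, the validator must also be registered with the manager's webhook server. A minimal sketch, assuming the HealingServiceCustomValidator type above (the file and function names are illustrative):

// internal/webhook/healingservice_webhook_setup.go (illustrative file name)
package webhook

import (
    ctrl "sigs.k8s.io/controller-runtime"

    examplev1 "example.com/healing-operator/api/v1"
)

// SetupHealingServiceWebhookWithManager registers the validator so the webhook
// server answers at /validate-example-com-v1-healingservice.
func SetupHealingServiceWebhookWithManager(mgr ctrl.Manager) error {
    return ctrl.NewWebhookManagedBy(mgr).
        For(&examplev1.HealingService{}).
        WithValidator(&HealingServiceCustomValidator{}).
        Complete()
}

Call this from main() alongside SetupWithManager; the kubebuilder scaffolding wires up the webhook Service and TLS certificates (typically via cert-manager) when you run make deploy.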
Advanced Self-Healing Patterns
Beyond simple pod recreation, sophisticated healing strategies require additional Kubernetes primitives.
Finalizers for Clean-up
Finalizers block deletion until all associated resources are properly cleaned up, preventing orphaned resources during healing cycles.
metadata:
  finalizers:
    - example.com/cleanup-protection
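The finalizer only blocks deletion; the controller still has to add it, run cleanup once deletionTimestamp is set, and then remove it so the delete can complete. A minimal sketch using the controllerutil helpers (cleanupExternalResources is a hypothetical helper for whatever external state the operator owns):

const cleanupFinalizer = "example.com/cleanup-protection"

func (r *HealingServiceReconciler) handleFinalizer(ctx context.Context, service *examplev1.HealingService) (done bool, err error) {
    // Object is not being deleted: make sure our finalizer is present.
    if service.DeletionTimestamp.IsZero() {
        if !controllerutil.ContainsFinalizer(service, cleanupFinalizer) {
            controllerutil.AddFinalizer(service, cleanupFinalizer)
            return false, r.Update(ctx, service)
        }
        return false, nil
    }
    // Object is being deleted: run cleanup, then release the finalizer.
    if controllerutil.ContainsFinalizer(service, cleanupFinalizer) {
        if err := r.cleanupExternalResources(ctx, service); err != nil { // hypothetical helper
            return false, err
        }
        controllerutil.RemoveFinalizer(service, cleanupFinalizer)
        if err := r.Update(ctx, service); err != nil {
            return false, err
        }
    }
    // Deletion can now proceed; the caller should stop reconciling.
    return true, nil
}

Call it at the top of Reconcile and return early once done is true.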
Event Recording
Publish Kubernetes events to provide audit trails for healing actions.
r.Recorder.Event(service, corev1.EventTypeNormal, "HealingTriggered",
    fmt.Sprintf("Recreated %d unhealthy pods", len(unhealthyPods)))
Rate Limiting and Requeue Strategies
controller-runtime rate-limits reconciliation retries through its workqueue. To customize the backoff, extend the SetupWithManager function shown earlier with an explicit rate limiter:
import (
    "time"

    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/controller"
)

func (r *HealingServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.HealingService{}).
        Owns(&corev1.Pod{}).
        WithOptions(controller.Options{
            RateLimiter: workqueue.NewMaxOfRateLimiter(
                workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 300*time.Second),
                // 5s retries for the first 5 failures per item, then 60s
                workqueue.NewItemFastSlowRateLimiter(5*time.Second, 60*time.Second, 5),
            ),
        }).
        Complete(r)
}
Metrics and Observability
Controller-runtime exposes Prometheus metrics automatically. Key metrics include:
- controller_runtime_reconcile_total: Total reconciliation attempts
- controller_runtime_reconcile_errors_total: Failed reconciliations
- controller_runtime_workqueue_depth: Current queue depth
- controller_runtime_workqueue_retries_total: Retry attempts
Access metrics at http://<pod-ip>:8080/metrics when deployed.
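Custom operator metrics can be exported on the same endpoint by registering collectors with controller-runtime's global registry. A minimal sketch (the healing_actions_total counter and its labels are illustrative):

package controller

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// healingActionsTotal counts self-healing actions taken by the reconciler; it is
// served on the manager's /metrics endpoint alongside the built-in metrics.
var healingActionsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "healing_actions_total",
        Help: "Number of self-healing actions taken, labelled by HealingService.",
    },
    []string{"healingservice", "namespace"},
)

func init() {
    // metrics.Registry is the registry exposed by the manager's metrics server.
    metrics.Registry.MustRegister(healingActionsTotal)
}

Increment it from the healing branch of Reconcile with healingActionsTotal.WithLabelValues(service.Name, service.Namespace).Inc().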
Getting Started
- Install Kubebuilder: go install sigs.k8s.io/kubebuilder/v4/cmd/kubebuilder@latest
- Initialize the project: kubebuilder init --domain example.com --repo example.com/healing-operator
- Create the API: kubebuilder create api --group example --version v1 --kind HealingService
- Create the webhook: kubebuilder create webhook --group example --version v1 --kind HealingService --defaulting --programmatic-validation
- Implement the Reconcile logic in internal/controller/healingservice_controller.go
- Implement webhook validation in internal/webhook/healingservice_webhook.go
- Add the RBAC markers above the Reconcile function
- Run locally with make run, or deploy with make install && make deploy
- Enable leader election in production by passing the --leader-elect flag to the manager deployment
- Create a test instance: kubectl apply -f config/samples/example_v1_healingservice.yaml