Infrastructure as Code: CI/CD, Kubernetes, and Zero-Downtime Deployments

The gap between how infrastructure was managed five years ago and how it should be managed today is wider than most engineering leaders recognise. Configuration drift, undocumented server states, manual deployment steps, and the absence of rollback capability are not technical debt in the abstract — they are active business risks that materialise as incidents.

Infrastructure as Code (IaC), combined with automated CI/CD pipelines and container orchestration, eliminates this class of risk. This guide covers the engineering patterns that make this work in production.

Why Infrastructure as Code

The defining principle of IaC is that infrastructure configuration is stored as version-controlled code and applied programmatically — never through manual console clicks, SSH sessions to production, or undocumented tribal knowledge.

The benefits compound:

Reproducibility. Spinning up a staging environment that is identical to production takes minutes, not days. Every environment — development, staging, production — is defined in code and can be created from scratch deterministically.

Drift elimination. Configuration drift is when production no longer matches what the team believes it is. IaC counters drift because the code is the authoritative definition: any deviation is either detected, by a plan or compliance scan that compares live resources with the code, or corrected by re-applying the configuration.

Audit trails. Every infrastructure change goes through a pull request with review and approval, and is recorded in the Git history. "Who changed the load balancer timeout?" is answered by a `git blame` rather than by digging through an incident postmortem.

Disaster recovery. When an environment needs to be rebuilt from scratch — because of a catastrophic failure, a region outage, or a security incident — IaC reduces the recovery time from days to hours.
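To make the drift point concrete, here is a minimal Python sketch of what a drift check boils down to: comparing the declared configuration with what is actually live. The setting names and values are hypothetical, and real tools compare full resource graphs rather than flat dictionaries.

```python
from typing import Any, Dict, Tuple


def detect_drift(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (declared_value, live_value)} for every setting that differs.

    A setting that exists on only one side is reported with None on the other.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical load balancer settings: what the code declares vs. what is live.
desired_state = {"idle_timeout": 60, "tls_policy": "TLS-1-2", "deletion_protection": True}
actual_state = {"idle_timeout": 120, "tls_policy": "TLS-1-2", "access_logs": True}

for setting, (want, have) in detect_drift(desired_state, actual_state).items():
    print(f"{setting}: declared={want!r} live={have!r}")
```

An empty result means the live environment matches the code; anything else is either a manual change to revert or a change that should be captured in a pull request.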

Terraform as the Foundation

Terraform is the most widely adopted IaC tool for cloud infrastructure, and for good reason. Its declarative configuration language (HCL), provider ecosystem spanning all major cloud platforms, and state management model make it the right default for greenfield infrastructure projects.

Module Architecture

Organise Terraform into reusable modules that represent logical infrastructure components:

infrastructure/
├── modules/
│   ├── networking/        # VPC, subnets, security groups
│   ├── compute/           # ECS/GKE clusters, node pools
│   ├── database/          # RDS/Cloud SQL instances, parameter groups
│   ├── cdn/               # CloudFront/Cloudflare distributions
│   └── monitoring/        # CloudWatch/Datadog dashboards, alerts
├── environments/
│   ├── staging/
│   │   └── main.tf        # Calls modules with staging-specific vars
│   └── production/
│       └── main.tf        # Calls modules with production vars
└── shared/
    └── state-backend.tf   # Remote state configuration

Modules should be parameterised, not environment-specific. The environment configuration calls the module with the appropriate variable values.

Remote State and Locking

Never store Terraform state locally. Use remote state backends (S3 + DynamoDB for AWS, GCS for GCP, Terraform Cloud) with state locking to prevent concurrent modifications from corrupting state.

terraform {
  backend "s3" {
    bucket         = "tecsynth-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
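Why locking matters can be shown with a small Python sketch. It is an in-memory stand-in, not the backend's implementation: the class and the run names are hypothetical, and the only idea it models is "write the lock record only if nobody else holds one".

```python
from typing import Dict


class LockHeldError(RuntimeError):
    """Raised when another run already holds the lock for a state file."""


class StateLockTable:
    """In-memory stand-in for a lock table keyed by state file path."""

    def __init__(self) -> None:
        self._holders: Dict[str, str] = {}

    def acquire(self, state_key: str, owner: str) -> None:
        # setdefault writes only if the key is absent, mimicking a conditional put.
        holder = self._holders.setdefault(state_key, owner)
        if holder != owner:
            raise LockHeldError(f"{state_key} is locked by {holder}")

    def release(self, state_key: str, owner: str) -> None:
        # Only the current holder may release the lock.
        if self._holders.get(state_key) == owner:
            del self._holders[state_key]


locks = StateLockTable()
locks.acquire("production/terraform.tfstate", owner="ci-run-101")
try:
    # A second, concurrent run is refused instead of overwriting state.
    locks.acquire("production/terraform.tfstate", owner="ci-run-102")
except LockHeldError as err:
    print(f"second run refused: {err}")
locks.release("production/terraform.tfstate", owner="ci-run-101")
locks.acquire("production/terraform.tfstate", owner="ci-run-102")  # now succeeds
```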

Containerisation with Docker

Containers are the unit of deployment in modern infrastructure. Every application service should be containerised — the same image that runs in CI runs in staging and production, with environment-specific configuration injected at runtime via environment variables or secrets.
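As an illustration of runtime configuration injection, the following Python sketch builds an application config from environment variables and fails fast when a required one is missing. The variable names (`DATABASE_URL`, `LOG_LEVEL`, `PORT`) and the defaults are assumptions chosen for the example, not something the article prescribes.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AppConfig:
    database_url: str
    log_level: str
    port: int


def load_config(env: Mapping[str, str] = os.environ) -> AppConfig:
    """Build the config from environment variables, failing fast if one is missing."""
    required = ("DATABASE_URL",)
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
    return AppConfig(
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "info"),
        port=int(env.get("PORT", "3000")),
    )


# The same code (and therefore the same image) behaves differently per environment.
staging = load_config({"DATABASE_URL": "postgres://staging-db/app", "LOG_LEVEL": "debug"})
production = load_config({"DATABASE_URL": "postgres://prod-db/app"})
print(staging)
print(production)
```

Failing at startup on a missing variable surfaces a misconfigured environment during the rollout, before the pod is ever marked ready.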

Multi-Stage Builds

Production Docker images should be minimal. Multi-stage builds compile the application in a full builder image and copy only the compiled output into a minimal runtime image:

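# Build stage: install all dependencies (including devDependencies) and compile.
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: copy only the build output into a clean image.
# Assumes a Next.js app built with `output: "standalone"` in next.config.js.
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
COPY --from=builder /app/public ./public
EXPOSE 3000
CMD ["node", "server.js"]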

This produces images that are typically 70–80% smaller than naive single-stage builds — smaller images pull faster, have a smaller attack surface, and cost less to store.

Image Security

Scan images for CVEs before pushing to production registries. Tools like Trivy, Snyk, or AWS ECR scanning catch known vulnerabilities in base images and dependencies. Integrate scanning into the CI pipeline and gate deployments on scan results.
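Gating on scan results can be as simple as the Python sketch below. The report shape (`findings` with `id` and `severity`) is a simplified, hypothetical format used only for illustration; each scanner has its own output schema, so a real gate would parse that instead. The finding IDs are placeholders.

```python
import json
from typing import List

SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def blocking_findings(report: dict, fail_on: str = "HIGH") -> List[str]:
    """Return the IDs of findings at or above the fail_on severity."""
    threshold = SEVERITY_RANK[fail_on]
    return [
        finding["id"]
        for finding in report.get("findings", [])
        # Severities missing from SEVERITY_RANK rank as 0 and never block.
        if SEVERITY_RANK.get(finding["severity"], 0) >= threshold
    ]


# Hypothetical scan report with placeholder finding IDs.
report = json.loads("""
{"image": "app:example", "findings": [
  {"id": "EXAMPLE-0001", "severity": "LOW"},
  {"id": "EXAMPLE-0002", "severity": "CRITICAL"}
]}
""")

blockers = blocking_findings(report)
exit_code = 1 if blockers else 0  # a CI step would exit with this code
print(f"blocking findings: {blockers}, exit code: {exit_code}")
```

A non-zero exit code fails the pipeline step, which is what stops the image from being promoted.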

CI/CD Pipeline Architecture

A well-designed CI/CD pipeline is the automated delivery mechanism that takes a code change from a developer's machine to production safely and repeatably.

Pipeline Stages

A production-grade pipeline typically has these stages in sequence:

1. Lint and Type Check — Fast feedback on obvious errors. Should complete in under 2 minutes. Block merge on failure.

2. Unit and Integration Tests — Automated test suite runs against the change. Coverage requirements gate the merge.

3. Build — Compile the application and build the Docker image. Tag the image with the Git commit SHA.

4. Security Scanning — Scan the built image for CVEs. Scan infrastructure code with Checkov or Terrascan.

5. Deploy to Staging — Apply Terraform changes to the staging environment. Deploy the new image to the staging Kubernetes cluster. Run smoke tests against staging.

6. Integration/E2E Tests — Run end-to-end tests against the staging deployment. This is the highest-confidence gate before production.

7. Deploy to Production — Trigger a rolling deployment or blue-green switch in production. Monitor error rates and latency for 5–10 minutes. Automated rollback if metrics degrade beyond threshold.
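The automated rollback decision in stage 7 can be sketched as a comparison between pre-deploy and post-deploy metrics. This is a minimal Python sketch under assumed thresholds (one percentage point of extra errors, 1.5x the baseline p95 latency); the right values depend on the service, and the metrics would come from the monitoring system rather than literals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.01 means 1%
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def should_roll_back(
    baseline: Metrics,
    current: Metrics,
    max_error_rate_increase: float = 0.01,
    max_latency_ratio: float = 1.5,
) -> bool:
    """Return True if the new version degraded beyond either threshold."""
    error_regressed = current.error_rate - baseline.error_rate > max_error_rate_increase
    latency_regressed = current.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    return error_regressed or latency_regressed


baseline = Metrics(error_rate=0.002, p95_latency_ms=180)
healthy = Metrics(error_rate=0.003, p95_latency_ms=190)
degraded = Metrics(error_rate=0.04, p95_latency_ms=200)

print(should_roll_back(baseline, healthy))   # within both thresholds
print(should_roll_back(baseline, degraded))  # error rate rose by 3.8 points
```

Comparing against a baseline, rather than a fixed absolute limit, avoids rolling back a healthy release just because the service was already noisy before the deploy.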

GitHub Actions Implementation

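name: Deploy
on:
  push:
    branches: [main]

env:
  # Assumption: the registry host is stored as a repository variable named ECR_REGISTRY.
  ECR_REGISTRY: ${{ vars.ECR_REGISTRY }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:ci

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes Docker has been authenticated to the registry (login step omitted here).
      - name: Build and push image
        run: |
          docker build -t "$ECR_REGISTRY/app:$GITHUB_SHA" .
          docker push "$ECR_REGISTRY/app:$GITHUB_SHA"

  deploy-staging:
    needs: build
    environment: staging
    runs-on: ubuntu-latest
    steps:
      # Assumes kubectl is already configured with credentials for the staging cluster.
      - name: Deploy to staging
        run: |
          kubectl set image deployment/web-app app="$ECR_REGISTRY/app:$GITHUB_SHA"
          kubectl rollout status deployment/web-app --timeout=5m

  deploy-production:
    needs: deploy-staging
    environment: production
    runs-on: ubuntu-latest
    steps:
      # Assumes kubectl is already configured with credentials for the production cluster.
      - name: Deploy to production
        run: |
          kubectl set image deployment/web-app app="$ECR_REGISTRY/app:$GITHUB_SHA"
          kubectl rollout status deployment/web-app --timeout=5m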

Kubernetes Orchestration

Kubernetes is the de facto standard for container orchestration at scale. The key workload configurations for production web services:

Deployment Configuration

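apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app          # required in apps/v1; must match the pod template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          # Placeholder tag: the pipeline sets the commit-SHA image on each deploy.
          image: app:latest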
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20

The key settings: `maxUnavailable: 0` ensures no old pod is terminated until its replacement has passed its readiness probe, which is what makes the rollout zero-downtime. Readiness probes keep traffic away from pods that are not yet ready to serve. Resource limits stop a runaway process from starving other workloads on the node.
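The effect of `maxSurge` and `maxUnavailable` on pod counts during a rollout is simple arithmetic, sketched here in Python. The helper names are hypothetical. Percentages follow the Kubernetes convention of rounding `maxSurge` up and `maxUnavailable` down.

```python
import math
from typing import Tuple, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(
    replicas: int, max_surge: Union[int, str], max_unavailable: Union[int, str]
) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


# The manifest above: 3 replicas, maxSurge 1, maxUnavailable 0.
print(rollout_bounds(3, 1, 0))           # (3, 4)
# Percentage form on a larger deployment.
print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```

With the manifest's settings, capacity never drops below 3 ready pods and the cluster needs headroom for a fourth pod while the rollout is in progress.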

Blue-Green Deployments

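For changes that are too risky for rolling updates (breaking changes, or releases that ship alongside database migrations), blue-green deployments provide a near-instant switchover and an equally fast rollback, provided both versions can run against the same database schema: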

1. Deploy the new version as a separate Deployment ("green")
2. Run smoke tests against the green deployment via a separate service endpoint
3. Switch the production Service selector from the blue to green label
4. Monitor for 10–15 minutes
5. Decommission the blue deployment if metrics are healthy; switch back if they degrade

The switchover and rollback are both instantaneous — a `kubectl apply` that changes the Service selector. This is the highest-confidence deployment pattern for high-stakes changes.
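The selector switch can be modelled in a few lines of Python. This sketch only manipulates dictionaries shaped like a Service manifest; the `version` label and the pod names are assumptions for illustration, and in practice the updated manifest would be applied to the cluster.

```python
from typing import List


def switch_traffic(service: dict, target_version: str) -> dict:
    """Return a copy of the Service manifest whose selector targets target_version."""
    updated = {**service, "spec": {**service["spec"]}}
    updated["spec"]["selector"] = {**service["spec"]["selector"], "version": target_version}
    return updated


def pods_receiving_traffic(service: dict, pods: List[dict]) -> List[str]:
    """Return names of pods whose labels include every selector key/value pair."""
    selector = service["spec"]["selector"]
    return [pod["name"] for pod in pods if selector.items() <= pod["labels"].items()]


pods = [
    {"name": "web-app-blue-1", "labels": {"app": "web-app", "version": "blue"}},
    {"name": "web-app-green-1", "labels": {"app": "web-app", "version": "green"}},
]
live = {"kind": "Service", "spec": {"selector": {"app": "web-app", "version": "blue"}}}

cutover = switch_traffic(live, "green")      # switchover
rollback = switch_traffic(cutover, "blue")   # rollback is the same operation in reverse

print(pods_receiving_traffic(live, pods))
print(pods_receiving_traffic(cutover, pods))
print(pods_receiving_traffic(rollback, pods))
```

Because both Deployments keep running during the monitoring window, rolling back changes only which pods the selector matches; nothing has to be rebuilt or rescheduled.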

Observability Infrastructure

Infrastructure without observability is infrastructure you cannot trust. Three pillars:

Metrics — Application and infrastructure metrics (CPU, memory, request rate, error rate, latency percentiles) streamed to Prometheus or Datadog. Alerting rules on error rate spikes and latency degradation.

Logs — Structured JSON logs aggregated in a centralised log store (Loki, Elasticsearch, CloudWatch). Every log line should include trace ID, tenant ID (for multi-tenant systems), request ID, and severity.

Traces — Distributed tracing via OpenTelemetry captures the full path of a request across services. Essential for diagnosing latency issues in microservice architectures.
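For the logging pillar, a structured JSON log line with the fields listed above can be produced with Python's standard `logging` module alone. This is a minimal sketch: the field names follow the list above, while the logger name and the example IDs are placeholders.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    CONTEXT_FIELDS = ("trace_id", "tenant_id", "request_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # Context passed via `extra=` becomes attributes on the record.
        for field in self.CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("web-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "order created",
    extra={"trace_id": "trace-123", "tenant_id": "tenant-a", "request_id": "req-456"},
)
```

Emitting the context fields as top-level keys lets the log store index them, so a trace ID taken from a slow trace leads directly to the matching log lines.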

The Starting Point

If your team is not yet using IaC, the path forward is incremental:

1. Start with Terraform for new infrastructure; do not touch existing manually configured infrastructure yet
2. Containerise one application service at a time
3. Build the CI/CD pipeline in parallel; even a pipeline that just runs tests and builds an image is valuable
4. Add Kubernetes after the containerisation and CI/CD foundations are in place

The compounding value of this investment accrues over years. The organisations with the best deployment reliability, the fastest incident recovery, and the lowest infrastructure operational burden are consistently those that invested in IaC and CI/CD before they were under pressure to — not in response to a production incident.