running databases on kubernetes used to be impractical. the storage layer was unreliable, operators were immature, and the failure modes were poorly understood. local nvme, mature csi drivers, and purpose-built operators have changed that.
cloudnativepg - or cnpg - is a kubernetes operator built specifically for postgresql. it doesn’t try to support every database or abstract postgres behind a generic interface. it makes postgresql work the way kubernetes resources are supposed to work: declarative, reconciled, observable.
why an operator
running postgres in a container is straightforward. running a production postgres cluster with streaming replication, automatic failover, point-in-time recovery, connection pooling, and tls requires significant operational effort. an operator encodes that knowledge into a controller that watches the desired state and reconciles toward it.
cnpg manages the full lifecycle: bootstrapping new clusters, adding replicas, promoting standbys during failover, taking backups, restoring from them. the desired state is described in a Cluster custom resource and the operator handles reconciliation.
the alternative is managing all of this with statefulsets, init containers, sidecar processes, and bash scripts held together by convention. these setups tend to become fragile and are difficult to hand off.
installing the operator
cnpg installs cleanly via helm:
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg cnpg/cloudnative-pg \
  --namespace cnpg-system \
  --create-namespace
the operator runs in its own namespace and watches all namespaces by default. it installs the Cluster, Backup, ScheduledBackup, Pooler, and related crds. once it’s running, postgres clusters are deployed by creating Cluster resources in the application’s namespace.
defining a cluster
a minimal production cluster looks like this:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example
  namespace: production
spec:
  instances: 3
  storage:
    size: 50Gi
    storageClass: gp3
  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      max_connections: "200"
      work_mem: "4MB"
      maintenance_work_mem: "64MB"
      wal_buffers: "8MB"
      random_page_cost: "1.1"
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      memory: "2Gi"
  monitoring:
    enablePodMonitor: true
three instances result in one primary and two streaming replicas. cnpg handles the replication topology automatically - the primary accepts writes, replicas stream wal from it and serve reads. the operator continuously monitors replication lag and instance health.
the storageClass should be a csi driver that supports volume snapshots for fast backups. gp3 on aws, premium-ssd-v2 on azure, pd-ssd on gcp. network-attached block storage with high latency variance should be avoided - postgres is sensitive to i/o latency on wal writes.
the postgresql parameters block maps directly to postgresql.conf. cnpg does not abstract over postgres configuration. standard postgres tuning applies, and the operator handles rolling restarts when settings change.
bootstrapping and initialization
new clusters start empty by default. cnpg also supports bootstrapping from a backup, an existing cluster, or a sql script.
bootstrap from a sql file:
spec:
  bootstrap:
    initdb:
      database: example
      owner: example
      postInitApplicationSQL:
        - CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
        - CREATE EXTENSION IF NOT EXISTS pgcrypto;
bootstrap from a backup - useful for cloning production to staging:
spec:
  bootstrap:
    recovery:
      source: example-backup
      recoveryTarget:
        targetTime: "2025-05-09T14:00:00Z"
the recoveryTarget enables point-in-time recovery. cnpg restores the base backup and replays wal segments up to the specified timestamp. this allows recovery from accidental data loss to the exact moment before the problem occurred, rather than falling back to the last nightly dump.
backups
cnpg supports two backup methods: volume snapshots and object store via barman.
volume snapshots use the csi snapshot api to take a consistent point-in-time copy of the storage volume. they’re fast because they’re copy-on-write at the storage layer:
spec:
  backup:
    volumeSnapshot:
      className: csi-snapclass
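the className references a VolumeSnapshotClass that must already exist in the cluster. a minimal one for aws ebs might look like this (the driver name assumes the aws ebs csi driver; other providers use their own driver names):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
driver: ebs.csi.aws.com
# Retain keeps the underlying storage snapshot even if the
# kubernetes VolumeSnapshot object is deleted
deletionPolicy: Retain
```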
for object store backups, cnpg uses barman under the hood. it connects to an s3 bucket, azure blob container, or gcs bucket and streams base backups and wal archives continuously:
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://database-backups/example
      s3Credentials:
        accessKeyId:
          name: aws-credentials
          key: access_key_id
        secretAccessKey:
          name: aws-credentials
          key: secret_access_key
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
wal archiving is the critical component. base backups provide a snapshot in time. continuous wal archiving captures every transaction between snapshots. together they enable point-in-time recovery to any moment, not just the last backup window.
schedule backups with a ScheduledBackup resource:
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: example-daily
spec:
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: example
  method: barmanObjectStore
this schedules a base backup every night at 2am - note that cnpg uses a six-field cron format with seconds first, not the five-field format kubernetes cronjobs use. wal segments stream continuously in between. backupOwnerReference: self sets the scheduled backup as the owner of each backup resource, so the backups are garbage collected when the ScheduledBackup is deleted.
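one-off backups - before a risky migration, for example - can be taken on demand by creating a Backup resource directly (the name here is illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: example-pre-migration
spec:
  cluster:
    name: example
  method: barmanObjectStore
```

progress and completion show up in the resource's status, visible via kubectl get backup.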
failover
failover traditionally requires external tooling like pacemaker or patroni. cnpg handles it natively.
the operator monitors every instance with a configurable health check. when the primary becomes unreachable - node failure, network partition, oom kill - cnpg promotes the most up-to-date replica to primary. the remaining replicas repoint their replication to the new primary. the whole process takes seconds.
to inspect the current state:
kubectl cnpg status example
the output shows the current topology, replication lag per replica, wal position, and timeline. during a failover the primary switch and replica re-attachment are visible here.
fencing is built in. if a failed instance comes back, cnpg doesn’t blindly reattach it. the old primary might have transactions that diverged from the new timeline. cnpg fences the instance and either rebuilds it from the new primary’s data or requires manual intervention, depending on how far it diverged.
manual switchovers for maintenance are also supported:
kubectl cnpg promote example example-2
this specifies the cluster and the target instance. the operator orchestrates a clean switchover - draining connections, waiting for replication to catch up, promoting, and reconfiguring.
connection pooling
most production postgres deployments need connection pooling. application frameworks that open a connection per request will exhaust max_connections quickly, and raising that limit trades memory for concurrency in a way that doesn’t scale.
cnpg includes a built-in pgbouncer integration via the Pooler resource:
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: example-pooler-rw
spec:
  cluster:
    name: example
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
      max_client_conn: "200"
separate poolers can be created for read-write (rw) and read-only (ro) traffic. the rw pooler routes to the primary. the ro pooler load-balances across replicas. applications connect to the pooler service instead of the database directly.
transaction-mode pooling works for most workloads. it assigns a server connection for the duration of a transaction, then returns it to the pool. the trade-off is that session state - prepared statements, advisory locks, SET commands - doesn't carry across transactions. session mode ties a connection to a client session for its whole lifetime, which preserves that state but largely forfeits the benefit of pooling for short-lived connections.
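applications then point at the pooler's service, whose name follows the Pooler's metadata.name, instead of the cluster's rw service. a deployment fragment might look like this (the secret reference assumes the auto-generated example-app credentials secret):

```yaml
env:
  # pooler service, not the cluster's example-rw service
  - name: PGHOST
    value: example-pooler-rw.production.svc
  - name: PGPORT
    value: "5432"
  - name: PGUSER
    valueFrom:
      secretKeyRef:
        name: example-app
        key: username
  - name: PGPASSWORD
    valueFrom:
      secretKeyRef:
        name: example-app
        key: password
```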
tls and authentication
cnpg generates tls certificates automatically for every cluster. all connections between instances use mutual tls. client connections can be configured to require tls as well:
spec:
  postgresql:
    pg_hba:
      - hostssl all all all scram-sha-256
this replaces the default pg_hba.conf and forces all client connections to use tls with scram authentication. cnpg stores the generated certificates in kubernetes secrets, which application pods can mount.
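clients that verify the server certificate need the cluster's ca, which cnpg stores in a secret named after the cluster with a -ca suffix. a pod can mount it like this (paths and names here are a sketch):

```yaml
volumes:
  - name: postgres-ca
    secret:
      secretName: example-ca
containers:
  - name: app
    volumeMounts:
      - name: postgres-ca
        mountPath: /etc/postgres-ca
        readOnly: true
```

the application then sets sslrootcert=/etc/postgres-ca/ca.crt and sslmode=verify-full in its connection string.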
for service-to-service authentication inside the cluster, the operator creates secrets with the connection credentials:
kubectl get secret example-app -o jsonpath='{.data.uri}' | base64 -d
this returns a connection string the application can consume directly. the secret is updated automatically when credentials rotate.
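the same secret can be wired into a pod directly, for example as a single environment variable:

```yaml
env:
  # full postgres connection uri from the cnpg-managed secret
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: example-app
        key: uri
```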
monitoring
cnpg ships with built-in prometheus metrics. setting enablePodMonitor: true in the cluster spec creates PodMonitor resources that prometheus picks up automatically when the prometheus operator is installed.
the exposed metrics cover the key areas of postgres monitoring:
- replication lag per replica, in bytes and seconds
- transaction rates - commits, rollbacks, conflicts
- buffer cache hit ratios
- connection counts by state - active, idle, waiting
- wal generation rate and archival status
- backup status - last successful backup, age, wal archiving lag
replication lag is typically the most important metric to alert on. if a replica falls behind, it may be under-resourced, the network between nodes may be saturated, or the storage can't keep up with wal writes. the alert threshold should match the target rpo - for example, if the rpo is 30 seconds, alert when lag exceeds 10 seconds.
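with the prometheus operator installed, that alert can be expressed as a PrometheusRule. the 10-second threshold, labels, and names below are illustrative - match them to the actual rpo and alerting setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-postgres-alerts
spec:
  groups:
    - name: postgres
      rules:
        - alert: PostgresReplicationLagHigh
          expr: cnpg_pg_replication_lag{cluster="example"} > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "replica lag above 10s on {{ $labels.pod }}"
```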
cnpg also exposes a json status endpoint on each instance. the kubectl cnpg status command provides a single-pane view of the cluster, which can be referenced in runbooks for on-call assessments.
useful prometheus queries:
# replication lag in seconds
cnpg_pg_replication_lag{cluster="example"}
# backup age - alert if older than 25 hours
time() - cnpg_collector_last_available_backup_timestamp{cluster="example"}
# cache hit ratio - should be above 0.99 (requires custom monitoring query)
cnpg_pg_stat_database_blks_hit / (cnpg_pg_stat_database_blks_hit + cnpg_pg_stat_database_blks_read)
upgrades
minor version upgrades - 16.9 to 16.10, for example - are handled by changing the image tag in the cluster spec:
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:16.10
cnpg performs a rolling update: it upgrades replicas first, one at a time, verifies replication health, then does a switchover to a replica and upgrades the old primary last. zero downtime if the application handles reconnects - the pooler absorbs the brief interruption during switchover.
major version upgrades - 15 to 16 - are more involved. physical backups are not compatible across major versions, so the bootstrap-from-backup flow doesn't apply. cnpg instead supports a logical import: spin up a new cluster on the target version, bootstrap it with initdb's import option, which copies the data from the running old cluster via pg_dump and pg_restore, verify, and switch traffic. newer operator releases also add a declarative offline in-place path based on pg_upgrade.
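a sketch of the logical import flow, assuming the old cluster is reachable at its rw service and its superuser secret follows cnpg's naming convention:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-16
  namespace: production
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.10
  storage:
    size: 50Gi
  bootstrap:
    initdb:
      import:
        # microservice imports a single database into the new cluster
        type: microservice
        databases:
          - example
        source:
          externalCluster: example-15
  externalClusters:
    - name: example-15
      connectionParameters:
        host: example-rw.production.svc
        user: postgres
        dbname: postgres
      password:
        name: example-superuser
        key: password
```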
what to watch out for
storage latency. postgres performance is directly tied to storage latency. wal writes are synchronous - every commit waits for the wal flush. storage with tail latency spikes will cause transaction latency spikes. benchmarking the storage class with pgbench before use is recommended.
memory tuning. shared_buffers should be about 25% of available memory. effective_cache_size should be about 75%. cnpg does not tune these automatically - they need to be set based on the instance’s resource limits.
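as a sanity check, the rule of thumb for a pod with a 2Gi memory limit works out like this (a back-of-the-envelope sketch, not an official sizing tool - the minimal cluster example earlier deliberately used more conservative values):

```shell
# rule of thumb: shared_buffers ~25% of the pod memory limit,
# effective_cache_size ~75%
mem_mib=2048                                # 2Gi limit
shared_buffers=$((mem_mib / 4))             # 512
effective_cache_size=$((mem_mib * 3 / 4))   # 1536
echo "shared_buffers=${shared_buffers}MB"
echo "effective_cache_size=${effective_cache_size}MB"
```

the percentages are a starting point; workload-specific tuning still applies.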
failover testing. deleting a primary pod in staging and observing the promotion helps validate the failover path. measuring promotion time and verifying application reconnects are both worth doing before relying on failover in production.
wal archiving failures. if wal archiving to the object store stops working, the point-in-time recovery window stops growing. the base backup might still be valid, but recovery is limited to the last archived wal segment. monitoring cnpg_collector_pg_stat_archiver_failed_count and alerting on any non-zero value catches this early.
pod disruption budgets. cnpg creates pdbs automatically. during node maintenance, kubernetes needs to know it can’t evict all database pods simultaneously. the default pdb allows one instance to be unavailable at a time - verify this matches the expected maintenance behavior.
references
[1] cloudnativepg documentation. “architecture.”
cloudnative-pg.io/documentation/current/architecture
[2] cloudnativepg documentation. “backup and recovery.”
cloudnative-pg.io/documentation/current/backup
[3] cloudnativepg documentation. “connection pooling.”
cloudnative-pg.io/documentation/current/connection_pooling
[4] cloudnativepg documentation. “monitoring.”
cloudnative-pg.io/documentation/current/monitoring
[5] cloudnativepg documentation. “failover.”
cloudnative-pg.io/documentation/current/failover
[6] cloudnativepg documentation. “bootstrap.”
cloudnative-pg.io/documentation/current/bootstrap
[7] postgresql documentation. “wal configuration.”
postgresql.org/docs/current/wal-configuration.html