running databases on kubernetes used to be impractical. the storage layer was unreliable, operators were immature, and the failure modes were poorly understood. local nvme, mature csi drivers, and purpose-built operators have changed that.
cloudnativepg - or cnpg - is a kubernetes operator built specifically for postgresql. it doesn’t try to support every database or abstract postgres behind a generic interface. it makes postgresql work the way kubernetes resources are supposed to work: declarative, reconciled, observable.
why an operator
running postgres in a container is straightforward. running a production postgres cluster with streaming replication, automatic failover, point-in-time recovery, connection pooling, and tls requires significant operational effort. an operator encodes that knowledge into a controller that watches the desired state and reconciles toward it.
cnpg manages the full lifecycle: bootstrapping new clusters, adding replicas, promoting standbys during failover, taking backups, restoring from them. the desired state is described in a Cluster custom resource and the operator handles reconciliation.
the alternative is managing all of this with statefulsets, init containers, sidecar processes, and bash scripts held together by convention. these setups tend to become fragile and are difficult to hand off.
installing the operator
cnpg installs cleanly via helm:
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg cnpg/cloudnative-pg \
  --namespace cnpg-system \
  --create-namespace
the operator runs in its own namespace and watches all namespaces by default. it installs the Cluster, Backup, ScheduledBackup, Pooler, and related crds. once it’s running, postgres clusters are deployed by creating Cluster resources in the application’s namespace.
defining a cluster
a minimal production cluster looks like this:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example
  namespace: production
spec:
  instances: 3
  storage:
    size: 50Gi
    storageClass: gp3
  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      max_connections: "200"
      work_mem: "4MB"
      maintenance_work_mem: "64MB"
      wal_buffers: "8MB"
      random_page_cost: "1.1"
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      memory: "2Gi"
  monitoring:
    enablePodMonitor: true
three instances result in one primary and two streaming replicas. cnpg handles the replication topology automatically - the primary accepts writes, replicas stream wal from it and serve reads. the operator continuously monitors replication lag and instance health.
the storageClass should be a csi driver that supports volume snapshots for fast backups. gp3 on aws, premium-ssd-v2 on azure, pd-ssd on gcp. network-attached block storage with high latency variance should be avoided - postgres is sensitive to i/o latency on wal writes.
the postgresql parameters block maps directly to postgresql.conf. cnpg does not abstract over postgres configuration. standard postgres tuning applies, and the operator handles rolling restarts when settings change.
bootstrapping and initialization
new clusters start empty by default. cnpg also supports bootstrapping from a backup, an existing cluster, or a sql script.
bootstrap from a sql file:
spec:
  bootstrap:
    initdb:
      database: example
      owner: example
      postInitApplicationSQL:
        - CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
        - CREATE EXTENSION IF NOT EXISTS pgcrypto;
bootstrap from a backup - useful for cloning production to staging:
spec:
  bootstrap:
    recovery:
      source: example-backup
      recoveryTarget:
        targetTime: "2025-05-09T14:00:00Z"
the recoveryTarget enables point-in-time recovery. cnpg restores the base backup and replays wal segments up to the specified timestamp. this allows recovery from accidental data loss to the exact moment before the problem occurred, rather than falling back to the last nightly dump.
backups
cnpg supports two backup methods: volume snapshots and object store via barman.
volume snapshots use the csi snapshot api to take a consistent point-in-time copy of the storage volume. they’re fast because they’re copy-on-write at the storage layer:
spec:
  backup:
    volumeSnapshot:
      className: csi-snapclass
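the className references a VolumeSnapshotClass that must already exist in the cluster. a minimal one for aws ebs might look like this (the driver name assumes the aws ebs csi driver; other providers use their own driver names):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
driver: ebs.csi.aws.com
# Retain keeps the underlying storage snapshot even if the
# kubernetes VolumeSnapshot object is deleted
deletionPolicy: Retain
```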
for object store backups, cnpg uses barman under the hood. it connects to an s3 bucket, azure blob container, or gcs bucket and streams base backups and wal archives continuously:
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://database-backups/example
      s3Credentials:
        accessKeyId:
          name: aws-credentials
          key: access_key_id
        secretAccessKey:
          name: aws-credentials
          key: secret_access_key
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
wal archiving is the critical component. base backups provide a snapshot in time. continuous wal archiving captures every transaction between snapshots. together they enable point-in-time recovery to any moment, not just the last backup window.
schedule backups with a ScheduledBackup resource:
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: example-daily
spec:
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: example
  method: barmanObjectStore
this schedules a base backup every night at 2am - note that cnpg uses a six-field cron format with seconds first, not the five-field format kubernetes cronjobs use. wal segments stream continuously in between. backupOwnerReference: self sets the scheduled backup as the owner of each backup resource, so the backups are garbage collected when the ScheduledBackup is deleted.
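one-off backups - before a risky migration, for example - can be taken on demand by creating a Backup resource directly (the name here is illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: example-pre-migration
spec:
  cluster:
    name: example
  method: barmanObjectStore
```

progress and completion show up in the resource's status, visible via kubectl get backup.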
failover
failover traditionally requires external tooling like pacemaker or patroni. cnpg handles it natively.
the operator monitors every instance with a configurable health check. when the primary becomes unreachable - node failure, network partition, oom kill - cnpg promotes the most up-to-date replica to primary. the remaining replicas repoint their replication to the new primary. the whole process takes seconds.
to inspect the current state:
kubectl cnpg status example
the output shows the current topology, replication lag per replica, wal position, and timeline. during a failover the primary switch and replica re-attachment are visible here.
fencing is built in. if a failed instance comes back, cnpg doesn’t blindly reattach it. the old primary might have transactions that diverged from the new timeline. cnpg fences the instance and either rebuilds it from the new primary’s data or requires manual intervention, depending on how far it diverged.
manual switchovers for maintenance are also supported:
kubectl cnpg promote example example-2
this specifies the cluster and the target instance. the operator orchestrates a clean switchover - draining connections, waiting for replication to catch up, promoting, and reconfiguring.
connection pooling
most production postgres deployments need connection pooling. application frameworks that open a connection per request will exhaust max_connections quickly, and raising that limit trades memory for concurrency in a way that doesn’t scale.
cnpg includes a built-in pgbouncer integration via the Pooler resource:
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: example-pooler-rw
spec:
  cluster:
    name: example
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
      max_client_conn: "200"
separate poolers can be created for read-write (rw) and read-only (ro) traffic. the rw pooler routes to the primary. the ro pooler load-balances across replicas. applications connect to the pooler service instead of the database directly.
transaction-mode pooling works for most workloads. it assigns a server connection for the duration of a transaction, then returns it to the pool. the trade-off is that session state - prepared statements, advisory locks, SET commands - doesn't carry across transactions. session mode ties a connection to a client session for its whole lifetime, which preserves that state but largely forfeits the benefit of pooling for short-lived connections.
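applications then point at the pooler's service, whose name follows the Pooler's metadata.name, instead of the cluster's rw service. a deployment fragment might look like this (the secret reference assumes the auto-generated example-app credentials secret):

```yaml
env:
  # pooler service, not the cluster's example-rw service
  - name: PGHOST
    value: example-pooler-rw.production.svc
  - name: PGPORT
    value: "5432"
  - name: PGUSER
    valueFrom:
      secretKeyRef:
        name: example-app
        key: username
  - name: PGPASSWORD
    valueFrom:
      secretKeyRef:
        name: example-app
        key: password
```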
tls and authentication
cnpg generates tls certificates automatically for every cluster. all connections between instances use mutual tls. client connections can be configured to require tls as well:
spec:
  postgresql:
    pg_hba:
      - hostssl all all all scram-sha-256
this replaces the default pg_hba.conf and forces all client connections to use tls with scram authentication. cnpg stores the generated certificates in kubernetes secrets, which application pods can mount.
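clients that verify the server certificate need the cluster's ca, which cnpg stores in a secret named after the cluster with a -ca suffix. a pod can mount it like this (paths and names here are a sketch):

```yaml
volumes:
  - name: postgres-ca
    secret:
      secretName: example-ca
containers:
  - name: app
    volumeMounts:
      - name: postgres-ca
        mountPath: /etc/postgres-ca
        readOnly: true
```

the application then sets sslrootcert=/etc/postgres-ca/ca.crt and sslmode=verify-full in its connection string.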
for service-to-service authentication inside the cluster, the operator creates secrets with the connection credentials:
kubectl get secret example-app -o jsonpath='{.data.uri}' | base64 -d
this returns a connection string the application can consume directly. the secret is updated automatically when credentials rotate.
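the same secret can be wired into a pod directly, for example as a single environment variable:

```yaml
env:
  # full postgres connection uri from the cnpg-managed secret
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: example-app
        key: uri
```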
monitoring
cnpg ships with built-in prometheus metrics. setting enablePodMonitor: true in the cluster spec creates PodMonitor resources that prometheus picks up automatically when the prometheus operator is installed.
the exposed metrics cover the key areas of postgres monitoring:
- replication lag per replica, in bytes and seconds
- transaction rates - commits, rollbacks, conflicts
- buffer cache hit ratios
- connection counts by state - active, idle, waiting
- wal generation rate and archival status
- backup status - last successful backup, age, wal archiving lag
replication lag is typically the most important metric to alert on. if a replica falls behind, it may be under-resourced, the network between nodes may be saturated, or the storage can't keep up with wal writes. the alert threshold should match the target rpo - for example, if the rpo is 30 seconds, alert when lag exceeds 10 seconds.
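with the prometheus operator installed, that alert can be expressed as a PrometheusRule. the 10-second threshold, labels, and names below are illustrative - match them to the actual rpo and alerting setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-postgres-alerts
spec:
  groups:
    - name: postgres
      rules:
        - alert: PostgresReplicationLagHigh
          expr: cnpg_pg_replication_lag{cluster="example"} > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "replica lag above 10s on {{ $labels.pod }}"
```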
cnpg also exposes a json status endpoint on each instance. the kubectl cnpg status command provides a single-pane view of the cluster, which can be referenced in runbooks for on-call assessments.
useful prometheus queries:
# replication lag in seconds
cnpg_pg_replication_lag{cluster="example"}
# backup age - alert if older than 25 hours
time() - cnpg_collector_last_available_backup_timestamp{cluster="example"}
# cache hit ratio - should be above 0.99 (requires custom monitoring query)
cnpg_pg_stat_database_blks_hit / (cnpg_pg_stat_database_blks_hit + cnpg_pg_stat_database_blks_read)
upgrades
minor version upgrades - 16.9 to 16.10, for example - are handled by changing the image tag in the cluster spec:
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:16.10
cnpg performs a rolling update: it upgrades replicas first, one at a time, verifies replication health, then does a switchover to a replica and upgrades the old primary last. zero downtime if the application handles reconnects - the pooler absorbs the brief interruption during switchover.
major version upgrades - 15 to 16 - are more involved. physical backups are not compatible across major versions, so the bootstrap-from-backup flow doesn't apply. cnpg instead supports a logical import: spin up a new cluster on the target version, bootstrap it with initdb's import option, which copies the data from the running old cluster via pg_dump and pg_restore, verify, and switch traffic. newer operator releases also add a declarative offline in-place path based on pg_upgrade.
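a sketch of the logical import flow, assuming the old cluster is reachable at its rw service and its superuser secret follows cnpg's naming convention:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-16
  namespace: production
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.10
  storage:
    size: 50Gi
  bootstrap:
    initdb:
      import:
        # microservice imports a single database into the new cluster
        type: microservice
        databases:
          - example
        source:
          externalCluster: example-15
  externalClusters:
    - name: example-15
      connectionParameters:
        host: example-rw.production.svc
        user: postgres
        dbname: postgres
      password:
        name: example-superuser
        key: password
```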
what to watch out for
storage latency. postgres performance is directly tied to storage latency. wal writes are synchronous - every commit waits for the wal flush. storage with tail latency spikes will cause transaction latency spikes. benchmarking the storage class with pgbench before use is recommended.
memory tuning. shared_buffers should be about 25% of available memory. effective_cache_size should be about 75%. cnpg does not tune these automatically - they need to be set based on the instance’s resource limits.
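as a sanity check, the rule of thumb for a pod with a 2Gi memory limit works out like this (a back-of-the-envelope sketch, not an official sizing tool - the minimal cluster example earlier deliberately used more conservative values):

```shell
# rule of thumb: shared_buffers ~25% of the pod memory limit,
# effective_cache_size ~75%
mem_mib=2048                                # 2Gi limit
shared_buffers=$((mem_mib / 4))             # 512
effective_cache_size=$((mem_mib * 3 / 4))   # 1536
echo "shared_buffers=${shared_buffers}MB"
echo "effective_cache_size=${effective_cache_size}MB"
```

the percentages are a starting point; workload-specific tuning still applies.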
failover testing. deleting a primary pod in staging and observing the promotion helps validate the failover path. measuring promotion time and verifying application reconnects are both worth doing before relying on failover in production.
wal archiving failures. if wal archiving to the object store stops working, the point-in-time recovery window stops growing. the base backup might still be valid, but recovery is limited to the last archived wal segment. monitoring cnpg_collector_pg_stat_archiver_failed_count and alerting on any non-zero value catches this early.
pod disruption budgets. cnpg creates pdbs automatically. during node maintenance, kubernetes needs to know it can’t evict all database pods simultaneously. the default pdb allows one instance to be unavailable at a time - verify this matches the expected maintenance behavior.
references
[1] cloudnativepg documentation. “architecture.”
cloudnative-pg.io/documentation/current/architecture
[2] cloudnativepg documentation. “backup and recovery.”
cloudnative-pg.io/documentation/current/backup
[3] cloudnativepg documentation. “connection pooling.”
cloudnative-pg.io/documentation/current/connection_pooling
[4] cloudnativepg documentation. “monitoring.”
cloudnative-pg.io/documentation/current/monitoring
[5] cloudnativepg documentation. “failover.”
cloudnative-pg.io/documentation/current/failover
[6] cloudnativepg documentation. “bootstrap.”
cloudnative-pg.io/documentation/current/bootstrap
[7] postgresql documentation. “wal configuration.”
postgresql.org/docs/current/wal-configuration.html