The Certificate Expiry Problem
SSL/TLS certificate expiry is one of the most embarrassing and preventable causes of production outages. In 2023, major outages caused by expired certificates affected organisations including banks, telecoms, and government services, even though every certificate's expiry date is fixed and visible from the day it is issued. The problem is not awareness; it is process. When certificate renewal is a manual task owned by no one in particular, and the renewal reminder email arrives in a shared inbox that everyone assumes someone else monitors, expiries happen. The solution is automation, and with modern tooling it is entirely achievable.
Understanding the TLS Certificate Lifecycle
A TLS certificate binds a public key to a domain name (or set of domain names via Subject Alternative Names) and is signed by a Certificate Authority that browsers and operating systems trust. The certificate has a validity period: historically one to two years, now a maximum of 398 days for publicly trusted certificates. Google proposed a 90-day maximum in 2023, and the CA/Browser Forum has since approved a phased reduction to 200 days from March 2026, 100 days from March 2027, and 47 days from March 2029. The certificate lifecycle involves five steps: key pair and CSR generation; submission to a CA, which performs domain validation (DV), organisation validation (OV), or extended validation (EV); issuance of the signed certificate; installation on the web server or load balancer; and renewal before expiry. Automating this lifecycle requires CA support for a machine-readable issuance protocol, which is what ACME provides.
Let's Encrypt and the ACME Protocol
Let's Encrypt, operated by the Internet Security Research Group (ISRG), is a free, automated CA that issues Domain Validated certificates via the ACME protocol (RFC 8555). Over 300 million websites use Let's Encrypt certificates, making it the world's largest CA by issuance volume. ACME works by having the client prove control of the domain through a challenge. Several challenge types exist (including TLS-ALPN-01, defined in RFC 8737), but two cover most deployments:
- HTTP-01: The ACME client places a specific file at http://yourdomain.com/.well-known/acme-challenge/TOKEN. The CA fetches it to verify you control the domain's web server. Requires port 80 to be accessible. Does not work for wildcard certificates.
- DNS-01: The ACME client creates a specific TXT record at _acme-challenge.yourdomain.com. The CA verifies it via DNS lookup. Works for wildcard certificates and does not require port 80 access — ideal for internal services. Requires programmatic DNS access (most major DNS providers including Cloudflare, Route 53, and Azure DNS support API-driven TXT record creation).
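The values an ACME client publishes for these two challenges can be derived with a few lines of standard-library Python. The sketch below follows the key-authorization rules in RFC 8555 (sections 8.1, 8.3, and 8.4) and the JWK thumbprint rules in RFC 7638. The function names are illustrative, and any account key passed in is assumed to be the public JWK of your ACME account; a real client such as Certbot does all of this for you.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as ACME and JOSE require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required public JWK members,
    serialised with keys in lexicographic order and no whitespace."""
    required = {"RSA": ("e", "kty", "n"), "EC": ("crv", "kty", "x", "y")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """The challenge token joined to the account key thumbprint."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"


def http01_response(token: str, account_jwk: dict) -> tuple:
    """Return (URL path, file body) that must be served over port 80."""
    path = f"/.well-known/acme-challenge/{token}"
    return path, key_authorization(token, account_jwk)


def dns01_record(domain: str, token: str, account_jwk: dict) -> tuple:
    """Return (TXT record name, TXT record value) for a DNS-01 challenge.

    A wildcard name such as *.example.com is validated at the base
    domain, so the leading '*.' label is dropped.
    """
    base = domain[2:] if domain.startswith("*.") else domain
    digest = hashlib.sha256(key_authorization(token, account_jwk).encode("ascii")).digest()
    return f"_acme-challenge.{base}", b64url(digest)
```

The DNS-01 value is a hash of the key authorization rather than the key authorization itself, so the TXT record reveals neither the token nor the account key thumbprint directly.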
The leading ACME clients are Certbot (maintained by the Electronic Frontier Foundation and recommended by Let's Encrypt, ideal for standalone servers), acme.sh (shell-script based, excellent for cPanel/Plesk environments), and cert-manager (Kubernetes-native, covered below). Let's Encrypt certificates are valid for 90 days. The short lifetime encourages automation and reduces the window of exposure for compromised certificates.
cert-manager on Kubernetes
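For Kubernetes environments, cert-manager is the standard solution for automated certificate lifecycle management. It runs as a controller in your cluster, watches Certificate resources, requests certificates from Let's Encrypt (or other ACME CAs, Vault PKI, or commercial CAs), stores them as Kubernetes Secrets, and automatically renews them before expiry. By default, renewal is triggered once two-thirds of the certificate's lifetime has elapsed, which for a 90-day Let's Encrypt certificate is 30 days before expiry; the renewBefore field on the Certificate resource overrides this.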
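To make the renewal timing concrete, here is a minimal sketch of the arithmetic. It assumes cert-manager's default rule of renewing once two-thirds of the certificate lifetime has elapsed unless an explicit renew-before window is set, which gives the 30-day figure for a 90-day certificate. The function is an illustration of that rule, not cert-manager's own code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(
    not_before: datetime,
    not_after: datetime,
    renew_before: Optional[timedelta] = None,
) -> datetime:
    """Return the moment at which renewal should be attempted.

    With no explicit window, assume renewal starts when one third of
    the lifetime remains (two thirds elapsed).
    """
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3
    return not_after - renew_before


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=90)
    print("renew at:", renewal_time(issued, expires).isoformat())
```

For shorter-lived certificates the same rule scales down: a 24-hour certificate would be renewed with roughly 8 hours remaining.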
A basic cert-manager setup with Let's Encrypt involves creating a ClusterIssuer resource that defines the ACME server URL, a contact email for the ACME account, and the challenge solver configuration. Do not rely on that address for expiry warnings, since Let's Encrypt has discontinued its expiry notification emails. Any Ingress resource annotated with cert-manager.io/cluster-issuer: letsencrypt-prod and carrying a tls section automatically gets a TLS certificate provisioned and maintained by cert-manager. This is infrastructure-as-code certificate management: no manual renewals, no expiry surprises, and no shared calendar reminders cluttering engineering inboxes.
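As a rough illustration of what such a ClusterIssuer contains, the sketch below builds the manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The issuer name letsencrypt-prod comes from the annotation above; the email address, the account-key Secret name, and the nginx ingress class are placeholder assumptions to adapt to your cluster.

```python
import json

# Let's Encrypt production ACME directory URL.
LETSENCRYPT_PROD = "https://acme-v02.api.letsencrypt.org/directory"


def cluster_issuer(name: str, email: str, ingress_class: str = "nginx") -> dict:
    """Build a ClusterIssuer manifest that uses an HTTP-01 solver."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {
            "acme": {
                "server": LETSENCRYPT_PROD,
                "email": email,
                # Secret in which cert-manager stores the ACME account key
                # (the name is an assumption for this example).
                "privateKeySecretRef": {"name": f"{name}-account-key"},
                "solvers": [
                    {"http01": {"ingress": {"ingressClassName": ingress_class}}}
                ],
            }
        },
    }


def ingress_annotations(issuer_name: str) -> dict:
    """Annotation that asks cert-manager to manage TLS for an Ingress."""
    return {"cert-manager.io/cluster-issuer": issuer_name}


if __name__ == "__main__":
    manifest = cluster_issuer("letsencrypt-prod", "ops@example.com")
    print(json.dumps(manifest, indent=2))
```

Saved under a hypothetical name such as issuer.py, the output could be applied with `python issuer.py | kubectl apply -f -`. Testing against the Let's Encrypt staging directory first is a common way to avoid hitting production rate limits.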
Certificate Monitoring
Even with automation in place, you need monitoring as a safety net. Certificates in automated pipelines can still fail to renew if the ACME challenge fails (DNS propagation issues, firewall changes blocking port 80), if the cert-manager pod crashes, or if a certificate exists outside the automated pipeline on a legacy server, CDN, or load balancer configured manually. Monitor your certificate estate with:
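- Prometheus blackbox_exporter: There is no dedicated SSL module; the http and tcp probers (the latter with TLS enabled) expose the probe_ssl_earliest_cert_expiry metric for any TLS endpoint they check. Configure alerts for certificates expiring in fewer than 30 days, and alert on failed probes (probe_success equal to 0), which is how invalid chains or broken trust surface.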
- Grafana dashboard: Visualise certificate expiry dates across all monitored domains with a traffic-light colour scheme (green for more than 30 days remaining, yellow for 10–30 days, red for fewer than 10 days).
- External monitoring services: Use StatusCake or UptimeRobot to check certificate validity from external networks — this catches cases where internal monitoring misses an externally-presented certificate issue due to internal DNS resolution differences.
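For endpoints that none of these tools cover, a small script can serve as an additional check. The sketch below uses only the Python standard library to read a certificate's notAfter date and map the remaining days onto the traffic-light thresholds described above. The function names are illustrative, and the script is a complement to proper monitoring rather than a replacement for it.

```python
import socket
import ssl
import sys
from datetime import datetime, timezone
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the leaf certificate's notAfter string.

    The default context verifies the chain and hostname, so an expired
    or untrusted certificate raises ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[datetime] = None) -> float:
    """Days between now and a notAfter value like 'Jan  5 09:34:43 2018 GMT'."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def traffic_light(days: float) -> str:
    """Green above 30 days, yellow for 10 to 30 days, red below 10 days."""
    if days > 30:
        return "green"
    if days >= 10:
        return "yellow"
    return "red"


if __name__ == "__main__":
    for hostname in sys.argv[1:]:
        try:
            remaining = days_remaining(fetch_not_after(hostname))
            print(f"{hostname}: {remaining:.1f} days ({traffic_light(remaining)})")
        except (ssl.SSLError, OSError) as exc:
            # Verification failures and connection errors both need attention.
            print(f"{hostname}: check failed: {exc}")
```

Running it from a host outside your network gives the same external vantage point as the hosted services mentioned above.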
Private PKI for Internal Services
For internal services that do not need publicly trusted certificates — internal APIs, database connections, service mesh mTLS — consider operating your own internal CA using HashiCorp Vault PKI Secrets Engine or step-ca. This enables short-lived certificates (hours rather than months) for internal services, dramatically reducing the window of exposure for compromised certificates, and provides full control over your CA chain without Let's Encrypt rate limits or public DNS exposure requirements. PCCVDI Solutions deploys Vault PKI for internal mTLS in Kubernetes environments running Istio or Linkerd service meshes — cert-manager integrates directly with Vault PKI as an issuer, providing a consistent certificate management interface across both public and private certificate issuance.
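As an illustration of how short-lived issuance looks from a client's perspective, the sketch below builds a request against the Vault PKI secrets engine's issue endpoint using only the standard library. The mount path pki, the role name, the Vault address, and the token are all assumptions for the example; in a Kubernetes cluster, cert-manager would normally make this call on your behalf.

```python
import json
import urllib.request


def build_issue_request(
    vault_addr: str,
    token: str,
    role: str,
    common_name: str,
    ttl: str = "1h",
    mount: str = "pki",
) -> urllib.request.Request:
    """Prepare a POST to <vault_addr>/v1/<mount>/issue/<role>.

    The request is only constructed here, not sent. The role must
    already exist in Vault and permit the requested name and TTL.
    """
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/issue/{role}"
    body = json.dumps({"common_name": common_name, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical address, token, and role; replace with real values.
    request = build_issue_request(
        "https://vault.internal.example:8200",
        "REPLACE_WITH_TOKEN",
        "internal-services",
        "api.internal.example",
    )
    print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; on success Vault responds with JSON whose data object includes the certificate, private_key, and issuing_ca fields. A one-hour TTL as shown means a leaked key is useful to an attacker for at most an hour.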