Operations Runbook

Queue monitoring, structured logs, alerts, and backup/restore expectations for CE Pro.

Intermediate · owner · manager · developer · Updated 2026-03-21

CE Pro now ships with a more explicit operating model for queue health, structured logs, and restore readiness.

Structured Logs

High-value API and worker paths now emit structured JSON logs instead of ad hoc text-only console lines.

The most useful event families are:

  • background_job_cron.*
  • background_job_batch.*
  • background_job.retrying
  • background_job.dead_letter
  • admin_email.*
  • campaign_batch.*
  • recovery_email.*
  • health_check.degraded

These logs are designed to help you answer:

  • Did the worker run?
  • Did it claim work?
  • Is one job type failing repeatedly?
  • Are failures retrying or dead-lettering?
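
As a rough illustration of answering those questions from a log export, the sketch below counts log lines per event family. It assumes one JSON object per line and that the event name lives in a field called "event"; both are assumptions for illustration, not confirmed details of the CE Pro log schema, so adjust the field name to match the real output.

```python
import json
from collections import Counter
from typing import Iterable, Optional

# Event families listed in this runbook. A trailing ".*" matches any suffix.
FAMILIES = [
    "background_job_cron.*",
    "background_job_batch.*",
    "background_job.retrying",
    "background_job.dead_letter",
    "admin_email.*",
    "campaign_batch.*",
    "recovery_email.*",
    "health_check.degraded",
]


def family_of(event: str) -> Optional[str]:
    """Return the runbook family an event name belongs to, or None."""
    for pattern in FAMILIES:
        if pattern.endswith(".*"):
            # Drop only the "*" so the prefix keeps its trailing dot.
            if event.startswith(pattern[:-1]):
                return pattern
        elif event == pattern:
            return pattern
    return None


def count_families(lines: Iterable[str], event_field: str = "event") -> Counter:
    """Count log lines per family. `event_field` is a hypothetical field name."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any remaining text-only lines
        if not isinstance(record, dict):
            continue
        name = record.get(event_field)
        if isinstance(name, str):
            family = family_of(name)
            if family is not None:
                counts[family] += 1
    return counts
```

A rising count for background_job.retrying alongside a zero count for background_job.dead_letter suggests failures are still being retried rather than dead-lettered.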

Queue Monitoring

CE Pro still exposes the public GET /api/health and HEAD /api/health checks for uptime probes.

Internal monitors can also call GET /api/health with Authorization: Bearer $CRON_SECRET to receive a queue snapshot in the response body.

That queue snapshot includes:

  • pending
  • retrying
  • processing
  • dead_letter
  • oldest_ready_at
  • oldest_ready_age_seconds
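
A minimal sketch of an internal monitor call, using only the Python standard library. The base URL is a placeholder, and the sketch assumes the response body is JSON; the exact shape of the body (for example, where the queue snapshot sits inside it) is not specified here, so the function simply returns the parsed body.

```python
import json
import os
import urllib.request
from typing import Optional


def build_health_request(
    base_url: str, cron_secret: Optional[str] = None
) -> urllib.request.Request:
    """Build GET /api/health, adding the bearer token when a secret is given."""
    url = base_url.rstrip("/") + "/api/health"
    headers = {}
    if cron_secret:
        headers["Authorization"] = f"Bearer {cron_secret}"
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_health(
    base_url: str, cron_secret: Optional[str] = None, timeout: float = 10.0
) -> dict:
    """Call the health endpoint and parse the body (assumed to be JSON).

    urlopen raises urllib.error.HTTPError for non-2xx responses, so callers
    should decide how to treat that case for their own monitor.
    """
    request = build_health_request(base_url, cron_secret)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder host; CRON_SECRET is read from the environment.
    body = fetch_health("https://ce-pro.example.invalid", os.environ.get("CRON_SECRET"))
    print(body)
```

Omitting the secret produces the same unauthenticated request an uptime probe would send.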

Use these as the baseline alerts:

  • app health degraded for more than 2 minutes
  • queue backlog stays above normal for 10 minutes
  • oldest ready job age grows beyond 5 minutes
  • dead-letter count becomes non-zero
  • background worker cron starts failing or stops running
  • webhook delivery failures spike
  • admin auth failures spike
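
The snapshot-based alerts above could be evaluated along these lines. This is a sketch under assumptions: the snapshot is passed in as a plain mapping with the field names listed earlier, "normal" backlog is a ceiling you choose yourself, and the backlog samples are collected by your own poller. The function names are hypothetical.

```python
from typing import Any, List, Mapping, Sequence, Tuple

MAX_OLDEST_READY_AGE_SECONDS = 5 * 60   # "beyond 5 minutes"
BACKLOG_WINDOW_SECONDS = 10 * 60        # "for 10 minutes"


def snapshot_alerts(snapshot: Mapping[str, Any]) -> List[str]:
    """Return alert messages that a single queue snapshot triggers."""
    alerts: List[str] = []
    age = snapshot.get("oldest_ready_age_seconds")
    if isinstance(age, (int, float)) and age > MAX_OLDEST_READY_AGE_SECONDS:
        alerts.append(f"oldest ready job is {age:.0f}s old")
    dead = snapshot.get("dead_letter")
    if isinstance(dead, (int, float)) and dead > 0:
        alerts.append(f"dead-letter count is {dead}")
    return alerts


def backlog_sustained(
    samples: Sequence[Tuple[float, int]],
    normal_ceiling: int,
    window_seconds: float = BACKLOG_WINDOW_SECONDS,
) -> bool:
    """True when the backlog has stayed above the ceiling for the whole window.

    `samples` are (unix_time, backlog_count) pairs sorted oldest first.
    """
    if not samples:
        return False
    latest_time = samples[-1][0]
    breach_start = None
    # Walk back from the newest sample until the backlog was last normal.
    for sample_time, backlog in reversed(samples):
        if backlog <= normal_ceiling:
            break
        breach_start = sample_time
    return breach_start is not None and latest_time - breach_start >= window_seconds
```

Requiring an unbroken run of high samples keeps a brief spike from paging anyone, which matches the "stays above normal" wording.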

Backup And Restore

The production app runs on Supabase-backed Postgres, so the expected backup source of truth is the managed Supabase production project.

That expectation still needs to be verified in the live Supabase dashboard. Until that verification is done, treat restore readiness as a tracked operations task, not an assumption.

Minimum standard:

  1. Automated backups enabled.
  2. Retention window documented.
  3. Restore owner named.
  4. Quarterly restore drill recorded.
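
One way to keep that minimum standard visible as a tracked task is a small readiness check like the sketch below. The record type and its field names are hypothetical, and "quarterly" is approximated as 92 days; neither comes from the internal runbooks.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RestoreReadiness:
    """Hypothetical record mirroring the four minimum-standard items."""
    automated_backups_enabled: bool = False
    retention_window_days: Optional[int] = None
    restore_owner: Optional[str] = None
    last_restore_drill: Optional[date] = None


def readiness_gaps(
    record: RestoreReadiness,
    today: date,
    max_drill_age: timedelta = timedelta(days=92),  # assumed length of a quarter
) -> List[str]:
    """List which minimum-standard items are still unmet."""
    gaps: List[str] = []
    if not record.automated_backups_enabled:
        gaps.append("automated backups not confirmed enabled")
    if record.retention_window_days is None:
        gaps.append("retention window not documented")
    if not record.restore_owner:
        gaps.append("restore owner not named")
    drill = record.last_restore_drill
    if drill is None or today - drill > max_drill_age:
        gaps.append("no restore drill recorded in the last quarter")
    return gaps
```

An empty list means all four items are recorded; it does not by itself prove a restore would succeed.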

Internal Runbooks

The repo also carries internal runbooks for the operating team:

  • docs/incident-monitoring-runbook.md
  • docs/backup-and-restore-runbook.md

Use those documents for the concrete response checklist, ownership fields, and restore-drill procedure.

Stress Validation

Before bigger launches, pricing changes, marketing pushes, or other traffic-shaping releases, run the repeatable stress drill documented in Stress Testing.

That drill exists so that regressions in the queue, health, and admin hot paths surface before customers run into them.

If the triggering event is a hosted Supabase project move or a region migration, run the cutover checklist and post-switch validation from Supabase Region Migration before you call the environment healthy.
