Operations Runbook

Queue monitoring, structured logs, alerts, and backup/restore expectations for CE Pro.

Intermediate | owner, manager, developer | Updated 2026-03-21

CE Pro now ships with a more explicit operating model for queue health, structured logs, and restore readiness.

Structured Logs

High-value API and worker paths now emit structured JSON logs instead of ad hoc text-only console lines.

The most useful event families are:

background_job_cron.*
background_job_batch.*
background_job.retrying
background_job.dead_letter
admin_email.*
campaign_batch.*
recovery_email.*
health_check.degraded

These logs are designed to help you answer:

Did the worker run?
Did it claim work?
Is one job type failing repeatedly?
Are failures retrying or dead-lettering?
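As a rough illustration of how these questions might be answered from captured log output, the sketch below counts events by name and splits retrying versus dead-lettered failures per job type. It assumes each log line is a JSON object with an "event" field and a "job_type" field; those field names, the "background_job_cron.started" event suffix, and the "campaign_email" job type are hypothetical and should be checked against the real log schema.

```python
import json
from collections import Counter

FAILURE_EVENTS = ("background_job.retrying", "background_job.dead_letter")


def parse_records(lines):
    """Yield JSON objects from log lines, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text console line, not a structured log
        if isinstance(record, dict):
            yield record


def summarize_events(lines):
    """Count log records by event name ("event" is an assumed field name)."""
    counts = Counter()
    for record in parse_records(lines):
        event = record.get("event")
        if isinstance(event, str):
            counts[event] += 1
    return counts


def failure_breakdown(lines):
    """Group retrying and dead-letter events per job type ("job_type" is assumed)."""
    breakdown = {}
    for record in parse_records(lines):
        event = record.get("event")
        if event not in FAILURE_EVENTS:
            continue
        job_type = record.get("job_type", "unknown")
        breakdown.setdefault(job_type, Counter())[event] += 1
    return breakdown


# Hypothetical sample lines; real payloads will carry more fields.
sample_lines = [
    '{"event": "background_job_cron.started"}',
    '{"event": "background_job.retrying", "job_type": "campaign_email"}',
    '{"event": "background_job.retrying", "job_type": "campaign_email"}',
    '{"event": "background_job.dead_letter", "job_type": "campaign_email"}',
    "plain text line that is not JSON",
]
event_counts = summarize_events(sample_lines)
failures = failure_breakdown(sample_lines)
```

A missing background_job_cron.* count suggests the worker did not run, while a job type with many retrying events and a growing dead_letter count points at a repeatedly failing job.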

Queue Monitoring

CE Pro still exposes the public GET /api/health and HEAD /api/health checks for uptime probes.

Internal monitors can also call GET /api/health with Authorization: Bearer $CRON_SECRET to receive a queue snapshot in the response body.

That queue snapshot includes:

pending
retrying
processing
dead_letter
oldest_ready_at
oldest_ready_age_seconds
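A minimal sketch of an internal monitor call, using only the Python standard library. The base URL and the helper names are hypothetical, and the code assumes the response body is JSON; where the queue snapshot sits inside that body is not specified here, so the sketch returns the whole parsed body rather than guessing a key.

```python
import json
import urllib.request


def build_health_request(base_url: str, cron_secret: str) -> urllib.request.Request:
    """Build an authenticated GET /api/health request (helper name is hypothetical)."""
    url = base_url.rstrip("/") + "/api/health"
    return urllib.request.Request(
        url,
        method="GET",
        headers={"Authorization": f"Bearer {cron_secret}"},
    )


def fetch_health(base_url: str, cron_secret: str, timeout: float = 10.0) -> dict:
    """Call the health endpoint and return the parsed body (assumed to be JSON).

    urlopen raises urllib.error.HTTPError on non-2xx responses, so a caller
    should decide how to treat that case for its own alerting.
    """
    request = build_health_request(base_url, cron_secret)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

In practice the secret would be read from the environment, for example os.environ["CRON_SECRET"], rather than hard-coded in the monitor.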

Recommended Alerts

Use these as the baseline alerts:

app health degraded for more than 2 minutes
queue backlog stays above your normal baseline for 10 minutes
oldest ready job age grows beyond 5 minutes
dead-letter count becomes non-zero
background worker cron starts failing or stops running
webhook delivery failures spike
admin auth failures spike
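The queue-related alerts above can be sketched as a single check over one queue snapshot. This is an illustration under assumptions: the snapshot is treated as a mapping with numeric values, backlog is taken to be pending plus retrying, and backlog_limit is a hypothetical number you would derive from your own baseline. The "for 10 minutes" and "for more than 2 minutes" conditions need repeated samples and are left to the monitoring tool.

```python
from typing import Any, List, Mapping

# Five minutes, matching the oldest-ready-job alert in the list above.
OLDEST_READY_AGE_LIMIT_SECONDS = 5 * 60


def evaluate_queue_snapshot(snapshot: Mapping[str, Any], backlog_limit: int) -> List[str]:
    """Return the names of the alert conditions that a single snapshot trips."""
    alerts = []

    age = snapshot.get("oldest_ready_age_seconds")
    if age is not None and age > OLDEST_READY_AGE_LIMIT_SECONDS:
        alerts.append("oldest_ready_age")

    if (snapshot.get("dead_letter") or 0) > 0:
        alerts.append("dead_letter")

    # Assumption: backlog means jobs waiting to run, i.e. pending plus retrying.
    backlog = (snapshot.get("pending") or 0) + (snapshot.get("retrying") or 0)
    if backlog > backlog_limit:
        alerts.append("backlog")

    return alerts


# Hypothetical snapshot values for illustration only.
example_alerts = evaluate_queue_snapshot(
    {"pending": 40, "retrying": 15, "processing": 2, "dead_letter": 1,
     "oldest_ready_age_seconds": 420},
    backlog_limit=50,
)
```

Keeping the check as a pure function over one snapshot makes it easy to run on every poll and let the alerting system decide how long a condition must persist before paging.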

Backup And Restore

The production app runs on Supabase-backed Postgres, so the expected backup source of truth is the managed Supabase production project.

That expectation still needs to be verified in the live Supabase dashboard. Until that verification is done, treat restore readiness as a tracked operations task, not an assumption.

Minimum standard:

Automated backups enabled.
Retention window documented.
Restore owner named.
Quarterly restore drill recorded.
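One way to keep the minimum standard from drifting is to track it as a small record and list the gaps. The sketch below is illustrative only: the record shape and names are hypothetical, and "quarterly" is approximated as 92 days, which is an assumption rather than a documented policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RestoreReadiness:
    """Hypothetical tracking record for the four minimum-standard items."""
    automated_backups_enabled: bool
    retention_days: Optional[int]        # None means the window is not documented
    restore_owner: Optional[str]         # None means no owner has been named
    last_restore_drill: Optional[date]   # None means no drill has been recorded


def readiness_gaps(
    record: RestoreReadiness,
    today: date,
    max_drill_age: timedelta = timedelta(days=92),  # assumed meaning of "quarterly"
) -> List[str]:
    """Return the minimum-standard items that are not currently met."""
    gaps = []
    if not record.automated_backups_enabled:
        gaps.append("automated backups not enabled")
    if record.retention_days is None:
        gaps.append("retention window not documented")
    if not record.restore_owner:
        gaps.append("restore owner not named")
    if record.last_restore_drill is None:
        gaps.append("no restore drill recorded")
    elif today - record.last_restore_drill > max_drill_age:
        gaps.append("restore drill overdue")
    return gaps
```

An empty result means all four items are met on the given date; anything else is a candidate for the tracked operations task described above.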

Internal Runbooks

The repo also carries internal runbooks for the operating team:

docs/incident-monitoring-runbook.md
docs/backup-and-restore-runbook.md

Use those documents for the concrete response checklist, ownership fields, and restore-drill procedure.

Stress Validation

Before bigger launches, pricing changes, marketing pushes, or other traffic-shaping releases, run the repeatable stress drill documented in Stress Testing.

That drill exists so queue, health, and admin hot-path regressions show up before customers do.

If the triggering event is a hosted Supabase project move or a region migration, run the cutover checklist and post-switch validation from Supabase Region Migration before you call the environment healthy.
