Platform Reliability
Health checks, session refresh, baseline admin API protection, and crash recovery for CE Pro.
CE Pro now includes a shared baseline for session refresh, admin API protection, health checks, and app-level crash recovery.
These platform controls are not a replacement for route-level auth, validation, or product-specific permissions. They exist so the app behaves more predictably as traffic grows.
Production Telemetry Baseline
Phase 0 scaling work now starts with production telemetry, not synthetic load tests.
Before CE Pro changes database access patterns, adds projections, provisions replicas, or raises infrastructure spend, the team must capture a production baseline that shows:
- the top 10 slow app routes by p95 and p99
- the top 20 slow SQL statements from pg_stat_statements
- the current Supabase connection snapshot, including total, active, idle, idle-in-transaction, waiting, and max_connections
The app collects sampled browser navigation and same-origin API timing events in production. Sampling defaults to 5% on public/non-admin routes, while authenticated admin pages default to 20% so workhorse-route regressions show up faster. Override non-admin sampling with NEXT_PUBLIC_ROUTE_TELEMETRY_SAMPLE_RATE, override admin sampling with NEXT_PUBLIC_ROUTE_TELEMETRY_ADMIN_SAMPLE_RATE, and disable client-side collection with NEXT_PUBLIC_ROUTE_TELEMETRY_ENABLED=false. The ingestion endpoint can also be shut off server-side with ROUTE_TELEMETRY_INGEST_DISABLED=true. Telemetry ingest uses the in-memory limiter so collecting route timings does not spend another database RPC before writing the actual sample.
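The sampling decision described above can be sketched as follows. The env flag names and default rates match the ones documented here; the helper names are hypothetical, not the actual CE Pro module:

```typescript
// Illustrative sketch of the route-telemetry sampling decision.
// Env flag names are the documented ones; function names are invented.
function telemetrySampleRate(
  isAdminRoute: boolean,
  env: Record<string, string | undefined>
): number {
  // Global client-side kill switch.
  if (env.NEXT_PUBLIC_ROUTE_TELEMETRY_ENABLED === "false") return 0;
  const raw = isAdminRoute
    ? env.NEXT_PUBLIC_ROUTE_TELEMETRY_ADMIN_SAMPLE_RATE
    : env.NEXT_PUBLIC_ROUTE_TELEMETRY_SAMPLE_RATE;
  // Documented defaults: 20% on admin pages, 5% on public/non-admin routes.
  const fallback = isAdminRoute ? 0.2 : 0.05;
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 && parsed <= 1 ? parsed : fallback;
}

function shouldSample(rate: number, roll: number = Math.random()): boolean {
  return roll < rate;
}
```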
Telemetry writes land in route_performance_events. The export command reads service-role-only RPCs for route timing, pg_stat_statements, and connection usage:
npm run telemetry:production -- --days=7 --out=reports/production-telemetry.md

The Phase 0 stop condition is strict: if production telemetry shows the current SLOs are already met at the current measured load, halt Phases 1-6 and re-evaluate quarterly instead of making speculative architecture changes.
The business-load target is also a hard prerequisite before optimization work beyond telemetry: the business owner must write down N simultaneous authenticated active sessions by date D, justified by the launch, customer, or usage reason. Without that number, scale work has no anchor.
Cost gates stay explicit. Any phase that would push monthly infrastructure spend above $250 needs Clean Estimate Pro business-owner approval before it starts. A higher ceiling is only unlocked when current MRR supports it: $500/month after $10k MRR, $1,000/month after $25k MRR, or a written launch-risk exception from the business owner.
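The cost gates reduce to a small decision rule. The following is an illustrative sketch of those documented thresholds, not the real approval workflow; the function names are invented:

```typescript
// Hypothetical encoding of the documented cost gates.
// The approval-free ceiling scales with current MRR.
function monthlySpendCeiling(mrr: number): number {
  if (mrr >= 25_000) return 1_000; // $1,000/month after $25k MRR
  if (mrr >= 10_000) return 500;   // $500/month after $10k MRR
  return 250;                      // baseline ceiling
}

function needsOwnerApproval(
  proposedMonthlySpend: number,
  mrr: number,
  hasLaunchRiskException = false
): boolean {
  // A written launch-risk exception from the business owner bypasses the gate.
  if (hasLaunchRiskException) return false;
  return proposedMonthlySpend > monthlySpendCeiling(mrr);
}
```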
Auth and Worker Backpressure
When Supabase Auth or PostgREST starts returning gateway timeouts, the sign-in screen now fails visibly instead of leaving the user stuck on "Signing in...". Password sign-in waits up to 15 seconds and then shows a retryable auth-service message.
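The visible-failure behavior amounts to racing the auth call against a bounded timer. A minimal sketch with illustrative names (the real sign-in path wires this to the Supabase auth client):

```typescript
// Race any promise against a timeout; reject with a retryable message on expiry.
async function withTimeout<T>(work: Promise<T>, ms: number, message: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // no dangling timer after the race settles
  }
}
```

Sign-in would then call something like `withTimeout(signInWithPassword(credentials), 15_000, "auth-service-unavailable")` and render the retryable message when the promise rejects.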
The scheduled worker routes also include incident controls so database pressure can be reduced without changing user-facing app code:
- DISABLE_CRON_JOBS=true pauses the full scheduled-work layer after each route authenticates and before it opens a Supabase connection.
- DISABLE_SCHEDULED_WORK=true is an equivalent global scheduler pause switch for incidents where the word "cron" is too narrow.
- DISABLE_BACKGROUND_JOB_CRON=true pauses only /api/cron/process-background-jobs.
- DISABLE_WEBHOOK_DELIVERY_CRON=true pauses only /api/cron/process-webhook-deliveries.
- DISABLE_FOLLOW_UP_CRON=true pauses only /api/cron/process-follow-ups.
- DISABLE_DRIP_CAMPAIGN_CRON=true pauses only /api/cron/process-drip-campaigns.
- DISABLE_EXPIRE_PROPOSALS_CRON=true pauses only /api/cron/expire-proposals.
- DISABLE_AUTOMATION_WAITING_CRON=true pauses only /api/automations/engine/process-waiting.
- DISABLE_AUTOMATION_SCHEDULED_TRIGGER_CRON=true pauses only /api/automations/engine/process-scheduled-triggers.
- DISABLE_FRANCHISE_SUMMARY_CRON=true pauses only /api/cron/refresh-franchise-summary.
- BACKGROUND_JOB_CRON_BATCH_SIZE and BACKGROUND_JOB_CRON_CONCURRENCY can raise or lower the background worker from its conservative defaults of 5 jobs and 1 concurrent dispatch.
- WEBHOOK_DELIVERY_CRON_BATCH_SIZE can raise or lower webhook delivery claiming from its conservative default of 10 deliveries.
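A cron route guard honoring these switches might look like the following sketch. The flag names are the documented ones; the helper name and call shape are hypothetical:

```typescript
// Illustrative pause check for a scheduled-work route.
// Global switches win; a per-route flag can then pause just that route.
function scheduledWorkPaused(
  env: Record<string, string | undefined>,
  routeFlag?: string
): boolean {
  if (env.DISABLE_CRON_JOBS === "true") return true;      // global pause
  if (env.DISABLE_SCHEDULED_WORK === "true") return true; // equivalent global pause
  if (routeFlag && env[routeFlag] === "true") return true; // per-route pause
  return false;
}
```

A route such as /api/cron/process-follow-ups would check `scheduledWorkPaused(process.env, "DISABLE_FOLLOW_UP_CRON")` after authenticating and before opening a Supabase connection.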
The public ingress rate-limit RPC is also lock-bounded. If one hot key is already being updated, the limiter fails closed quickly instead of holding PostgREST workers on a row lock until the platform times out the request.
If Supabase warns that the project is about to deplete or has depleted its Disk IO Budget, set the global scheduler pause first, redeploy so every route has the guard, avoid super-admin/dashboard refreshes, then either wait for the next daily IO budget replenishment or temporarily upgrade compute while the team captures pg_stat_statements and fixes the disk-heavy queries. Compute add-ons require the Supabase organization to be on a paid plan; free-tier projects must be upgraded to a paid Supabase plan before the emergency compute add-on can be applied.
Super-admin pages are dynamic-only so protected platform dashboards never block a Vercel build on live Supabase reads.
Admin Alert Backpressure
The admin alert bell is a read path. It should not regenerate managed alerts or write to ai_alerts on every poll. Managed alert generation is reserved for the dashboard alert scope, is throttled by ADMIN_ALERT_SYNC_TTL_MS with a five-minute default, and skips unchanged rows so realtime subscriptions do not create read/write feedback loops.
If the admin shell feels slow, check telemetry_route_performance_summary for /api/admin/alerts and telemetry_pg_stat_statements_top for ai_alerts updates. A healthy alert path should show reads and occasional inserts or updates, not continuous update traffic while the user is idle.
Admin Navigation Read Pressure
The admin shell keeps ordinary page navigation on the lightest available read path.
Current controls:
- Middleware and server org-context resolution skip the franchise descendant lookup when the selected active org is already a direct user membership.
- The sidebar gamification badge is read-only during normal page loads and no longer seeds catalogs, upserts reward rows, recalculates profiles, or records daily-login XP from the shared layout.
- Admin API middleware rate limiting uses the in-memory limiter for /api/admin/* traffic so ordinary schedule, client, and dashboard reads do not spend a database RPC before the route handler authenticates and authorizes the request.
- The Clients list avoids an unused exact count on the standalone contact query and narrows client-health work to the visible client ids when possible.
- The Clients page shell no longer blocks server navigation on the heavy account tree API. The page opens with a loading state and hydrates the account/client rows through the existing client-side endpoint.
- The Estimates page shell follows the same pattern. It no longer waits on server-side calls back into /api/admin/estimates and /api/admin/team; the browser hydrates the first list and team filters after the shell is interactive.
- The Estimates list API keeps exact visible totals and uses a short-lived cached summary for type/source metadata so the workbench does not show approximate counts.
- Dashboard first-load metrics keep exact visible totals, use narrower reads, and cache the expensive chart RPC for five minutes per workspace/month range.
- Schedule reference data for crews, team members, trucks, vehicles, services, and crew-pay policies is cached briefly per workspace so repeat visits do not re-run the same setup reads.
If most admin pages take multiple seconds to open, first check telemetry_pg_stat_statements_top for get_org_and_descendants, check_public_rate_limit, user_gamification_profile, rewards, and get_client_health. Those should not climb quickly during light single-user navigation.
Report And Dashboard Accuracy
Admin dashboards and analytics reports must not use approximate counts, sampled data, hidden row caps, or whole-dollar rounding for user-facing business totals.
Current accuracy controls:
- Revenue Analytics loads accepted residential, fleet, commercial building, and holiday lights revenue in paginated batches so it does not silently stop at the platform row limit.
- Service mix, sales-cycle, geography, retention, operations, schedule, lead-source ROI, holiday-lights, franchise, and margin reports page through their report inputs instead of relying on the default PostgREST window.
- Owner, residential, sales, commercial, fleet, holiday-lights, commissions, crew-pay, client, franchise, and pipeline dashboard totals use exact source rows or exact database counts for visible KPIs.
- Schedule calendar metrics now page through the selected range of jobs and sales follow-up events before calculating job counts, scheduled revenue, and average ticket.
- User-facing money totals display to the cent across admin dashboards, reports, job details, schedule views, pipeline cards, commissions, crew pay, client LTV/open pipeline, revenue charts, and exports.
- Report money is summed as integer cents where precision matters before values are converted back for display/export.
- If a report source query fails, the route should fail visibly instead of showing a partial total or fake zero as if it were complete.
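The integer-cents rule above avoids classic floating-point drift. A minimal illustration (names are invented):

```typescript
// Sum money as integer cents, converting back only at the display boundary.
function toCents(dollars: number): number {
  return Math.round(dollars * 100);
}

function sumMoney(amounts: number[]): string {
  const cents = amounts.reduce((total, amount) => total + toCents(amount), 0);
  return (cents / 100).toFixed(2); // display to the cent
}
```

Summing `0.1 + 0.2` directly in floating point yields `0.30000000000000004`; the cents path stays exact.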
Forecasts and estimates are still allowed where the workflow is explicitly predictive, such as pipeline forecast scenarios or modeled margin cost. Those surfaces must be labeled as forecasts or estimated costs and must not be presented as actual revenue, payment, or performance history.
Root Session Proxy
Protected app surfaces now pass through a shared root session proxy entrypoint instead of relying on scattered session refresh logic.
That middleware covers:
- Admin routes
- Super-admin routes
- Onboarding routes
- Logged-in customer routes
- Selected public customer-link routes that already use signed tokens
The goal is simple: keep authenticated browser sessions fresher and reduce the "randomly got sent back to login" class of failure during normal admin use.
For admin workspace resolution, the request-scoped org hint is now treated as a fast-path hint rather than trusted authority. The app still validates that the hinted active org is actually reachable from the selected membership before it reuses that context.
The same root session proxy also refreshes browser cookies on mixed-auth API surfaces that can be reached from staff workflows, including:
- /api/email
- /api/sms
- /api/estimates/*
- /api/holiday-lights/*
Those routes still enforce their own route-level auth or customer-token rules. The proxy only refreshes an existing browser session when one is present.
Inbound webhook routes are still intentionally excluded from the root session proxy because they authenticate with provider signatures, scoped tokens, or internal secrets instead of browser cookies.
The proxy now also strips the internal request-scoped org hint headers from all inbound traffic before forwarding the request deeper into the app. Those headers are only re-applied after the admin session is resolved inside the trusted proxy path, which closes the spoofing gap where an external caller could try to inject active-org context directly.
Admin API Baseline Rate Limiting
All /api/admin/* routes now inherit a baseline middleware rate limit of 60 requests per minute before any route-specific limits apply.
Important notes:
- This is a shared floor, not a replacement for tighter limits on sensitive endpoints.
- Existing route-level auth and permission checks still matter.
- The middleware limiter uses in-memory protection for admin API traffic to keep protected office reads from paying an extra database RPC. Public and token-based surfaces still use the shared database-backed limiter unless their route opts into a different policy.
- The middleware now buckets admin API traffic by authenticated user when one is available, and falls back to an IP-scoped key only when the caller has no usable session. That avoids one fast admin user starving everyone else behind the same office network.
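A minimal fixed-window version of that limiter, keyed by user with an IP fallback, could look like the following sketch (illustrative only; the production limiter's internals may differ):

```typescript
// In-memory fixed-window rate limiter sketch.
type Window = { count: number; resetAt: number };
const windows = new Map<string, Window>();

// Bucket by authenticated user when available, else by caller IP.
function adminRateLimitKey(userId: string | null, ip: string): string {
  return userId ? `user:${userId}` : `ip:${ip}`;
}

function allowRequest(
  key: string,
  limit = 60,          // documented baseline: 60 requests/minute
  windowMs = 60_000,
  now = Date.now()
): boolean {
  const current = windows.get(key);
  if (!current || now >= current.resetAt) {
    windows.set(key, { count: 1, resetAt: now + windowMs });
    return true;
  }
  current.count += 1;
  return current.count <= limit;
}
```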
Health Endpoint
CE Pro now exposes:
GET /api/health
HEAD /api/health
Use this endpoint for uptime monitors, platform health probes, and deployment checks.
The check verifies:
- the app can boot with the required runtime environment
- Supabase is reachable from the app runtime
Responses are intentionally cheap and uncached.
The health path now also uses bounded timeouts for both the dependency check and the optional internal queue snapshot. The Supabase reachability query carries its own abort signal, so a slow dependency check is cancelled at the timeout boundary instead of leaving an upstream request running after the endpoint has already degraded. If Supabase is slow or the queue snapshot cannot complete quickly, the endpoint degrades fast instead of hanging until the platform kills the request.
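The bounded-timeout pattern can be sketched with an AbortController. Here `check` stands in for the Supabase reachability probe; names are illustrative:

```typescript
// Run a dependency probe under a hard timeout; report degraded on any failure.
async function boundedCheck(
  check: (signal: AbortSignal) => Promise<void>,
  timeoutMs: number
): Promise<"ok" | "degraded"> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await check(controller.signal);
    return "ok";
  } catch {
    // Degrade fast instead of hanging until the platform kills the request.
    return "degraded";
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the signal down to the dependency query is what cancels the upstream request at the timeout boundary, rather than merely abandoning it.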
Healthy Response
{
  "status": "ok",
  "checked_at": "2026-03-21T15:00:00.000Z",
  "services": {
    "app": "ok",
    "supabase": "ok"
  }
}

Degraded Response
If the app can respond but a required dependency check fails, CE Pro returns 503 with a degraded status.
Startup Environment Validation
The shared Supabase bootstrap path now validates its required environment variables from one place instead of silently failing later in request handling.
That baseline validation currently covers the critical Supabase runtime config:
- NEXT_PUBLIC_SUPABASE_URL
- NEXT_PUBLIC_SUPABASE_ANON_KEY
- SUPABASE_SERVICE_ROLE_KEY
If one of those values is missing or malformed, the app now fails loudly instead of limping into production with partial behavior.
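A fail-loud validation pass over those three variables might look like this sketch (the function name and exact checks are illustrative):

```typescript
// Validate the critical Supabase runtime config in one place at startup.
const REQUIRED_SUPABASE_ENV = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "SUPABASE_SERVICE_ROLE_KEY",
] as const;

function validateSupabaseEnv(env: Record<string, string | undefined>): void {
  const missing = REQUIRED_SUPABASE_ENV.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    // Fail loudly instead of limping into production with partial behavior.
    throw new Error(`Missing required Supabase config: ${missing.join(", ")}`);
  }
  if (!/^https?:\/\//.test(env.NEXT_PUBLIC_SUPABASE_URL!)) {
    throw new Error("NEXT_PUBLIC_SUPABASE_URL must be an absolute http(s) URL");
  }
}
```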
Global Error Recovery
CE Pro now includes app-wide error boundaries for route-level and full-app crashes.
That means a render failure should now fall back to a recovery screen with a retry action instead of dropping the user onto a blank white page with no path forward.
Route-specific error boundaries can still exist where a workflow needs a more tailored recovery state.
The shared fallback also routes admins, logged-in customers, and public visitors back to the right surface instead of always sending everyone to the admin dashboard.
Background Delivery Jobs
CE Pro now moves more slow outbound delivery work off the request path and onto the shared background jobs queue.
That queue now handles:
- one-time campaign fan-out batches
- manual admin inbox email sends
- commercial fleet and building proposal sends, including PDF generation, email, SMS, and delivery-state updates
- forgot-password recovery emails
- fresh team setup-link emails
- self-serve account password reset emails from Settings
The main effect is that the app can acknowledge the request faster while the minute-by-minute job worker handles the actual send work in the background.
Queued email-style delivery work now also gets retry headroom instead of acting like a one-shot fire-and-forget handoff.
That means:
- forgot-password sends retry on transient provider failures
- setup-link and account-reset sends retry instead of silently dead-lettering after one blip
- queued admin inbox emails retry before they are marked failed
What Users See
- Marketing sends now queue batches instead of holding the browser open while every recipient is processed inline.
- Manual admin emails show as queued instead of pretending delivery already finished.
- Commercial proposal sends now show as queued while the background worker renders the PDF, sends email/SMS, and records delivery state. Set COMMERCIAL_PROPOSAL_SEND_QUEUE_DISABLED=true to roll this path back to synchronous delivery.
- Password recovery and setup-link requests now rely on the background worker instead of waiting on synchronous provider delivery inside the request.
Hot-Read Server Caching
Phase 2 also adds short-lived server-side caching to a few high-traffic office reads so common dashboard and setup screens stop re-running the same database work on every load.
The current cache coverage includes:
- admin dashboard metrics
- services catalog reads
- add-on structure reads
- lead source reads and lead-source lookup helpers
- pricing config reads
- tag summary reads
These caches are intentionally short and org-scoped.
Important notes:
- Admin reference-data writes now invalidate their matching cache tags.
- Cached responses still keep the existing browser cache headers where those were already part of the route contract.
- Dashboard data currently relies on TTL-based refresh rather than write-trigger invalidation, which keeps the implementation safe while still cutting read pressure.
- Lead-source admin reads also now avoid selecting a non-existent updated_at column on workspaces where that field is absent, which closes the stress-test 500 that was keeping the lead-source cache path from helping at all.
Shared Org Cache
The short-lived server caches now use a shared org-cache abstraction with standard bucket types instead of one-off inline cache definitions.
That makes it easier to review:
- cache scope
- TTL intent
- org isolation
- invalidation tags
The current buckets in use are:
- reference
- dashboard
- settings
RSC-First Admin Lists
The Jobs and Invoices workspaces now load their first page on the server and hydrate into client-side query state instead of waiting for a mount-time waterfall before anything useful appears.
That means:
- first paint is more useful on those pages
- the first page is already present when the client boots
- follow-up pagination and refreshes reuse the same query path
Query-Backed Pagination
The Jobs and Invoices admin workspaces also moved off the old fixed 200 row cap and onto explicit paginated list APIs.
That reduces overfetching and gives those pages a clearer path to larger-office datasets.
Important note:
- Some list-level summary cards still intentionally describe the loaded slice, not the full matching dataset, unless the page explicitly labels otherwise.
- Jobs and Invoices search and status filtering now run through the paginated query path too, so records do not appear to vanish just because they live beyond the first loaded page.
- Jobs and Invoices now sanitize free-text search terms before they are interpolated into the PostgREST filter DSL, which avoids malformed search expressions on punctuation-heavy queries.
- Jobs and Invoices now also have dedicated (org_id, api_environment, created_at DESC) indexes for the primary admin list path, so those pages benefit from the same scaling work instead of relying on less-specific secondary indexes.
- Jobs and Invoices now also treat out-of-range page offsets as empty pages instead of surfacing a server error when a workspace does not have enough rows to satisfy the requested offset yet.
- That out-of-range protection now matches the real PostgREST response shape too, including the 416 Range Not Satisfiable case where the HTTP status is returned alongside the error payload instead of inside it.
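The search-sanitization point deserves a concrete sketch. PostgREST's filter DSL treats commas, parentheses, and dots as syntax, and `%`/`_` are LIKE wildcards, so punctuation-heavy input must be neutralized before interpolation. The column names below are hypothetical:

```typescript
// Neutralize free-text search input before building a PostgREST `or` filter.
function sanitizeSearchTerm(raw: string): string {
  return raw
    .replace(/[%_]/g, (ch) => `\\${ch}`) // escape LIKE wildcards
    .replace(/[,().]/g, " ")             // strip filter-DSL punctuation
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical filter builder over invented column names.
function buildOrFilter(term: string): string {
  const safe = sanitizeSearchTerm(term);
  return `title.ilike.*${safe}*,invoice_number.ilike.*${safe}*`;
}
```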
Super-Admin Browser QA
The repo now includes a read-only super-admin portal browser QA script:
npm run qa:super-admin -- --base-url=https://cleanestimate.pro --magic-link

Run it with SUPER_ADMIN_EMAIL, NEXT_PUBLIC_SUPABASE_URL, and SUPABASE_SERVICE_ROLE_KEY available in the environment. The script verifies that signed-out users are redirected away from /super-admin, signs in through a one-time Supabase magic link for the configured super-admin identity, opens the Dashboard, Organizations, organization detail, Users, Waitlist, and Activity surfaces, and records screenshots plus report.json under output/playwright/super-admin-portal-qa/.
The super-admin Users and Waitlist tables now use deterministic date rendering during hydration so browser QA does not surface React text-mismatch errors on those operational pages. The production CSP also allows the Google Analytics collect endpoint currently used by the deployed tag, which prevents analytics CSP noise from hiding real super-admin portal failures.
Signup & Onboarding Browser QA
The repo now also includes a live self-serve signup and onboarding browser QA script:
npm run qa:signup -- --base-url=https://cleanestimate.pro

Run it with AGENTMAIL_API_KEY, NEXT_PUBLIC_SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, and INTERNAL_API_SECRET available in the environment. The script creates a fresh AgentMail inbox, signs up a new owner and organization through /signup, completes the Company Info, Services, Pricing, and Get Started onboarding steps, verifies the new owner can reach /admin, checks the database state for the owner/org/membership/trial, confirms password login from a clean browser session, then sends and opens a real owner access-link email delivered to AgentMail.
Evidence is written under output/playwright/signup-onboarding-qa/ with full-page screenshots and report.json.
The admin dashboard's Lead Sources shortcut now points at the live /admin/analytics/lead-source-roi report instead of prefetching a missing analytics route, which keeps browser QA console output focused on actual signup and auth failures.
Invoice Number Allocation
Invoice creation now allocates invoice numbers through one shared atomic counter path across:
- admin single-invoice creation
- admin bulk invoice creation
- API v1 invoice creation
That removes the older count-then-insert race where concurrent creates could reach the same invoice number under load.
The allocator also seeds itself from any already-existing invoice numbers for the same org and date, so rollout does not break on workspaces that created invoices before the counter-backed path existed.
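An in-memory sketch of the counter-backed allocation follows. Production performs the read-modify-write as one atomic database step; the names and number format here are illustrative:

```typescript
// Per-(org, date) counter sketch. In production this is a single atomic
// database operation, not an in-process Map.
const counters = new Map<string, number>();

// Seed from pre-existing invoice numbers so rollout does not collide on
// workspaces that created invoices before the counter-backed path existed.
function seedCounter(key: string, existingNumbers: number[]): void {
  const highest = existingNumbers.length ? Math.max(...existingNumbers) : 0;
  if (!counters.has(key) || counters.get(key)! < highest) {
    counters.set(key, highest);
  }
}

function nextInvoiceNumber(orgId: string, date: string, existing: number[] = []): string {
  const key = `${orgId}:${date}`;
  seedCounter(key, existing);
  const next = (counters.get(key) ?? 0) + 1;
  counters.set(key, next); // atomic in the real path; naive here
  return `INV-${date}-${String(next).padStart(4, "0")}`;
}
```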
Phase 4 Audit Follow-Up
The post-Phase-4 review also shipped a few correctness fixes that are easy to miss if you only look at the higher-level roadmap:
- lead-source reference reads now respect the includeAll=false active-only filter again
- server-hydrated Jobs and Invoices data now marks the initial React Query payload as fresh instead of immediately re-fetching on mount
- the background-job cron now claims a smaller bounded batch per run so slow workers are less likely to hit the serverless timeout and leave claimed jobs waiting for stale-lock recovery
Structured Logging And Queue Health
The background job processor and internal delivery workers now emit structured log events instead of free-form console output.
Those logs make it easier to separate:
- healthy empty worker runs
- claimed batch runs
- retrying jobs
- dead-letter failures
The structured logger now also avoids writing full recovery-email and admin-email recipient addresses into production logs. Those logs keep enough masked recipient detail for debugging without leaving plaintext inbox addresses in the log stream.
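A masking helper in that spirit might look like the following; the exact production format may differ:

```typescript
// Keep enough recipient detail for debugging without logging the full address.
function maskEmail(address: string): string {
  const at = address.indexOf("@");
  if (at <= 0) return "***"; // not a usable address; mask entirely
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  return `${local[0]}***@${domain}`;
}
```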
The health endpoint also supports an internal queue snapshot when called with the same bearer secret used by the cron worker.
Use the dedicated Operations Runbook for alert thresholds, queue checks, and restore-readiness expectations.
Use the dedicated Stress Testing guide for the repeatable 1K-user readiness drill, default session mix, cookie-pool setup, and pass/fail thresholds.
If you are moving CE Pro to a new hosted Supabase project or region, use the dedicated Supabase Region Migration guide before switching traffic.
Queue Correctness Follow-Up
The background job dedupe layer now only blocks duplicate work while a matching job is still in an active queue state:
- pending
- processing
- retrying
Completed, failed, and dead-letter jobs no longer permanently reserve the same dedupe key, which means legitimate follow-up queueing can happen again after the earlier work has fully finished.
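The active-state dedupe rule can be sketched directly. The state names match the list above; the function and job shape (including the `dead_letter` spelling) are illustrative:

```typescript
// Only a still-active job with the same dedupe key blocks new work.
const ACTIVE_JOB_STATES = new Set(["pending", "processing", "retrying"]);

function dedupeBlocks(
  existingJobs: { dedupeKey: string; status: string }[],
  key: string
): boolean {
  // Completed, failed, and dead-letter jobs release the key.
  return existingJobs.some(
    (job) => job.dedupeKey === key && ACTIVE_JOB_STATES.has(job.status)
  );
}
```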
Campaign dispatch stat reconciliation also now runs through a database-side sync step instead of an app-side read-then-write cycle. That removes the race where multiple finishing workers could overwrite each other's campaign status or message totals.
The campaign batch worker now isolates failures at the recipient level instead of aborting the whole batch on the first provider error. A single transient email failure no longer prevents the rest of the batch from being attempted, and the stats sync step still runs after the batch work completes.
The background-job processor also now reports jobs that got stuck before they could be finalized as a separate stuck_processing outcome instead of counting them as dead-letter work they never actually reached.
Recovery Email Burst Protection
Forgot-password requests now rate-limit by normalized email address instead of only by source IP.
That means:
- rotating IPs or VPN exits no longer bypass the per-address recovery limit
- rapid duplicate forgot-password requests collapse onto one queued recovery job instead of stacking multiple identical email sends
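Keying the limit by normalized address is simple in spirit; a hedged sketch with invented names:

```typescript
// Normalize the address so casing and stray whitespace map to one limit bucket.
function normalizeRecoveryEmail(raw: string): string {
  return raw.trim().toLowerCase();
}

function recoveryLimitKey(email: string): string {
  return `recovery:${normalizeRecoveryEmail(email)}`;
}
```

Because the key is derived from the address rather than the source IP, rotating VPN exits all land in the same bucket.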
Permission Hardening Follow-Up
The Phase 3 admin-route hardening pass now keeps read and destructive permissions separated on the detail routes that were still inconsistent.
The practical outcomes are:
- estimate delete operations now require estimates.manage instead of slipping through the same top-level gate used for estimate reads
- lead detail reads now use the view-side client permission instead of the stronger write tuple used for lead mutation routes
- invoice attachment reads and writes now follow billing permissions instead of estimate permissions, so billing-only roles can manage invoice files without inheriting estimate access and estimate-only roles do not gain invoice attachment visibility by accident
That keeps destructive behavior aligned with the route intent while avoiding unnecessary lead-detail lockouts for read-capable office roles.
Log Redaction Follow-Up
The Phase 5 structured logging pass now masks email addresses on the enqueue-failure paths too, not just in the background workers that actually deliver mail.
That means recovery-email, setup-link, account-reset, and admin-email queue failures no longer write plaintext recipient addresses into structured logs while operators are debugging a queue outage.
API V1 CORS Cache Safety
The API v1 response helper now only emits Vary: Origin when it also emitted an Access-Control-Allow-Origin header.
That makes cached server-to-server responses less likely to interfere with later browser requests on installations that put an edge cache in front of the external API surface.
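The conditional-Vary rule is small but easy to get wrong; a sketch (header plumbing illustrative):

```typescript
// Emit Vary: Origin only when an Access-Control-Allow-Origin header is also emitted.
function corsHeaders(
  origin: string | null,
  allowedOrigins: Set<string>
): Record<string, string> {
  const headers: Record<string, string> = {};
  if (origin && allowedOrigins.has(origin)) {
    headers["Access-Control-Allow-Origin"] = origin;
    headers["Vary"] = "Origin"; // only alongside the ACAO header
  }
  return headers;
}
```

Server-to-server callers typically send no Origin header, so their cached responses carry neither header and cannot poison later browser requests at an edge cache.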