Platform Reliability
Health checks, session refresh, baseline admin API protection, and crash recovery for CE Pro.
CE Pro now includes a shared baseline for session refresh, admin API protection, health checks, and app-level crash recovery.
These platform controls are not a replacement for route-level auth, validation, or product-specific permissions. They exist so the app behaves more predictably as traffic grows.
Production Telemetry Baseline
Phase 0 scaling work now starts with production telemetry, not synthetic load tests.
Before CE Pro changes database access patterns, adds projections, provisions replicas, or raises infrastructure spend, the team must capture a production baseline that shows:
- the top 10 slow app routes by p95 and p99
- the top 20 slow SQL statements from pg_stat_statements
- the current Supabase connection snapshot, including total, active, idle, idle-in-transaction, waiting, and max_connections
The app collects sampled browser navigation and same-origin API timing events in production. Sampling defaults to 5% on public/non-admin routes, while authenticated admin pages default to 20% so workhorse-route regressions show up faster. Override non-admin sampling with NEXT_PUBLIC_ROUTE_TELEMETRY_SAMPLE_RATE, override admin sampling with NEXT_PUBLIC_ROUTE_TELEMETRY_ADMIN_SAMPLE_RATE, and disable client-side collection with NEXT_PUBLIC_ROUTE_TELEMETRY_ENABLED=false. The ingestion endpoint can also be shut off server-side with ROUTE_TELEMETRY_INGEST_DISABLED=true. Telemetry ingest uses the in-memory limiter so collecting route timings does not spend another database RPC before writing the actual sample.
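The sampling decision described above can be sketched as follows. The env flag names and default rates match the ones documented here; the helper names are hypothetical, not the actual CE Pro module:

```typescript
// Illustrative sketch of the route-telemetry sampling decision.
// Env flag names are the documented ones; function names are invented.
function telemetrySampleRate(
  isAdminRoute: boolean,
  env: Record<string, string | undefined>
): number {
  // Global client-side kill switch.
  if (env.NEXT_PUBLIC_ROUTE_TELEMETRY_ENABLED === "false") return 0;
  const raw = isAdminRoute
    ? env.NEXT_PUBLIC_ROUTE_TELEMETRY_ADMIN_SAMPLE_RATE
    : env.NEXT_PUBLIC_ROUTE_TELEMETRY_SAMPLE_RATE;
  // Documented defaults: 20% on admin pages, 5% on public/non-admin routes.
  const fallback = isAdminRoute ? 0.2 : 0.05;
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 && parsed <= 1 ? parsed : fallback;
}

function shouldSample(rate: number, roll: number = Math.random()): boolean {
  return roll < rate;
}
```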
Telemetry writes land in route_performance_events. The export command reads service-role-only RPCs for route timing, pg_stat_statements, and connection usage:
npm run telemetry:production -- --days=7 --out=reports/production-telemetry.md

The Phase 0 stop condition is strict: if production telemetry shows the current SLOs are already met at the current measured load, halt Phases 1-6 and re-evaluate quarterly instead of making speculative architecture changes.
The business-load target is also a hard prerequisite before optimization work beyond telemetry: the business owner must write down N simultaneous authenticated active sessions by date D, justified by the launch, customer, or usage reason. Without that number, scale work has no anchor.
Cost gates stay explicit. Any phase that would push monthly infrastructure spend above $250 needs Clean Estimate Pro business-owner approval before it starts. A higher ceiling is only unlocked when current MRR supports it: $500/month after $10k MRR, $1,000/month after $25k MRR, or a written launch-risk exception from the business owner.
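The cost gates reduce to a small decision rule. The following is an illustrative sketch of those documented thresholds, not the real approval workflow; the function names are invented:

```typescript
// Hypothetical encoding of the documented cost gates.
// The approval-free ceiling scales with current MRR.
function monthlySpendCeiling(mrr: number): number {
  if (mrr >= 25_000) return 1_000; // $1,000/month after $25k MRR
  if (mrr >= 10_000) return 500;   // $500/month after $10k MRR
  return 250;                      // baseline ceiling
}

function needsOwnerApproval(
  proposedMonthlySpend: number,
  mrr: number,
  hasLaunchRiskException = false
): boolean {
  // A written launch-risk exception from the business owner bypasses the gate.
  if (hasLaunchRiskException) return false;
  return proposedMonthlySpend > monthlySpendCeiling(mrr);
}
```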
Auth and Worker Backpressure
When Supabase Auth or PostgREST starts returning gateway timeouts, the sign-in screen now fails visibly instead of leaving the user stuck on "Signing in...". Password sign-in waits up to 15 seconds and then shows a retryable auth-service message.
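The visible-failure behavior amounts to racing the auth call against a bounded timer. A minimal sketch with illustrative names (the real sign-in path wires this to the Supabase auth client):

```typescript
// Race any promise against a timeout; reject with a retryable message on expiry.
async function withTimeout<T>(work: Promise<T>, ms: number, message: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // no dangling timer after the race settles
  }
}
```

Sign-in would then call something like `withTimeout(signInWithPassword(credentials), 15_000, "auth-service-unavailable")` and render the retryable message when the promise rejects.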
The scheduled worker routes also include incident controls so database pressure can be reduced without changing user-facing app code:
- DISABLE_CRON_JOBS=true pauses the full scheduled-work layer after each route authenticates and before it opens a Supabase connection.
- DISABLE_SCHEDULED_WORK=true is an equivalent global scheduler pause switch for incidents where the word "cron" is too narrow.
- DISABLE_BACKGROUND_JOB_CRON=true pauses only /api/cron/process-background-jobs.
- DISABLE_WEBHOOK_DELIVERY_CRON=true pauses only /api/cron/process-webhook-deliveries.
- DISABLE_FOLLOW_UP_CRON=true pauses only /api/cron/process-follow-ups.
- DISABLE_DRIP_CAMPAIGN_CRON=true pauses only /api/cron/process-drip-campaigns.
- DISABLE_EXPIRE_PROPOSALS_CRON=true pauses only /api/cron/expire-proposals.
- DISABLE_AUTOMATION_WAITING_CRON=true pauses only /api/automations/engine/process-waiting.
- DISABLE_AUTOMATION_SCHEDULED_TRIGGER_CRON=true pauses only /api/automations/engine/process-scheduled-triggers.
- DISABLE_FRANCHISE_SUMMARY_CRON=true pauses only /api/cron/refresh-franchise-summary.
- BACKGROUND_JOB_CRON_BATCH_SIZE and BACKGROUND_JOB_CRON_CONCURRENCY can raise or lower the background worker from its conservative defaults of 5 jobs and 1 concurrent dispatch.
- WEBHOOK_DELIVERY_CRON_BATCH_SIZE can raise or lower webhook delivery claiming from its conservative default of 10 deliveries.
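A cron route guard honoring these switches might look like the following sketch. The flag names are the documented ones; the helper name and call shape are hypothetical:

```typescript
// Illustrative pause check for a scheduled-work route.
// Global switches win; a per-route flag can then pause just that route.
function scheduledWorkPaused(
  env: Record<string, string | undefined>,
  routeFlag?: string
): boolean {
  if (env.DISABLE_CRON_JOBS === "true") return true;      // global pause
  if (env.DISABLE_SCHEDULED_WORK === "true") return true; // equivalent global pause
  if (routeFlag && env[routeFlag] === "true") return true; // per-route pause
  return false;
}
```

A route such as /api/cron/process-follow-ups would check `scheduledWorkPaused(process.env, "DISABLE_FOLLOW_UP_CRON")` after authenticating and before opening a Supabase connection.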
The public ingress rate-limit RPC is also lock-bounded. If one hot key is already being updated, the limiter fails closed quickly instead of holding PostgREST workers on a row lock until the platform times out the request.
If Supabase warns that the project is about to deplete or has depleted its Disk IO Budget, set the global scheduler pause first, redeploy so every route has the guard, avoid super-admin/dashboard refreshes, then either wait for the next daily IO budget replenishment or temporarily upgrade compute while the team captures pg_stat_statements and fixes the disk-heavy queries. Compute add-ons require the Supabase organization to be on a paid plan; free-tier projects must be upgraded to a paid Supabase plan before the emergency compute add-on can be applied.
Super-admin pages are dynamic-only so protected platform dashboards never block a Vercel build on live Supabase reads.
Admin Alert Backpressure
The admin alert bell is a read path. It should not regenerate managed alerts or write to ai_alerts on every poll. Managed alert generation is reserved for the dashboard alert scope, is throttled by ADMIN_ALERT_SYNC_TTL_MS with a five-minute default, and skips unchanged rows so realtime subscriptions do not create read/write feedback loops.
If the admin shell feels slow, check telemetry_route_performance_summary for /api/admin/alerts and telemetry_pg_stat_statements_top for ai_alerts updates. A healthy alert path should show reads and occasional inserts or updates, not continuous update traffic while the user is idle.
Admin Navigation Read Pressure
The admin shell keeps ordinary page navigation on the lightest available read path.
Current controls:
- Middleware and server org-context resolution skip the franchise descendant lookup when the selected active org is already a direct user membership.
- The sidebar gamification badge is read-only during normal page loads and no longer seeds catalogs, upserts reward rows, recalculates profiles, or records daily-login XP from the shared layout.
- Admin API middleware rate limiting uses the in-memory limiter for /api/admin/* traffic so ordinary schedule, client, and dashboard reads do not spend a database RPC before the route handler authenticates and authorizes the request.
- The Clients list avoids an unused exact count on the standalone contact query and narrows client-health work to the visible client ids when possible.
- The Clients page shell no longer blocks server navigation on the heavy account tree API. The page opens with a loading state and hydrates the account/client rows through the existing client-side endpoint.
- The Estimates page shell follows the same pattern. It no longer waits on server-side calls back into /api/admin/estimates and /api/admin/team; the browser hydrates the first list and team filters after the shell is interactive.
- The Estimates list API keeps exact visible totals and uses a short-lived cached summary for type/source metadata so the workbench does not show approximate counts.
- Dashboard first-load metrics keep exact visible totals, use narrower reads, and cache the expensive chart RPC for five minutes per workspace/month range.
- Schedule reference data for crews, team members, trucks, vehicles, services, and crew-pay policies is cached briefly per workspace so repeat visits do not re-run the same setup reads.
If most admin pages take multiple seconds to open, first check telemetry_pg_stat_statements_top for get_org_and_descendants, check_public_rate_limit, user_gamification_profile, rewards, and get_client_health. Those should not climb quickly during light single-user navigation.
Report And Dashboard Accuracy
Admin dashboards and analytics reports must not use approximate counts, sampled data, hidden row caps, or whole-dollar rounding for user-facing business totals.
Current accuracy controls:
- Revenue Analytics loads accepted residential, fleet, commercial building, and holiday lights revenue in paginated batches so it does not silently stop at the platform row limit.
- Service mix, sales-cycle, geography, retention, operations, schedule, lead-source ROI, holiday-lights, franchise, and margin reports page through their report inputs instead of relying on the default PostgREST window.
- Owner, residential, sales, commercial, fleet, holiday-lights, commissions, crew-pay, client, franchise, and pipeline dashboard totals use exact source rows or exact database counts for visible KPIs.
- Schedule calendar metrics now page through the selected range of jobs and sales follow-up events before calculating job counts, scheduled revenue, and average ticket.
- User-facing money totals display to the cent across admin dashboards, reports, job details, schedule views, pipeline cards, commissions, crew pay, client LTV/open pipeline, revenue charts, and exports.
- Report money is summed as integer cents where precision matters before values are converted back for display/export.
- If a report source query fails, the route should fail visibly instead of showing a partial total or fake zero as if it were complete.
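The integer-cents rule above avoids classic floating-point drift. A minimal illustration (names are invented):

```typescript
// Sum money as integer cents, converting back only at the display boundary.
function toCents(dollars: number): number {
  return Math.round(dollars * 100);
}

function sumMoney(amounts: number[]): string {
  const cents = amounts.reduce((total, amount) => total + toCents(amount), 0);
  return (cents / 100).toFixed(2); // display to the cent
}
```

Summing `0.1 + 0.2` directly in floating point yields `0.30000000000000004`; the cents path stays exact.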
Forecasts and estimates are still allowed where the workflow is explicitly predictive, such as pipeline forecast scenarios or modeled margin cost. Those surfaces must be labeled as forecasts or estimated costs and must not be presented as actual revenue, payment, or performance history.
Root Session Proxy
Protected app surfaces now pass through a shared root session proxy entrypoint instead of relying on scattered session refresh logic.
That middleware covers:
- Admin routes
- Super-admin routes
- Onboarding routes
- Logged-in customer routes
- Selected public customer-link routes that already use signed tokens
The goal is simple: keep authenticated browser sessions fresher and reduce the "randomly got sent back to login" class of failure during normal admin use.
For admin workspace resolution, the request-scoped org hint is now treated as a fast-path hint rather than trusted authority. The app still validates that the hinted active org is actually reachable from the selected membership before it reuses that context.
The same root session proxy also refreshes browser cookies on mixed-auth API surfaces that can be reached from staff workflows, including:
- /api/email
- /api/sms
- /api/estimates/*
- /api/holiday-lights/*
Those routes still enforce their own route-level auth or customer-token rules. The proxy only refreshes an existing browser session when one is present.
Inbound webhook routes are still intentionally excluded from the root session proxy because they authenticate with provider signatures, scoped tokens, or internal secrets instead of browser cookies.
The proxy now also strips the internal request-scoped org hint headers from all inbound traffic before forwarding the request deeper into the app. Those headers are only re-applied after the admin session is resolved inside the trusted proxy path, which closes the spoofing gap where an external caller could try to inject active-org context directly.
Admin API Baseline Rate Limiting
All /api/admin/* routes now inherit a baseline middleware rate limit of 60 requests per minute before any route-specific limits apply.
Important notes:
- This is a shared floor, not a replacement for tighter limits on sensitive endpoints.
- Existing route-level auth and permission checks still matter.
- The middleware limiter uses in-memory protection for admin API traffic to keep protected office reads from paying an extra database RPC. Public and token-based surfaces still use the shared database-backed limiter unless their route opts into a different policy.
- The middleware now buckets admin API traffic by authenticated user when one is available, and falls back to an IP-scoped key only when the caller has no usable session. That avoids one fast admin user starving everyone else behind the same office network.
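A minimal fixed-window version of that limiter, keyed by user with an IP fallback, could look like the following sketch (illustrative only; the production limiter's internals may differ):

```typescript
// In-memory fixed-window rate limiter sketch.
type Window = { count: number; resetAt: number };
const windows = new Map<string, Window>();

// Bucket by authenticated user when available, else by caller IP.
function adminRateLimitKey(userId: string | null, ip: string): string {
  return userId ? `user:${userId}` : `ip:${ip}`;
}

function allowRequest(
  key: string,
  limit = 60,          // documented baseline: 60 requests/minute
  windowMs = 60_000,
  now = Date.now()
): boolean {
  const current = windows.get(key);
  if (!current || now >= current.resetAt) {
    windows.set(key, { count: 1, resetAt: now + windowMs });
    return true;
  }
  current.count += 1;
  return current.count <= limit;
}
```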
Health Endpoint
CE Pro now exposes:
GET /api/health
HEAD /api/health
Use this endpoint for uptime monitors, platform health probes, and deployment checks.
The check verifies:
- the app can boot with the required runtime environment
- Supabase is reachable from the app runtime
Responses are intentionally cheap and uncached.
The health path now also uses bounded timeouts for both the dependency check and the optional internal queue snapshot. The Supabase reachability query carries its own abort signal, so a slow dependency check is cancelled at the timeout boundary instead of leaving an upstream request running after the endpoint has already degraded. If Supabase is slow or the queue snapshot cannot complete quickly, the endpoint degrades fast instead of hanging until the platform kills the request.
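The bounded-timeout pattern can be sketched with an AbortController. Here `check` stands in for the Supabase reachability probe; names are illustrative:

```typescript
// Run a dependency probe under a hard timeout; report degraded on any failure.
async function boundedCheck(
  check: (signal: AbortSignal) => Promise<void>,
  timeoutMs: number
): Promise<"ok" | "degraded"> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await check(controller.signal);
    return "ok";
  } catch {
    // Degrade fast instead of hanging until the platform kills the request.
    return "degraded";
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the signal down to the dependency query is what cancels the upstream request at the timeout boundary, rather than merely abandoning it.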
Healthy Response
{
  "status": "ok",
  "checked_at": "2026-03-21T15:00:00.000Z",
  "services": {
    "app": "ok",
    "supabase": "ok"
  }
}

Degraded Response
If the app can respond but a required dependency check fails, CE Pro returns 503 with a degraded status.
Startup Environment Validation
The shared Supabase bootstrap path now validates its required environment variables from one place instead of silently failing later in request handling.
That baseline validation currently covers the critical Supabase runtime config:
- NEXT_PUBLIC_SUPABASE_URL
- NEXT_PUBLIC_SUPABASE_ANON_KEY
- SUPABASE_SERVICE_ROLE_KEY
If one of those values is missing or malformed, the app now fails loudly instead of limping into production with partial behavior.
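A fail-loud validation pass over those three variables might look like this sketch (the function name and exact checks are illustrative):

```typescript
// Validate the critical Supabase runtime config in one place at startup.
const REQUIRED_SUPABASE_ENV = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "SUPABASE_SERVICE_ROLE_KEY",
] as const;

function validateSupabaseEnv(env: Record<string, string | undefined>): void {
  const missing = REQUIRED_SUPABASE_ENV.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    // Fail loudly instead of limping into production with partial behavior.
    throw new Error(`Missing required Supabase config: ${missing.join(", ")}`);
  }
  if (!/^https?:\/\//.test(env.NEXT_PUBLIC_SUPABASE_URL!)) {
    throw new Error("NEXT_PUBLIC_SUPABASE_URL must be an absolute http(s) URL");
  }
}
```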
Global Error Recovery
CE Pro now includes app-wide error boundaries for route-level and full-app crashes.
That means a render failure should now fall back to a recovery screen with a retry action instead of dropping the user onto a blank white page with no path forward.
Route-specific error boundaries can still exist where a workflow needs a more tailored recovery state.
The shared fallback also routes admins, logged-in customers, and public visitors back to the right surface instead of always sending everyone to the admin dashboard.
Background Delivery Jobs
CE Pro now moves more slow outbound delivery work off the request path and onto the shared background jobs queue.
That queue now handles:
- one-time campaign fan-out batches
- manual admin inbox email sends
- commercial fleet and building proposal sends, including PDF generation, email, SMS, and delivery-state updates
- forgot-password recovery emails
- fresh team setup-link emails
- self-serve account password reset emails from Settings
The main effect is that the app can acknowledge the request faster while the minute-by-minute job worker handles the actual send work in the background.
Queued email-style delivery work now also gets retry headroom instead of acting like a one-shot fire-and-forget handoff.
That means:
- forgot-password sends retry on transient provider failures
- setup-link and account-reset sends retry instead of silently dead-lettering after one blip
- queued admin inbox emails retry before they are marked failed
What Users See
- Marketing sends now queue batches instead of holding the browser open while every recipient is processed inline.
- Manual admin emails show as queued instead of pretending delivery already finished.
- Commercial proposal sends now show as queued while the background worker renders the PDF, sends email/SMS, and records delivery state. Set COMMERCIAL_PROPOSAL_SEND_QUEUE_DISABLED=true to roll this path back to synchronous delivery.
- Password recovery and setup-link requests now rely on the background worker instead of waiting on synchronous provider delivery inside the request.
Hot-Read Server Caching
Phase 2 also adds short-lived server-side caching to a few high-traffic office reads so common dashboard and setup screens stop re-running the same database work on every load.
The current cache coverage includes:
- admin dashboard metrics
- services catalog reads
- add-on structure reads
- lead source reads and lead-source lookup helpers
- pricing config reads
- tag summary reads
These caches are intentionally short and org-scoped.
Important notes:
- Admin reference-data writes now invalidate their matching cache tags.
- Cached responses still keep the existing browser cache headers where those were already part of the route contract.
- Dashboard data currently relies on TTL-based refresh rather than write-trigger invalidation, which keeps the implementation safe while still cutting read pressure.
- Lead-source admin reads also now avoid selecting a non-existent updated_at column on workspaces where that field is absent, which closes the stress-test 500 that was keeping the lead-source cache path from helping at all.
Shared Org Cache
The short-lived server caches now use a shared org-cache abstraction with standard bucket types instead of one-off inline cache definitions.
That makes it easier to review:
- cache scope
- TTL intent
- org isolation
- invalidation tags
The current buckets in use are:
- reference
- dashboard
- settings
RSC-First Admin Lists
The Jobs and Invoices workspaces now load their first page on the server and hydrate into client-side query state instead of waiting for a mount-time waterfall before anything useful appears.
That means:
- first paint is more useful on those pages
- the first page is already present when the client boots
- follow-up pagination and refreshes reuse the same query path
Query-Backed Pagination
The Jobs and Invoices admin workspaces also moved off the old fixed 200 row cap and onto explicit paginated list APIs.
That reduces overfetching and gives those pages a clearer path to larger-office datasets.
Important note:
- Some list-level summary cards still intentionally describe the loaded slice, not the full matching dataset, unless the page explicitly labels otherwise.
- Jobs and Invoices search and status filtering now run through the paginated query path too, so records do not appear to vanish just because they live beyond the first loaded page.
- Jobs and Invoices now sanitize free-text search terms before they are interpolated into the PostgREST filter DSL, which avoids malformed search expressions on punctuation-heavy queries.
- Jobs and Invoices now also have dedicated (org_id, api_environment, created_at DESC) indexes for the primary admin list path, so those pages benefit from the same scaling work instead of relying on less-specific secondary indexes.
- Jobs and Invoices now also treat out-of-range page offsets as empty pages instead of surfacing a server error when a workspace does not have enough rows to satisfy the requested offset yet.
- That out-of-range protection now matches the real PostgREST response shape too, including the 416 Range Not Satisfiable case where the HTTP status is returned alongside the error payload instead of inside it.
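The search-sanitization point deserves a concrete sketch. PostgREST's filter DSL treats commas, parentheses, and dots as syntax, and `%`/`_` are LIKE wildcards, so punctuation-heavy input must be neutralized before interpolation. The column names below are hypothetical:

```typescript
// Neutralize free-text search input before building a PostgREST `or` filter.
function sanitizeSearchTerm(raw: string): string {
  return raw
    .replace(/[%_]/g, (ch) => `\\${ch}`) // escape LIKE wildcards
    .replace(/[,().]/g, " ")             // strip filter-DSL punctuation
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical filter builder over invented column names.
function buildOrFilter(term: string): string {
  const safe = sanitizeSearchTerm(term);
  return `title.ilike.*${safe}*,invoice_number.ilike.*${safe}*`;
}
```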
Super-Admin Browser QA
The repo now includes a read-only super-admin portal browser QA script:
npm run qa:super-admin -- --base-url=https://cleanestimate.pro --magic-link

Run it with SUPER_ADMIN_EMAIL, NEXT_PUBLIC_SUPABASE_URL, and SUPABASE_SERVICE_ROLE_KEY available in the environment. The script verifies that signed-out users are redirected away from /super-admin, signs in through a one-time Supabase magic link for the configured super-admin identity, opens the Dashboard, Organizations, organization detail, Users, Waitlist, and Activity surfaces, and records screenshots plus report.json under output/playwright/super-admin-portal-qa/.
The super-admin Users and Waitlist tables now use deterministic date rendering during hydration so browser QA does not surface React text-mismatch errors on those operational pages. The production CSP also allows the Google Analytics collect endpoint currently used by the deployed tag, which prevents analytics CSP noise from hiding real super-admin portal failures.
Signup & Onboarding Browser QA
The repo now also includes a live self-serve signup and onboarding browser QA script:
npm run qa:signup -- --base-url=https://cleanestimate.pro

Run it with AGENTMAIL_API_KEY, NEXT_PUBLIC_SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, and INTERNAL_API_SECRET available in the environment. The script creates a fresh AgentMail inbox, signs up a new owner and organization through /signup, completes the Company Info, Services, Pricing, and Get Started onboarding steps, verifies the new owner can reach /admin, checks the database state for the owner/org/membership/trial, confirms password login from a clean browser session, then sends and opens a real owner access-link email delivered to AgentMail.
Evidence is written under output/playwright/signup-onboarding-qa/ with full-page screenshots and report.json.
The admin dashboard's Lead Sources shortcut now points at the live /admin/analytics/lead-source-roi report instead of prefetching a missing analytics route, which keeps browser QA console output focused on actual signup and auth failures.
Invoice Number Allocation
Invoice creation now allocates invoice numbers through one shared atomic counter path across:
- admin single-invoice creation
- admin bulk invoice creation
- API v1 invoice creation
That removes the older count-then-insert race where concurrent creates could reach the same invoice number under load.
The allocator also seeds itself from any already-existing invoice numbers for the same org and date, so rollout does not break on workspaces that created invoices before the counter-backed path existed.
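An in-memory sketch of the counter-backed allocation follows. Production performs the read-modify-write as one atomic database step; the names and number format here are illustrative:

```typescript
// Per-(org, date) counter sketch. In production this is a single atomic
// database operation, not an in-process Map.
const counters = new Map<string, number>();

// Seed from pre-existing invoice numbers so rollout does not collide on
// workspaces that created invoices before the counter-backed path existed.
function seedCounter(key: string, existingNumbers: number[]): void {
  const highest = existingNumbers.length ? Math.max(...existingNumbers) : 0;
  if (!counters.has(key) || counters.get(key)! < highest) {
    counters.set(key, highest);
  }
}

function nextInvoiceNumber(orgId: string, date: string, existing: number[] = []): string {
  const key = `${orgId}:${date}`;
  seedCounter(key, existing);
  const next = (counters.get(key) ?? 0) + 1;
  counters.set(key, next); // atomic in the real path; naive here
  return `INV-${date}-${String(next).padStart(4, "0")}`;
}
```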
Phase 4 Audit Follow-Up
The post-Phase-4 review also shipped a few correctness fixes that are easy to miss if you only look at the higher-level roadmap:
- lead-source reference reads now respect the includeAll=false active-only filter again
- server-hydrated Jobs and Invoices data now marks the initial React Query payload as fresh instead of immediately re-fetching on mount
- the background-job cron now claims a smaller bounded batch per run so slow workers are less likely to hit the serverless timeout and leave claimed jobs waiting for stale-lock recovery
Structured Logging And Queue Health
The background job processor and internal delivery workers now emit structured log events instead of free-form console output.
Those logs make it easier to separate:
- healthy empty worker runs
- claimed batch runs
- retrying jobs
- dead-letter failures
The structured logger now also avoids writing full recovery-email and admin-email recipient addresses into production logs. Those logs keep enough masked recipient detail for debugging without leaving plaintext inbox addresses in the log stream.
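A masking helper in that spirit might look like the following; the exact production format may differ:

```typescript
// Keep enough recipient detail for debugging without logging the full address.
function maskEmail(address: string): string {
  const at = address.indexOf("@");
  if (at <= 0) return "***"; // not a usable address; mask entirely
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  return `${local[0]}***@${domain}`;
}
```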
The health endpoint also supports an internal queue snapshot when called with the same bearer secret used by the cron worker.
Use the dedicated Operations Runbook for alert thresholds, queue checks, and restore-readiness expectations.
Use the dedicated Stress Testing guide for the repeatable 1K-user readiness drill, default session mix, cookie-pool setup, and pass/fail thresholds.
If you are moving CE Pro to a new hosted Supabase project or region, use the dedicated Supabase Region Migration guide before switching traffic.
Queue Correctness Follow-Up
The background job dedupe layer now only blocks duplicate work while a matching job is still in an active queue state:
- pending
- processing
- retrying
Completed, failed, and dead-letter jobs no longer permanently reserve the same dedupe key, which means legitimate follow-up queueing can happen again after the earlier work has fully finished.
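The active-state dedupe rule can be sketched directly. The state names match the list above; the function and job shape (including the `dead_letter` spelling) are illustrative:

```typescript
// Only a still-active job with the same dedupe key blocks new work.
const ACTIVE_JOB_STATES = new Set(["pending", "processing", "retrying"]);

function dedupeBlocks(
  existingJobs: { dedupeKey: string; status: string }[],
  key: string
): boolean {
  // Completed, failed, and dead-letter jobs release the key.
  return existingJobs.some(
    (job) => job.dedupeKey === key && ACTIVE_JOB_STATES.has(job.status)
  );
}
```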
Campaign dispatch stat reconciliation also now runs through a database-side sync step instead of an app-side read-then-write cycle. That removes the race where multiple finishing workers could overwrite each other's campaign status or message totals.
The campaign batch worker now isolates failures at the recipient level instead of aborting the whole batch on the first provider error. A single transient email failure no longer prevents the rest of the batch from being attempted, and the stats sync step still runs after the batch work completes.
The background-job processor also now reports jobs that got stuck before they could be finalized as a separate stuck_processing outcome instead of counting them as dead-letter work they never actually reached.
Recovery Email Burst Protection
Forgot-password requests now rate-limit by normalized email address instead of only by source IP.
That means:
- rotating IPs or VPN exits no longer bypass the per-address recovery limit
- rapid duplicate forgot-password requests collapse onto one queued recovery job instead of stacking multiple identical email sends
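Keying the limit by normalized address is simple in spirit; a hedged sketch with invented names:

```typescript
// Normalize the address so casing and stray whitespace map to one limit bucket.
function normalizeRecoveryEmail(raw: string): string {
  return raw.trim().toLowerCase();
}

function recoveryLimitKey(email: string): string {
  return `recovery:${normalizeRecoveryEmail(email)}`;
}
```

Because the key is derived from the address rather than the source IP, rotating VPN exits all land in the same bucket.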
Permission Hardening Follow-Up
The Phase 3 admin-route hardening pass now keeps read and destructive permissions separated on the detail routes that were still inconsistent.
The practical outcomes are:
- estimate delete operations now require estimates.manage instead of slipping through the same top-level gate used for estimate reads
- lead detail reads now use the view-side client permission instead of the stronger write tuple used for lead mutation routes
- invoice attachment reads and writes now follow billing permissions instead of estimate permissions, so billing-only roles can manage invoice files without inheriting estimate access and estimate-only roles do not gain invoice attachment visibility by accident
That keeps destructive behavior aligned with the route intent while avoiding unnecessary lead-detail lockouts for read-capable office roles.
Log Redaction Follow-Up
The Phase 5 structured logging pass now masks email addresses on the enqueue-failure paths too, not just in the background workers that actually deliver mail.
That means recovery-email, setup-link, account-reset, and admin-email queue failures no longer write plaintext recipient addresses into structured logs while operators are debugging a queue outage.
API V1 CORS Cache Safety
The API v1 response helper now only emits Vary: Origin when it also emitted an Access-Control-Allow-Origin header.
That makes cached server-to-server responses less likely to interfere with later browser requests on installations that put an edge cache in front of the external API surface.
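The conditional-Vary rule is small but easy to get wrong; a sketch (header plumbing illustrative):

```typescript
// Emit Vary: Origin only when an Access-Control-Allow-Origin header is also emitted.
function corsHeaders(
  origin: string | null,
  allowedOrigins: Set<string>
): Record<string, string> {
  const headers: Record<string, string> = {};
  if (origin && allowedOrigins.has(origin)) {
    headers["Access-Control-Allow-Origin"] = origin;
    headers["Vary"] = "Origin"; // only alongside the ACAO header
  }
  return headers;
}
```

Server-to-server callers typically send no Origin header, so their cached responses carry neither header and cannot poison later browser requests at an edge cache.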