Stress Testing

Run the repeatable 1K-user readiness drill against CE Pro before launches and scale changes.

CE Pro now ships with a repeatable stress-test harness for the office and platform paths most likely to matter as traffic grows.

The goal is not to win a benchmark. The goal is to answer a much more useful question:

Can this deployment handle a realistic jump toward 1,000 users without falling apart on the hot admin, health, and queue paths?

Telemetry Gate Before Stress Runs

Production telemetry comes first. Stress tests are now a validation tool to run after the team knows which production routes matter, not the first source of truth for Phase 1 work.

Before ranking scale fixes, collect 3-7 days of production traffic and export the Phase 0 baseline:

npm run telemetry:production -- --days=7 --out=reports/production-telemetry.md

Use that report to identify the top slow routes, run EXPLAIN ANALYZE on the top slow SQL statements, and classify each fix as a missing index, N+1 pattern, exact count, bad join, or larger architecture issue.
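
The EXPLAIN step is plain Postgres, so a minimal sketch of it looks like this, assuming direct psql access to the Supabase database; SUPABASE_DB_URL, the table, and the filter are placeholders for whatever the telemetry report actually surfaces:

# Paste one of the report's slow statements in place of the placeholder query.
# A sequential scan over a large filtered table usually lands in the missing-index bucket.
psql "$SUPABASE_DB_URL" -c \
  "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM jobs WHERE org_id = 'replace-me' ORDER BY created_at DESC LIMIT 25 OFFSET 0;"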

If the report shows SLOs are already met at the current measured load, halt speculative Phases 1-6 and re-evaluate quarterly. If it shows real hot paths, run focused stress tests against those routes after the cheap fixes or queueing changes land.

What The Harness Exercises

The default route mix focuses on:

  • GET /admin
  • GET /admin/jobs
  • GET /admin/invoices
  • GET /api/admin/jobs?limit=25&offset=*
  • GET /api/admin/invoices?limit=25&offset=*
  • GET /api/admin/services
  • GET /api/admin/lead-sources
  • GET /api/admin/tags
  • HEAD /api/health
  • optional GET /api/health with the cron bearer secret for queue visibility

That mix intentionally leans on the exact paths touched in the scale-hardening work (a quick manual spot-check of the paginated reads follows this list):

  • paginated Jobs and Invoices reads
  • short-lived org-scoped cache reads
  • server-first admin page renders
  • health and queue visibility
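
Outside the harness, that spot-check looks roughly like this; the cookie value is a placeholder, and the rotating offsets just mimic the offset=* pattern in the route mix:

# ADMIN_COOKIE holds one line from the admin cookie file described later in this article.
for offset in 0 25 50; do
  curl -s -o /dev/null -w "offset=$offset -> %{http_code} in %{time_total}s\n" \
    -H "Cookie: $ADMIN_COOKIE" \
    "https://app.cleanestimate.pro/api/admin/jobs?limit=25&offset=$offset"
done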

The default test is session-paced, not a single-client flood.

That matters because /api/admin/* now has a baseline per-user rate limit. If you hammer the admin APIs from one cookie, you mostly learn that your own guardrail works.

A realistic office-load test should spread requests across a pool of authenticated admin sessions so the results reflect multi-user behavior.

Default 1K-Ready Profile

The default profile is:

  • warmup: 8 sessions for 60s
  • steady: 20 sessions for 180s
  • peak: 40 sessions for 300s
  • spike: 60 sessions for 120s

Think time between requests defaults to 900-2500ms.

That is a much better model for a 1,000-user business app than firing 1,000 requests at once from one process.
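
As a rough sanity check on what those phases imply, divide concurrent sessions by the average think time; this ignores response time, so real request rates will run a little lower:

# Back-of-the-envelope request rate per phase: sessions / average think time in seconds.
awk 'BEGIN {
  think = (0.9 + 2.5) / 2
  printf "steady ~%.0f req/s\n", 20 / think
  printf "peak   ~%.0f req/s\n", 40 / think
  printf "spike  ~%.0f req/s\n", 60 / think
}'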

Running The Test

Create a local file with one full admin Cookie header value per line, then run:

STRESS_BASE_URL=https://app.cleanestimate.pro \
STRESS_ADMIN_COOKIE_FILE=.secrets/stress-admin-cookies.txt \
STRESS_CRON_SECRET=replace-me \
npm run test:stress
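
Before a long run it is worth confirming that every line in the cookie file still authenticates; the route choice below is arbitrary, and any cheap admin GET from the mix would do:

# Expect a 2xx per cookie; anything else usually means that session expired or lacks admin access.
while IFS= read -r cookie; do
  curl -s -o /dev/null -w "%{http_code}\n" -H "Cookie: $cookie" \
    "$STRESS_BASE_URL/api/admin/services"
done < .secrets/stress-admin-cookies.txt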

To inspect the resolved plan without sending traffic:

npm run test:stress -- --dry-run

Each run writes a JSON report into stress-reports/ unless you pass a custom --output.
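
For example, to route a pre-launch drill's report to a specific file (the path is arbitrary, and the exact flag syntax should match whatever the harness parses):

npm run test:stress -- --output=stress-reports/pre-launch-drill.json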

Pass Criteria

Treat the default drill as healthy when all of these remain true (a scripted check of the report follows the list):

  • error rate stays at or below 2%
  • 429 rate stays at or below 5%
  • overall p95 latency stays at or below 1500ms
  • health-check p95 stays at or below 500ms
  • queue backlog does not grow continuously during the run
  • dead-letter work does not spike during the test window
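
That scripted check can look something like this for the first four thresholds; the field names and file name are hypothetical, so adjust them to whatever the report actually emits:

# jq -e exits non-zero when the combined (hypothetical) threshold expression is false.
jq -e '
  .errorRate        <= 0.02 and
  .rateLimitedRate  <= 0.05 and
  .p95Ms            <= 1500 and
  .healthCheckP95Ms <= 500
' stress-reports/latest-run.json && echo PASS || echo FAIL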

When A Run Fails

If throttling is the main failure:

  • add more admin cookies before assuming the app itself is the bottleneck
  • confirm you are not over-driving the same small set of users into the baseline admin limiter

If Jobs or Invoices are the dominant latency source:

  • review the active list queries
  • compare the org size against the search terms and offsets used in the run
  • confirm the latest indexes and migrations are live in the target environment (see the index check after this list)
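
The index check is a plain pg_indexes query; SUPABASE_DB_URL is a placeholder and the table names are guesses, so substitute whatever the migrations touch:

psql "$SUPABASE_DB_URL" -c \
  "SELECT tablename, indexname FROM pg_indexes
   WHERE schemaname = 'public' AND tablename IN ('jobs', 'invoices')
   ORDER BY tablename, indexname;"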

If health checks degrade:

  • inspect Supabase latency
  • inspect queue backlog and oldest ready age (see the curl sketch after this list)
  • confirm the queue snapshot path is still timing out quickly instead of hanging
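
Both health views are one curl away; the second command assumes the cron secret travels as a standard bearer token, matching the STRESS_CRON_SECRET variable used by the harness:

# Liveness: HEAD /api/health, printing status code and total time.
curl -sI -o /dev/null -w "%{http_code} %{time_total}s\n" "$STRESS_BASE_URL/api/health"

# Queue visibility: authenticated GET /api/health, pretty-printed with jq.
curl -s -H "Authorization: Bearer $STRESS_CRON_SECRET" "$STRESS_BASE_URL/api/health" | jq .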
