Stress Testing
Run the repeatable 1K-user readiness drill against CE Pro before launches and scale changes.
CE Pro now ships with a repeatable stress-test harness for the office and platform paths most likely to matter as traffic grows.
The goal is not to win a benchmark. The goal is to answer a much more useful question:
Can this deployment handle a realistic jump toward 1,000 users without falling apart on the hot admin, health, and queue paths?
Telemetry Gate Before Stress Runs
Production telemetry comes first. Stress tests are now a validation tool after the team knows which production routes matter, not the first source of truth for Phase 1 work.
Before ranking scale fixes, collect 3-7 days of production traffic and export the Phase 0 baseline:
```
npm run telemetry:production -- --days=7 --out=reports/production-telemetry.md
```

Use that report to identify the top slow routes, run `EXPLAIN ANALYZE` on the top slow SQL statements, and classify each fix as a missing index, an N+1 pattern, an exact count, a bad join, or a larger architecture issue.
If the report shows current SLOs are already met at the current measured load, halt speculative Phases 1-6 and re-evaluate quarterly. If the report shows real hot paths, run focused stress tests against those routes after the cheap fixes or queueing changes land.
What The Harness Exercises
The default route mix focuses on:
- `GET /admin`
- `GET /admin/jobs`
- `GET /admin/invoices`
- `GET /api/admin/jobs?limit=25&offset=*`
- `GET /api/admin/invoices?limit=25&offset=*`
- `GET /api/admin/services`
- `GET /api/admin/lead-sources`
- `GET /api/admin/tags`
- `HEAD /api/health`
- optional `GET /api/health` with the cron bearer secret for queue visibility
That mix intentionally leans on the exact paths touched in the scale-hardening work:
- paginated Jobs and Invoices reads
- short-lived org-scoped cache reads
- server-first admin page renders
- health and queue visibility
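One way to picture the mix is as a weighted request table that the harness samples from. The paths below come from the list above; the type names, weights, and `pickRoute` helper are illustrative assumptions, not the harness's real API:

```typescript
// Hypothetical sketch of the default route mix as a weighted table.
// Paths match the docs; the weights and helper names are assumptions.
type StressRoute = { method: "GET" | "HEAD"; path: string; weight: number };

const routeMix: StressRoute[] = [
  { method: "GET", path: "/admin", weight: 2 },
  { method: "GET", path: "/admin/jobs", weight: 2 },
  { method: "GET", path: "/admin/invoices", weight: 2 },
  { method: "GET", path: "/api/admin/jobs?limit=25&offset=0", weight: 4 },
  { method: "GET", path: "/api/admin/invoices?limit=25&offset=0", weight: 4 },
  { method: "GET", path: "/api/admin/services", weight: 1 },
  { method: "GET", path: "/api/admin/lead-sources", weight: 1 },
  { method: "GET", path: "/api/admin/tags", weight: 1 },
  { method: "HEAD", path: "/api/health", weight: 1 },
];

// Weighted random pick: heavier routes (paginated reads) fire more often.
function pickRoute(mix: StressRoute[], rand: () => number = Math.random): StressRoute {
  const total = mix.reduce((sum, r) => sum + r.weight, 0);
  let roll = rand() * total;
  for (const r of mix) {
    roll -= r.weight;
    if (roll <= 0) return r;
  }
  return mix[mix.length - 1];
}
```

Weighting the paginated Jobs and Invoices reads more heavily keeps the sampled traffic biased toward the paths the scale-hardening work actually touched.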
Why It Uses A Cookie Pool
The default test is session-paced, not a single-client flood.
That matters because /api/admin/* now has a baseline per-user rate limit. If you hammer the admin APIs from one cookie, you mostly learn that your own guardrail works.
A realistic office-load test should spread requests across a pool of authenticated admin sessions so the results reflect multi-user behavior.
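A minimal sketch of that pooling, assuming the documented file format of one full `Cookie` header value per line (the function names here are illustrative, not the harness's real internals):

```typescript
// Hypothetical sketch of spreading load across an admin cookie pool.
import { readFileSync } from "node:fs";

// One full Cookie header value per line, blank lines ignored.
function loadCookiePool(path: string): string[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Each virtual session keeps one cookie for its whole lifetime, so the
// per-user /api/admin/* rate limiter sees many distinct users instead
// of a single-client flood.
function cookieForSession(pool: string[], sessionIndex: number): string {
  return pool[sessionIndex % pool.length];
}
```

Pinning one cookie per session (rather than rotating per request) is what makes the run look like an office of concurrent admins instead of one very noisy user.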
Default 1K-Ready Profile
The default profile is:
- warmup: 8 sessions for 60s
- steady: 20 sessions for 180s
- peak: 40 sessions for 300s
- spike: 60 sessions for 120s

Think time defaults to between 900ms and 2500ms between requests.
That is a much better model for a 1,000-user business app than firing 1,000 requests at once from one process.
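The ramp and the jittered think time can be sketched as data. The phase names, session counts, and 900-2500ms bounds come from the profile above; the helper names are assumptions:

```typescript
// Hypothetical sketch of the default ramp as data plus jittered think time.
type Phase = { name: string; sessions: number; durationSec: number };

const defaultProfile: Phase[] = [
  { name: "warmup", sessions: 8, durationSec: 60 },
  { name: "steady", sessions: 20, durationSec: 180 },
  { name: "peak", sessions: 40, durationSec: 300 },
  { name: "spike", sessions: 60, durationSec: 120 },
];

// Uniform jitter between the documented 900ms and 2500ms bounds, so
// sessions pace themselves like humans instead of a synchronized burst.
function thinkTimeMs(rand: () => number = Math.random, min = 900, max = 2500): number {
  return min + Math.floor(rand() * (max - min));
}

// Total wall-clock length of the drill, ignoring ramp transitions.
const totalSec = defaultProfile.reduce((sum, p) => sum + p.durationSec, 0);
```

At the spike phase, 60 sessions each pausing roughly 1-2.5s between requests works out to a few dozen requests per second, which is a realistic ceiling for a 1,000-user business app, not a synthetic flood.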
Running The Test
Create a local file with one full admin Cookie header value per line, then run:
```
STRESS_BASE_URL=https://app.cleanestimate.pro \
STRESS_ADMIN_COOKIE_FILE=.secrets/stress-admin-cookies.txt \
STRESS_CRON_SECRET=replace-me \
npm run test:stress
```

To inspect the resolved plan without sending traffic:

```
npm run test:stress -- --dry-run
```

Each run writes a JSON report into `stress-reports/` unless you pass a custom `--output`.
Pass Criteria
Treat the default drill as healthy when all of these remain true:
- error rate stays at or below 2%
- `429` rate stays at or below 5%
- overall `p95` latency stays at or below 1500ms
- health-check `p95` stays at or below 500ms
- queue backlog does not grow continuously during the run
- dead-letter work does not spike during the test window
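The gates above are easy to automate against the JSON report. The field names in this sketch are assumptions about the report shape; the thresholds match the criteria listed:

```typescript
// Hypothetical sketch of grading a run report against the documented gates.
// Field names are assumptions; thresholds match the pass criteria.
type RunReport = {
  errorRate: number;       // fraction of failed requests, e.g. 0.01 = 1%
  rate429: number;         // fraction of 429 (rate-limited) responses
  p95Ms: number;           // overall p95 latency
  healthP95Ms: number;     // health-check p95 latency
  queueBacklogGrew: boolean;
  deadLetterSpiked: boolean;
};

function isHealthyRun(r: RunReport): boolean {
  return (
    r.errorRate <= 0.02 &&
    r.rate429 <= 0.05 &&
    r.p95Ms <= 1500 &&
    r.healthP95Ms <= 500 &&
    !r.queueBacklogGrew &&
    !r.deadLetterSpiked
  );
}
```

A check like this can gate a pre-launch pipeline: fail the build when any single criterion is missed rather than eyeballing the report.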
When A Run Fails
If throttling is the main failure:
- add more admin cookies before assuming the app itself is the bottleneck
- confirm you are not over-driving the same small set of users into the baseline admin limiter
If Jobs or Invoices are the dominant latency source:
- review the active list queries
- compare the org size against the search terms and offsets used in the run
- confirm the latest indexes and migrations are live in the target environment
If health checks degrade:
- inspect Supabase latency
- inspect queue backlog and oldest ready age
- confirm the queue snapshot path is still timing out quickly instead of hanging