we gave a client 3x faster API responses. the fix had nothing to do with code.
p99 was sitting at 600ms. we checked everything: query execution plans, indexes, N+1s, the connection pool. added Redis caching on the heaviest endpoints. hit rate came back at 4%, because request params varied so much that almost no cached response was ever reused. three weeks in and we'd basically just added a round trip to every request.
then someone looked at the NGINX config. keepalive_timeout was 2 seconds. keepalive_requests was 10. under real load, client connections were being torn down and renegotiated constantly, so requests kept paying TCP setup and TLS handshake overhead. gzip wasn't on for API responses. upstream keepalive between NGINX and the app wasn't configured at all, so every proxied request opened a fresh connection to the app. the fix was a config file. 40 minutes. p99 dropped to under 200ms. some endpoints hit sub-100ms. zero application changes.
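for anyone who wants to check their own setup, here's a rough sketch of what that kind of change can look like. the values are illustrative, not the exact numbers from this fix, and the upstream name app_backend, the 127.0.0.1:8080 address and the /api/ path are made up for the example. it's a fragment of the http block, not a full config.

```nginx
# sketch only: values are assumptions, tune them against your own traffic
http {
    # client-facing keepalive: let connections live long enough to be reused
    keepalive_timeout  65s;
    keepalive_requests 1000;

    # compress API responses (text/html is compressed by default once gzip is on)
    gzip            on;
    gzip_types      application/json;
    gzip_min_length 1024;   # skip tiny bodies where compression doesn't pay off

    upstream app_backend {          # hypothetical name and address
        server 127.0.0.1:8080;
        keepalive 32;               # idle upstream connections cached per worker
    }

    server {
        listen 443 ssl;
        # ssl_certificate and ssl_certificate_key omitted here

        location /api/ {
            proxy_pass http://app_backend;
            # upstream keepalive needs HTTP/1.1 and a cleared Connection header
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
```

the upstream keepalive part is the easiest to miss: the keepalive directive on its own isn't enough, because NGINX proxies over HTTP/1.0 and sends "Connection: close" unless you set the two proxy_ lines as well.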
the answer was sitting in a config file the whole time while we were rewriting cache invalidation logic. if your app looks healthy and your db looks healthy, check what's sitting in front of them and between them before you touch anything else. what's the most embarrassing place you've found a bottleneck?