u/supreme_tech

▲ 10 r/Backend

we gave a client 3x faster API responses. the fix had nothing to do with code.

p99 sitting at 600ms. we checked everything. execution plans, indexes, N+1s, connection pool. added Redis caching on the heaviest endpoints. hit rate came back at 4% because param variation meant almost nothing reused. three weeks in and we'd basically just added a round trip to every request.

then someone looked at the NGINX config. keepalive_timeout was 2 seconds. keepalive_requests was 10. under real load, connections were tearing down and renegotiating constantly, every upstream request paying SSL handshake and TCP setup overhead. gzip wasn't on for API responses. upstream keepalive between NGINX and the app wasn't configured at all. fix was a config file. 40 minutes. p99 dropped to under 200ms. some endpoints hit sub-100ms. zero application changes.
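
if you want to sanity check your own proxy config for the same four things, here's a rough python sketch. it is not what we ran, just an illustration. the function names and both thresholds are made up for the example, and the parsing is deliberately naive (no includes, no block scoping, only plain-second time values).

```python
import re

# Hypothetical thresholds for illustration only; tune them for your own traffic.
MIN_KEEPALIVE_TIMEOUT_S = 30
MIN_KEEPALIVE_REQUESTS = 100


def _directive_int(config: str, name: str):
    """Return the first integer value of a simple `name <int>[s];` directive, or None."""
    match = re.search(rf"^\s*{re.escape(name)}\s+(\d+)s?\s*;", config, re.MULTILINE)
    return int(match.group(1)) if match else None


def lint_nginx_config(config: str) -> list[str]:
    """Flag the four settings mentioned in the post (simplified text scan)."""
    warnings = []

    timeout = _directive_int(config, "keepalive_timeout")
    if timeout is not None and timeout < MIN_KEEPALIVE_TIMEOUT_S:
        warnings.append(f"keepalive_timeout is {timeout}s")

    requests = _directive_int(config, "keepalive_requests")
    if requests is not None and requests < MIN_KEEPALIVE_REQUESTS:
        warnings.append(f"keepalive_requests is {requests}")

    if not re.search(r"^\s*gzip\s+on\s*;", config, re.MULTILINE):
        warnings.append("gzip is not enabled")

    # `keepalive <n>;` inside an upstream block enables connection reuse to the app.
    upstream_blocks = re.findall(r"upstream\s+\S+\s*\{([^}]*)\}", config)
    if not any(
        re.search(r"^\s*keepalive\s+\d+\s*;", block, re.MULTILINE)
        for block in upstream_blocks
    ):
        warnings.append("no upstream keepalive configured")

    return warnings
```

one caveat: `gzip on` by itself only compresses `text/html` by default, so for JSON responses you would also want to look at `gzip_types`. the sketch does not check that.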

the answer was sitting in a config file the whole time while we were rewriting cache invalidation logic. if your app looks healthy and your db looks healthy, check what's sitting between them before you touch anything else. what's the most embarrassing place you've found a bottleneck?

reddit.com
u/supreme_tech — 23 hours ago

our ai demo looked perfect. then real users destroyed it in a week.

we had this ai feature working pretty nicely in staging. clean json coming in, small files, predictable prompts, response time was fine and the model was giving decent answers. logs were clean too, so honestly we thought we were mostly done with the risky part.

then actual users started using it and the whole thing got messy pretty fast. within the first week we were already seeing stuff we had never tested properly: pdfs with weird formatting, tables copied from excel, prompts with half the info missing, random extra context pasted in, all that kind of stuff. the annoying part was that most of it still passed validation. the api returned 200, latency looked fine, no scary errors in the logs. so from the backend side it looked healthy.

but the answers were not always healthy. some were slightly wrong, some missed important sections because the parser skipped chunks, some retrieval results were just not relevant enough, and our fallback logic was basically too polite. it kept trying to answer instead of saying it did not have enough clean input for this. we were tracking uptime and token usage like that was enough, but we were not really tracking retrieval misses, low confidence outputs or bad answer patterns.
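
to make the "too polite" part concrete, here's a minimal python sketch of a gate that refuses instead of guessing and records why. everything in it is hypothetical: the `PipelineResult` fields, the reason names and both thresholds are assumptions for the example, not our actual code.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would come from evaluating your own data.
MIN_RETRIEVAL_SCORE = 0.5
MIN_PARSE_COVERAGE = 0.9


@dataclass
class PipelineResult:
    retrieval_scores: list[float]  # similarity score per retrieved chunk
    chunks_parsed: int             # chunks the parser actually extracted
    chunks_total: int              # chunks the parser saw in the upload


def decide(result: PipelineResult) -> tuple[str, list[str]]:
    """Return ("answer" or "refuse", reasons). The reasons double as metrics."""
    reasons = []

    coverage = result.chunks_parsed / result.chunks_total if result.chunks_total else 0.0
    if coverage < MIN_PARSE_COVERAGE:
        reasons.append("parser_skipped_chunks")

    if not result.retrieval_scores or max(result.retrieval_scores) < MIN_RETRIEVAL_SCORE:
        reasons.append("retrieval_miss")

    return ("refuse" if reasons else "answer", reasons)
```

counting the returned reasons per day would give you the retrieval-miss and skipped-chunk numbers that uptime and token usage never show.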

so yeah, lesson learned the hard way. an ai demo working in staging does not mean the feature is ready for real users. staging tests the happy path. production tests every weird thing people can possibly upload or type. curious what broke first for you after shipping an ai feature?

u/supreme_tech — 2 days ago

spent 2 years as 1 of 2 devs on a live aviation platform inside a 500 person org. here's what it actually taught me about growing as a developer.

the scariest moment in my career wasn't a job interview or a performance review.

it was pushing to production on a platform that flight departments in 50+ countries were actively using. real trips. real operations. a bad merge here doesn't just break a UI, it disrupts actual flights.

i was 1 of 2 devs inside a 500+ person engineering org. we had to migrate the entire frontend from Knockout.js to React with TypeScript while the platform was live. no downtime allowed. ever.

what it taught me was honestly more valuable than any course or bootcamp i've ever done.

you learn real discipline when touching shared code means coordinating with teams you've sometimes never even met. some weeks we reviewed more than we coded. that kind of restraint is not something you pick up on side projects or personal builds.

you also learn how to stay calm under live pressure. staging environments straight up lie to you. real bugs only show up when real users are doing real things in ways you didn't anticipate. you just gotta build the muscle to fix things fast without losing your head.

and you learn that showing up consistently over 2 years beats any single heroic day.

every module migrated. platform runs faster. zero downtime.

for anyone earlier in their career: has working inside a large org shifted how you think about your craft? curious what moment actually leveled you up.

u/supreme_tech — 5 days ago

built a risk scoring system for law enforcement. it was silently wrong for 6 weeks and nobody caught it

so me and my colleague had been grinding on this blockchain wallet risk scoring platform for like 3 months at that point. basically law enforcement was using it. real cases, real people, the whole deal. and for six weeks straight everything just.. worked, yknow? queries fast, logs clean, nobody blowing up our phones. we were honestly feeling pretty good about ourselves ngl.

then out of nowhere a detective calls and goes "hey that wallet you guys flagged low risk just sent directly to a sanctioned address." i literally just sat there for a second like.. wait what. pulled up the database straight away, half expecting some dumb sync issue or missing record, but nope, the transaction was just sitting right there. fully ingested, perfectly stored, timestamped correctly. we had the data the whole entire time. we just never actually scored it right. the dashboard was showing 100% pipeline completion, zero errors, everything green as can be. and we were still pointing investigators in the wrong direction. not great.

took us 3 days to track it down tbh. we'd set a node budget on graph traversal, 50k nodes, which is pretty standard stuff. but when it hit that cap it didn't error, didn't return null, didn't log anything at all. it just quietly returned whatever partial score it had managed to build so far, and it looked completely identical to a finished one. same shape, same confidence, same everything. this particular wallet had insane transaction volume, so the graph just burned through the entire budget on a totally unrelated branch and never even got close to the sanctioned connection. the system had absolutely no idea it hadn't finished. and honestly neither did we lol.

fix was pretty straightforward once we finally found it. anything hitting the cap now returns inconclusive instead of an actual score. high volume wallets get pre-computed at ingestion. a coverage ratio is required on every single response now, no exceptions. but tbh the fix isn't really what keeps me up at night. it's those six weeks, man. we built a hard computational limit with literally zero way of saying "hey i didn't finish" in a system directly affecting real investigations. a system that crashes at least has the decency to tell you something went wrong. a system that silently returns incomplete results as complete ones is a whole different level of scary because it never once gives you a reason to question it. anyway curious if anyone else has been here. what's the most confident looking output your system ever returned that turned out to be completely wrong?
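
for anyone who wants to see the shape of it, here's a minimal python sketch of a budgeted traversal that says when it got cut short. it is not our production code. the graph format, the `TraversalResult` fields and the idea of counting flagged hits as a stand-in for the real score are all assumptions for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TraversalResult:
    status: str      # "complete" or "inconclusive"
    risky_hits: int  # flagged addresses reached before stopping
    visited: int     # nodes expanded
    coverage: float  # visited / (visited + nodes still queued)


def traverse(graph: dict[str, list[str]], start: str,
             flagged: set[str], node_budget: int) -> TraversalResult:
    """Breadth-first walk that reports when the node budget stopped it early."""
    seen = {start}
    queue = deque([start])
    visited = 0
    risky_hits = 0

    while queue and visited < node_budget:
        node = queue.popleft()
        visited += 1
        if node in flagged:
            risky_hits += 1
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)

    # Anything left in the queue means the budget ran out before the walk finished.
    status = "complete" if not queue else "inconclusive"
    # Only counts nodes already discovered, so it overstates true coverage.
    coverage = visited / (visited + len(queue))
    return TraversalResult(status, risky_hits, visited, coverage)
```

the point is that the caller can no longer confuse a truncated walk with a finished one: zero hits with status inconclusive means "don't know", not "low risk".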

u/supreme_tech — 5 days ago

How We Transformed an Enterprise Platform with Automation, Testing, and CI/CD Pipelines

So, we got this client who asked us to help level up their next-gen project portfolio management tool. The goal was pretty simple: "build a platform that actually gives transparency and control over project execution".

But once we dug in, we realised the real challenge wasn’t just getting the system to work. It was about optimising the backend to handle a massive and diverse user base. It wasn’t enough for the platform to just function. It had to scale smoothly when traffic picked up, which is easier said than done.

To make it happen, we dove into automated API testing with Postman and RestAssured. Our main focus was making sure the data stayed solid and the backend could hold up, even under heavy loads.

We ran load tests to find the bottlenecks, and yeah, we found a few spots where the system could break under stress. Once we had that info, we plugged automated testing scripts right into the Azure CI/CD pipeline.

So, every time we pushed a new code commit, the system automatically ran the tests, which saved us a ton of time and helped resolve issues faster. We also ran cross-browser tests, tweaking things to make sure everything worked smoothly across different environments without messing with the user experience.

The results? The platform became way more stable, we saw far fewer bugs, and we were able to release features faster than ever.

The best part? It became scalable enough to handle high traffic without slowing down. Have you worked on something similar for enterprise-level projects?

How do you deal with performance testing, CI/CD pipelines, and backend automation?

Let’s connect and chat about how we can keep improving backend systems!

u/supreme_tech — 6 days ago
▲ 0 r/word

So we just wrapped up a website for a geospatial engineering firm operating across 5 continents in Oil & Gas, Offshore Wind and subsea surveying. Pretty serious client. The site had 3D visuals, interactive maps and large hero sections and honestly the Core Web Vitals were just awful. Everyone on the team assumed the rich media was the problem. Nope, not even close. The previous build was using a popular page builder that was quietly loading CSS and JS for features the site never even used. Dozens of registered scripts firing on every single page. Like the DOM was completely bloated before a single image even rendered. Wild.

We said forget it and scrapped the whole thing, built a custom PHP theme from scratch. No Elementor, no Divi, none of that stuff. Just clean markup where every single line had a reason to be there. Render-blocking assets went from 11 stylesheet calls down to inlined critical CSS with everything else deferred. Image prioritization was the next thing we dug into and tbh this one surprised us. Files were already sized fine but the sequencing was all wrong. Just adding fetchpriority="high" to above-fold images moved LCP way more than we expected. CLS was probably the most frustrating part because it turned out to have three completely separate causes: font swapping without reserved space, embedded maps with no explicit dimensions and dynamic blocks loading without height reservations. Yeah, there's just no single fix for that one. You literally have to hunt them down one by one.
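
If you want a quick way to hunt down one of those CLS causes, here is a minimal Python sketch that scans rendered HTML for images and embeds without explicit dimensions. It is an illustration, not the tooling we used. The class and function names are made up, and it only looks at `width` and `height` attributes, so anything sized purely through CSS will show up as a false positive.

```python
from html.parser import HTMLParser


class DimensionAudit(HTMLParser):
    """Collect <img> and <iframe> tags that lack explicit width and height."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "iframe"):
            return
        names = {name for name, _ in attrs}
        # Both attributes must be present for the browser to reserve space early.
        if not {"width", "height"} <= names:
            self.missing.append((tag, dict(attrs).get("src", "")))


def find_unsized_media(html: str) -> list[tuple[str, str]]:
    """Return (tag, src) pairs for media elements without reserved dimensions."""
    audit = DimensionAudit()
    audit.feed(html)
    return audit.missing
```

Running it over the saved HTML of each template gives you a list to work through one by one, which is about as good as it gets for layout shift.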

Site now serves clients across UAE, Norway, India, Qatar, Saudi Arabia and Brunei with solid scores across every region so that feels pretty good. At the end of the day WordPress isn't slow. The abstraction layer sitting on top of it usually is. Has anyone else had to convince a client to drop a page builder for performance reasons and how did you actually frame that conversation?

u/supreme_tech — 13 days ago

We built an AI-based PCB inspection system and the goal looked simple at first. Capture a board image, detect missing or misaligned components, return pass or fail and keep the inference fast enough so it could actually be used in production. The first version looked pretty solid in testing. YOLO was detecting the main defects, the UI was working fine and test accuracy was around 85%. But once we got closer to real factory-floor conditions, the results started getting inconsistent in ways our test setup never really showed.

The first problem was not even the model. It was image quality. PCB surfaces are reflective and small changes in lighting, board position, camera angle or even component height were creating shadows that affected detection. At first we kept trying to tune the model but the bigger fix was actually cleaning up the input pipeline. We added more controlled diffuse lighting, normalized images before inference and started checking raw image samples properly before blaming the model. That alone improved consistency more than we expected.

The second issue was the dataset. Our test data was too close to the training data so that 85% accuracy was not really proving generalization. When we tested on denser PCB variants, performance dropped. So we had to rebuild the annotation workflow with cleaner labels, more defect variation, better negative examples and a process to keep improving the dataset instead of treating labeling like a one-time task.
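
One common way to keep test data from sitting too close to training data is to hold out whole board variants instead of random images. The Python sketch below shows that idea under assumptions: the `(image_id, board_variant)` sample format, the function name and the 25% holdout fraction are hypothetical, and this is not necessarily how our final split was built.

```python
import random


def split_by_variant(samples, holdout_fraction=0.25, seed=0):
    """Split (image_id, board_variant) pairs so no variant appears on both sides."""
    variants = sorted({variant for _, variant in samples})
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(variants)

    # Always hold out at least one full variant for testing.
    n_holdout = max(1, round(len(variants) * holdout_fraction))
    test_variants = set(variants[:n_holdout])

    train = [s for s in samples if s[1] not in test_variants]
    test = [s for s in samples if s[1] in test_variants]
    return train, test
```

With a split like this, a high test score has to come from boards the model has never seen, which is closer to what the denser variants exposed for us.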

The third issue was sustained inference performance. Full-resolution inference looked okay in short tests but the fanless industrial PC behaved differently after running for hours. Cold benchmarks did not show thermal limits or frame delays. We ended up changing the pipeline. Normalize lighting, crop the region of interest, run detection only where it mattered, log results properly and keep model training separate from live inference.
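
The first two steps of that pipeline can be sketched in a few lines of plain Python. This is a simplified illustration, not our implementation: it assumes a grayscale image stored as a 2D list, uses a basic min-max stretch as the normalization step, and the function names and ROI tuple layout are made up for the example.

```python
def normalize(image):
    """Min-max stretch pixel values to the 0.0-1.0 range."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: avoid dividing by zero and return all zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(pixel - lo) / (hi - lo) for pixel in row] for row in image]


def crop(image, top, left, height, width):
    """Return the region of interest from a 2D list of pixels."""
    return [row[left:left + width] for row in image[top:top + height]]


def preprocess(image, roi):
    """Normalize first, then crop, so detection only sees the region that matters."""
    return crop(normalize(image), *roi)
```

Cropping before inference also cuts the number of pixels the model has to process per frame, which matters on hardware that throttles after hours of sustained load.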

Main lesson for us was that computer vision accuracy in a controlled test does not mean much until lighting, camera setup, hardware limits, operators and real product variants are part of the evaluation.

For people running vision systems in production, where do most of your accuracy problems usually come from?

Model selection, dataset quality, lighting setup, preprocessing or hardware constraints?

u/supreme_tech — 14 days ago

Okay so it is 3am and I cannot sleep so I am just going to write this out. We built a risk scoring tool for law enforcement. User pastes a wallet address, system traces it through the blockchain graph, comes back with a risk score based on whether that wallet ever touched anything dirty. Sanctioned addresses, mixers, fraud clusters. We shipped it, it ran clean, nobody complained, we felt good about it. Then we got a call. A wallet we had scored as low risk had sent directly to a sanctioned address. We pulled up the database and the transaction was just sitting there. Fully ingested. Perfectly stored. We had it the whole time. We just never flagged it. And our dashboard was still showing 100% completion, zero errors, everything green. Nothing anywhere suggested anything had gone wrong.

It took us three days to find it and when we did I just sat there for a minute. We had set a node budget on the graph traversal. 50,000 nodes, completely reasonable. But when the traversal hit that cap it did not throw an error or mark anything as incomplete. It just quietly returned whatever partial score it had built so far and called it done. The response looked identical to a full traversal. This particular wallet had massive transaction volume so the graph burned through the entire budget on a completely unrelated branch and never reached the sanctioned connection. The system was not broken. It was just confidently wrong and had absolutely no way of telling us that.

The fix was not complicated. Traversals that hit the cap now return inconclusive instead of a score and high volume wallets get precomputed so we never hit the cap during a live query. But honestly I keep thinking about the failure more than the fix. There was no crash, no alert, no spike in any metric. The system just kept going, kept returning scores, kept looking perfectly healthy. A system that crashes is at least honest. A system that does not know when it is guessing is something else entirely. Has anyone else shipped something that looked completely fine on the surface but was quietly wrong the whole time? How did you even catch it and what changed after?

u/supreme_tech — 16 days ago