Still kind of sick about this one. Writing it down because maybe someone else is doing the same thing without knowing.
Quick context — I run a small trading bot, about 50 paying users on it. Bybit, mostly futures. The bot has a reconciliation routine that runs every few seconds: pulls the current open positions from the exchange, compares to what it thinks is open locally, and if something's missing on the exchange side it assumes the position got closed (manually by user, by liquidation, whatever) and cleans up the orphaned SL and TP orders for that symbol. Pretty standard hygiene stuff. You don't want stale stops sitting there forever.
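For the curious, the naive loop looked roughly like this. This is a reconstruction for illustration, not my actual code, and fetch_open_positions / cancel_orders are hypothetical wrapper names rather than real Bybit SDK calls:

```python
def reconcile(local_positions, exchange):
    # One bulk positions fetch, and the whole routine trusts it completely.
    open_symbols = {p["symbol"] for p in exchange.fetch_open_positions()}

    for symbol in list(local_positions):
        if symbol not in open_symbols:
            # Missing from the response, so assume the position was closed
            # and clean up its "orphaned" SL/TP. This is the fatal step:
            # missing from one response is not the same as closed.
            exchange.cancel_orders(symbol, ("StopLoss", "TakeProfit"))
            del local_positions[symbol]
```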
Here's what I didn't know. Bybit's bulk positions endpoint sometimes — maybe 1-2% of the time — returns an incomplete list. Not an error. Not empty. Just fewer positions than you actually have open. No flag, no warning, just less data than reality.
So the bot would look at the response, not see position X, conclude X must have been closed, and helpfully cancel the stop-loss on X. Which was still open. Still leveraged. Still very much exposed.
Six months. In production. On real money.
I didn't catch it from logs. I didn't catch it from monitoring. I caught it because one user dropped a casual line in chat — "hey my AAVE trade closed in profit but I had to do it manually, the TP didn't fire". I went and pulled the logs for her account. The TP had been placed at entry, looked correct. Then 12 minutes later my own reconciliation routine had cancelled it. The position was open the whole time. She ended up okay because price moved her way and she was watching, but if it had moved the other way she'd have been holding a leveraged position with no stop.
Started checking other users. Same story. Different symbols, different days, different amounts of damage. Some people had probably been losing money for months and just assumed they'd set something up wrong.
What really gets me is why it took me so long to find it. Three things, all my fault.
One: the bug was rare enough that most users never hit it badly. Two: the ones who did mostly blamed themselves — "I probably didn't enable SL properly", "maybe I configured something wrong". Nobody assumes the bot is actively cancelling their stops behind their back. Three, and the worst part: my own logging was deceiving me. The reconciliation logged "cleaning up orphaned orders for closed position X" every time it fired. When I scanned logs I read that as "good, cleanup is working." It was confirming the bug to itself in language that sounded like success.
That last one I'm going to think about for a while. You can stare at a log line every day for half a year and never see it because it sounds correct.
Fix is in now. Three layers, basically going from cheap to paranoid.
First — never trust a single positions fetch. If a position looks missing, do a second fetch with a small delay before doing anything. Both fetches have to agree the position is gone. That alone killed about 95% of false positives.
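In sketch form, assuming the same hypothetical fetch_open_positions wrapper and an arbitrary delay:

```python
import time

CONFIRM_DELAY_S = 2.0  # assumed value; tune to your polling and rate limits

def position_confirmed_gone(symbol, exchange):
    first = {p["symbol"] for p in exchange.fetch_open_positions()}
    if symbol in first:
        return False  # present on the first fetch, nothing to do
    time.sleep(CONFIRM_DELAY_S)
    second = {p["symbol"] for p in exchange.fetch_open_positions()}
    # Only when BOTH responses omit the symbol do we even consider it closed.
    return symbol not in second
```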
Second — even if both fetches agree, before cancelling the SL/TP go check the order history for an actual close event in the last few minutes. If there's no close event recorded — don't touch the orders. Position state and order history disagreeing should be a red flag, not an action item.
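Roughly like this, with fetch_order_history and the record fields (timestamp in epoch seconds, reduce_only, status) standing in for whatever your exchange wrapper actually returns:

```python
import time

CLOSE_LOOKBACK_S = 300  # "the last few minutes"

def close_event_recorded(symbol, exchange):
    cutoff = time.time() - CLOSE_LOOKBACK_S
    for order in exchange.fetch_order_history(symbol):
        # A recent filled reduce-only order is the evidence of a real close.
        if (order["timestamp"] >= cutoff
                and order["reduce_only"]
                and order["status"] == "Filled"):
            return True
    # No close event: position state and order history disagree.
    # Log it loudly and touch nothing.
    return False
```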
Third — hard rate limit on how many orders the reconciliation can cancel per cycle. If something else goes wrong and the previous two checks fail somehow, the damage is at least bounded. Can't wipe out everyone's stops in one bad cycle.
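Something like this, where the budget of 3 is an assumed number, not a recommendation:

```python
MAX_CANCELS_PER_CYCLE = 3  # assumed; size it to your typical turnover

def reconcile_with_budget(candidates, exchange, log):
    # candidates: symbols that already passed the two checks above
    cancelled = 0
    for symbol in candidates:
        if cancelled >= MAX_CANCELS_PER_CYCLE:
            # Budget exhausted: stop, alert, let a human look at the rest.
            log.warning("cancel budget hit (%d this cycle); deferring the rest",
                        cancelled)
            break
        exchange.cancel_orders(symbol, ("StopLoss", "TakeProfit"))
        cancelled += 1
```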
Two weeks in production now, zero false cancellations.
Anyway. If you're running a bot that does any reconciliation against exchange state — I'd seriously go look at it right now. Specifically what happens when the exchange response is partially correct. Not wrong, not empty, just incomplete. That was the case I never tested for and it cost real people real money.
Happy to answer questions if anyone's debugging similar stuff.