Rearchitecting an 8+ year-old OMS: am I about to make the classic mistakes?
I'm working on an Order Management System for a retailer: ~300K transactions/day, 24+ Spring Boot microservices, 8+ years old, inherited. The system makes real money every day, but the coupling has become existential. Before I commit to a rewrite or a refactoring, tell me where I'm wrong.
The pain:
- No two services talk directly. Everything routes through a central Camunda orchestrator over RabbitMQ, plus a homemade mini BPMN framework for correlations and retries. The orchestrator isn't a coordinator anymore, it's a bus with opinions.
- Every release is a major release. Non-trivial changes touch 50%+ of services, and BPMN changes aren't backwards-compatible with in-flight instances, so there's no rollback.
- One god model, one database, shared as a "common/core" dependency. Change a field, coordinate 24+ deployments. The shared lib is the de facto API contract.
- Many other shared libs (logging, monitoring, testing, infra connections for RabbitMQ, Kafka, Couchbase, ES...), all shared by all microservices.
- No config management: 90%+ of configuration lives in Helm values and env vars. Changing a threshold = commit, pipeline, pod restart.
- Debugging "how did this order end up in this state" is painful. No real audit trail beyond logs, and no versioning: one document per order, updated again and again.
- We have a home-grown Python tool wired into CI/CD to coordinate releases: it decides build order, opens MRs across repos to bump the shared/common libs, and sequences deployments. If you need a tool like this to ship, your services aren't independent.
- For years before I joined, multiple teams ran their own environments and infra, so you can imagine the release complexity. A single release could take up to a month.
Are these separate problems or one problem wearing seven hats?
My plan:
- Event-source the aggregates that actually benefit from it (Order, Payment, Inventory, Fulfillment...). Leave CRUD things as CRUD. Don't event-source for the sake of it.
- Drop Camunda. Use a lightweight state machine if needed in code + saga orchestrators for cross-aggregate flows.
- Consolidate to one messaging backbone (probably Kafka).
- Kill the shared libs, kill the god model.
- OpenTelemetry + proper tracing.
- Strangler fig, not big bang.
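For concreteness, here is a minimal sketch of what the first two bullets could look like together: an event-sourced Order whose "lightweight state machine" is just a transition table in code. It assumes Java 17+ (records). Every name in it (`OrderSketch`, `State`, `OrderEvent`, the four states) is made up for illustration and not taken from the real system, and persistence, concurrency and Kafka publishing are deliberately left out.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OrderSketch {

    // Hypothetical states; a real OMS would have many more.
    enum State { CREATED, PAID, SHIPPED, CANCELLED }

    // The whole state machine is this table of allowed transitions.
    static final Map<State, Set<State>> ALLOWED = Map.of(
            State.CREATED, EnumSet.of(State.PAID, State.CANCELLED),
            State.PAID, EnumSet.of(State.SHIPPED, State.CANCELLED),
            State.SHIPPED, EnumSet.noneOf(State.class),
            State.CANCELLED, EnumSet.noneOf(State.class));

    // An immutable fact about one order. The version orders events
    // and could later back optimistic locking in an event store.
    record OrderEvent(String orderId, long version, State to, String reason, Instant at) {}

    static final class Order {
        private final String id;
        private State state = State.CREATED;
        private long version = 0;
        private final List<OrderEvent> uncommitted = new ArrayList<>();

        Order(String id) { this.id = id; }

        // Rebuild the current state by replaying stored history.
        static Order replay(String id, List<OrderEvent> history) {
            Order order = new Order(id);
            for (OrderEvent e : history) order.apply(e);
            return order;
        }

        // Command: validate against the table, then record an event.
        void transitionTo(State target, String reason) {
            if (!ALLOWED.get(state).contains(target)) {
                throw new IllegalStateException(state + " -> " + target + " not allowed");
            }
            OrderEvent e = new OrderEvent(id, version + 1, target, reason, Instant.now());
            apply(e);
            uncommitted.add(e);
        }

        // State changes only ever happen by applying an event.
        private void apply(OrderEvent e) {
            state = e.to();
            version = e.version();
        }

        State state() { return state; }
        long version() { return version; }
        List<OrderEvent> uncommitted() { return List.copyOf(uncommitted); }
    }

    public static void main(String[] args) {
        Order order = new Order("order-1");
        order.transitionTo(State.PAID, "payment captured");
        order.transitionTo(State.SHIPPED, "carrier pickup");

        // The event list doubles as the audit trail for this order.
        for (OrderEvent e : order.uncommitted()) {
            System.out.println(e.version() + " " + e.to() + " (" + e.reason() + ")");
        }
        // Replaying the same events reproduces the same state.
        System.out.println(Order.replay("order-1", order.uncommitted()).state());
    }
}
```

The point of the sketch is that "how did this order end up in this state" becomes a question about the event list rather than about logs, which is the gap left today by one document per order, updated in place.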
So how would you tackle a project like this?
What strategy would you adopt? Is there hope, or is this the kind of system you just keep alive until the business replaces it?
A few things I genuinely don't know:
- Camunda is politically load-bearing. Management is attached to it, and frankly it's the only real monitoring and reprocessing capability we have today. "Just drop Camunda" is easy to say, but harder when the devs open Cockpit every morning to unblock orders.
- What did you replace it with that gave you equivalent visibility and reprocessing, not just equivalent orchestration?
- What are the pitfalls I'm not seeing? The ones that only show up 8 months in.
- Strangler fig: where's the first cut?
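To be clear about what I mean by a "cut", here is a rough sketch of the mechanism only, not of where to apply it: a routing facade that sends a controlled share of one flow type to the new path and everything else to the legacy Camunda path. It assumes Java 17+. All names (`StranglerRouterSketch`, `OrderCommand`, `flowType`, the `"RETURN"` flow) are hypothetical placeholders, and `"RETURN"` is not a suggestion for the first cut.

```java
import java.util.Set;

public class StranglerRouterSketch {

    // Hypothetical inbound command; flowType stands in for whatever
    // dimension the migration is sliced by.
    record OrderCommand(String orderId, String flowType) {}

    interface OrderHandler {
        String handle(OrderCommand cmd);
    }

    static final class Router implements OrderHandler {
        private final OrderHandler legacy;
        private final OrderHandler modern;
        private final Set<String> migratedFlows;
        private final int percent; // share of migrated flows sent to the new path, 0..100

        Router(OrderHandler legacy, OrderHandler modern, Set<String> migratedFlows, int percent) {
            if (percent < 0 || percent > 100) {
                throw new IllegalArgumentException("percent must be in 0..100");
            }
            this.legacy = legacy;
            this.modern = modern;
            this.migratedFlows = Set.copyOf(migratedFlows);
            this.percent = percent;
        }

        @Override
        public String handle(OrderCommand cmd) {
            return routesToModern(cmd) ? modern.handle(cmd) : legacy.handle(cmd);
        }

        boolean routesToModern(OrderCommand cmd) {
            if (!migratedFlows.contains(cmd.flowType())) {
                return false;
            }
            // Stable bucket per order ID, so the same order always takes
            // the same path for a given percent setting.
            int bucket = Math.floorMod(cmd.orderId().hashCode(), 100);
            return bucket < percent;
        }
    }

    public static void main(String[] args) {
        OrderHandler legacy = cmd -> "legacy:" + cmd.orderId();
        OrderHandler modern = cmd -> "modern:" + cmd.orderId();
        Router router = new Router(legacy, modern, Set.of("RETURN"), 100);

        System.out.println(router.handle(new OrderCommand("o-1", "RETURN")));   // prints "modern:o-1"
        System.out.println(router.handle(new OrderCommand("o-2", "STANDARD"))); // prints "legacy:o-2"
    }
}
```

Setting `percent` back to 0 gives a rollback for new orders, which is exactly what the current releases lack. It does nothing for orders already in flight on the new path, and that is part of what I'm asking about.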
TL;DR: 8+ year-old OMS, 24+ microservices that only talk through a central Camunda orchestrator, one god model, one database, shared libs everywhere, a Python tool to coordinate cross-repo releases. I want to rearchitect it, but Camunda is politically load-bearing. How would you tackle this? What pitfalls am I missing?