We Replaced 40 Legacy Systems: Here's What We Got Wrong

Wait 5 sec.

Two and a half years. Forty-something legacy systems. One new platform on EKS. By any external measure, the program was a success: it shipped, it absorbed the legacy, the lights stayed on. By any internal measure, the first year was a mess; we're lucky we recovered from it. This is the postmortem I wish I had read before we started.What we got right:A short list, because the wrong stuff is more interesting. We were right to build a routing façade first; we were right to pick a single platform target (EKS) instead of letting every team choose their own runtime; we were right to invest in shared observability before we wrote any business code. The platform foundations held up.Mistake #1: We started with the easiest system:It made sense at the time. Build credibility, prove the platform, ship a win. So, we migrated a small read-only lookup service first. It went smoothly. The team celebrated. The executive sponsor sent a "great job" email.\The problem: an easy first migration teaches you nothing about the hard ones. We learned nothing about database extraction, nothing about CDC pipelines, nothing about staff retraining, nothing about the API contract drift that would chew up six months of our second year. We should have started with a medium-complexity system, one that included a database, a downstream integration, and a non-trivial user-facing change. Even if it took 3x longer.\What stuck with me: pick your first migration for what it teaches, not for what it ships. A "win" that doesn't surface the platform's weaknesses isn't a win; it's a delayed disaster.Mistake #2: We Let "No New Features in the Legacy" Be a Verbal Agreement:Everyone agreed at the kickoff. "We will not add new features to the legacy systems while we migrate them." It was in the slides. It was in the steering committee notes. It was in nobody's engineering policy.\For nine months, teams added features to the legacy because the velocity felt higher, the customers were happier, and nobody enforced the rule. Every legacy feature added during the migration window became a feature we then had to re-migrate. We estimated, in retrospect, that this cost us four months of program time.\The fix: a written policy with a sign-off from the engineering VP, posted in the wiki, enforced at the pull-request level (a CI check that blocked changes to certain legacy directories without an explicit override). Verbal agreements lose to weekly delivery pressure. Written ones survive it.Mistake #3: Contract Tests Were "Nice to Have" Until They Were Not:We had unit tests. We had integration tests. We didn't have contract tests between the new services and their callers. So, when we cut over the customer service, three callers that depended on a deprecated field broke silently. The field was returning empty strings instead of nulls. None of us caught it for nine days. A handful of downstream batch jobs produced wrong data for that whole window.\After that, contract tests became table stakes. Every consumer-provider relationship was codified in Pact (or a homegrown equivalent), and every change to a service's response had to pass the consumer tests before it could merge. It added overhead. It also stopped the next nine field-shape regressions before they reached production.Mistake #4: We Underestimated the Database:The services were the visible work. The database extraction was where the real time went. Twelve of the forty legacy apps wrote to the same customer table. Extracting "customer" as a service meant migrating every one of those write paths through the new service's API. Each one had its own ORM, its own assumptions, its own undocumented invariants.\We estimated the database work as 20% of the program. It was closer to 45%. The fix in hindsight: budget the database extraction at parity with the service work, not as a tail item. And do not try to fork the database until every writer has been migrated for at least two release cycles; premature forking caused two of our worst incidents.Mistake #5: We Treated the API Gateway as Infrastructure, Not a Product:The gateway was set up by the platform team and then handed off to nobody. There was no owner, no roadmap, no backlog. When teams asked for new routing behaviors, request transforms, or auth modes, the answers were "the platform team is busy" or "we will get to it." Migrations stalled waiting on the gateway.\In month twelve, we made the gateway a product with a dedicated owner, a roadmap, and an on-call rotation. Migration throughput roughly doubled in the following quarter. The takeaway, for me: shared infrastructure that everyone depends on needs a product owner with real authority, or it becomes the bottleneck the whole program waits on.Mistake #6: We Confused "Cut Over" With "Decommissioned":When we routed traffic to a new service, the team would celebrate, mark the migration done in the program tracker, and move to the next thing. The legacy code kept running. It stayed running for an average of eight months past cutover. It stayed running because nobody owned the deletion, and because every cutover had a footgun, nobody wanted to be the first to disarm.\The result: the original "we'll save infrastructure cost" pitch did not materialize until late year two, when we finally instituted a "delete the legacy code 90 days after cutover" policy with an executive sign-off requirement to extend.What I Would Do Differently on Day One:Pick a medium-complexity first migration that surfaces platform weaknesses early.Codify the "no new features in legacy" rule as a CI policy on day one.Require contract tests before any cutover; treat them as part of the platform.Budget database extraction at parity with service work, not 20%.Make the API gateway a product with a named owner.Decommission within 90 days; track it like a delivery metric.The Honest Verdict:The program shipped. The new platform absorbed the legacy. By month 28, the team's deploy cadence had gone from quarterly to multiple times per day. The customer-facing wins were real. None of that erases the fact that we spent the first year learning lessons that more careful planning would have spared us.\If you are about to embark on a multi-system consolidation, read this list twice, and assume you'll rediscover at least three of these the hard way anyway. The hardest part of modernization is not the technology. It is the discipline to make agreements stick.