*I started with the right architecture for enterprise-scale code migration. Testing on real client data taught me how much further the engineering had to go.*

I've always believed that LLMs work best as components in engineered systems, not as standalone solutions. So when I set out to build a system that automates the migration of hundreds of legacy SAS scripts to PySpark for enterprise clients, I started with that mindset. Deterministic parsing to understand the source code. Structured pattern extraction to capture business-critical data. Validation pipelines to catch what the LLM gets wrong. The hybrid architecture was baked in from day one.

And it still wasn't enough.

On a production script containing 143 business-critical lookup mappings (insurance product codes to category labels), the LLM generated values with 28.6% accuracy. Not because I'd been naive about LLM limitations, but because the specific ways LLMs fail at enterprise scale went deeper than I expected. The generated code looked correct. It had the right structure, the right function names, even comments explaining the correct logic. The values were just wrong.

This article isn't about discovering that LLMs need engineering around them. Most engineers building with LLMs already know that. It's about what I learned building a system that puts those engineering decisions into practice: four architectural pivots, three months of testing against real client scripts, and a set of hard-won principles about where exactly the boundary between generative and deterministic needs to be.

## The Problem: Legacy Migration at Enterprise Scale

In regulated industries like insurance and financial services, SAS has been the dominant data platform for decades. A single organization might have hundreds of SAS scripts, each between 200 and 2,000+ lines, encoding business logic that has been refined, patched, and extended over 10 to 20 years.
These scripts calculate insurance premiums, process claims, manage policy portfolios, and generate regulatory reports.

Migrating one script manually takes 5 to 8 days of work from an engineer who understands both SAS and the target platform (PySpark, Snowflake, dbt). They need to read the code, understand its business intent, rewrite it in the new language, verify the output matches, and document the result. For a portfolio of 200 scripts, that's roughly 2 to 3 engineer-years of effort.

And critically, in regulated industries, no migrated code goes to production without human verification. An engineer must execute it, validate the output against the source, and sign off. That step isn't going away. The question is what the engineer spends the rest of their time doing: mechanical rewriting, or meaningful review.

The natural question was whether LLMs could power a tool that handles the mechanical rewriting for them. SAS is well-documented, PySpark is well-represented in training data, and code translation is a task LLMs demonstrably handle. But production scripts are not textbook examples. They contain:

- **Lookup tables with hundreds of entries.** A single SELECT/CASE block might map 143 numeric product codes to category labels. Each mapping is a business rule. Get one wrong and an insurance premium is miscalculated.
- **Stateful accumulation logic.** SAS RETAIN statements carry values across rows, combined with FIRST. and LAST. group processing. A script might retain 70+ variables across multiple processing steps. Miss one initialization and the entire downstream calculation drifts.
- **Deeply nested conditional chains.** Threshold-based discount calculations with 3 to 4 levels of IF-THEN-DO nesting, where each threshold value (0.82, 0.75, 0.50) is a business-critical number.

When you feed these scripts to an LLM, three failure modes emerge reliably.

**Truncation.** The LLM generates the first 30 of 143 mappings and stops. No error, no warning. Just silent data loss.

**Hallucination.**
The LLM produces a mapping table with the right structure but wrong values. It generates plausible-looking numbers that aren't the actual product codes.

**Omission.** Stateful logic like RETAIN variables, BY-group processing, and conditional initializations gets quietly dropped because the LLM doesn't recognize its significance in the SAS execution model.

These aren't edge cases. On our test set of real client scripts, they affected the majority of scripts over 300 lines.

## The Architecture: LLM as Surgeon

The core insight that shaped the entire system was simple: separate what to generate from what must be exact.

LLMs are exceptional at understanding semantic intent. They can read a 700-line SAS script and tell you "this calculates motor insurance premiums using vehicle characteristics and prior claims history." They can generate well-structured Python modules with business-meaningful names like premium_calculator.py and claims_aggregator.py. What they cannot do reliably is copy 143 numeric codes without error.

That realization drove a four-layer architecture, and each layer exists because of a specific problem I needed to solve.

The first thing I needed was a way to understand what SAS code actually does, without involving an LLM at all. So I built a rule-based semantic parser with 22 operation types (READ, WRITE, FILTER, JOIN, MERGE, AGGREGATE, SORT, and more) and specialized parsers for 15+ SAS procedures. No AI, just regex and domain knowledge. It covers 95%+ of SAS constructs, measured across thousands of operations in production scripts.

Critically, this parser also runs deterministic pattern extraction, pulling out SELECT/CASE mappings, RETAIN variable lists, and conditional block structures into machine-readable objects. These become the source of truth that no LLM is allowed to paraphrase.

With that structured understanding in hand, the next question was: how hard is this script to migrate, and what does it actually do for the business?
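To make the deterministic extraction idea concrete, here is a minimal sketch of rule-based WHEN-clause extraction over a simplified SAS SELECT syntax. The `CaseMapping` shape and the regex are illustrative only; the real parser covers far more SAS variants than this.

```python
import re
from dataclasses import dataclass

@dataclass
class CaseMapping:
    codes: list[str]   # numeric product codes listed in the WHEN clause
    label: str         # category label assigned by the THEN branch

# Matches simplified clauses like: when (101, 102) category = 'Motor';
WHEN_RE = re.compile(
    r"when\s*\(([^)]+)\)\s*(\w+)\s*=\s*'([^']*)'\s*;",
    re.IGNORECASE,
)

def extract_select_mappings(sas_source: str) -> list[CaseMapping]:
    """Pull every WHEN-clause mapping out of a SAS SELECT block,
    deterministically, so no LLM ever has to copy the values."""
    mappings = []
    for codes, _target, label in WHEN_RE.findall(sas_source):
        mappings.append(CaseMapping(
            codes=[c.strip() for c in codes.split(",")],
            label=label,
        ))
    return mappings

sas = """
select;
  when (101, 102) category = 'Motor';
  when (210) category = 'Home';
  otherwise category = 'Other';
end;
"""
print(len(extract_select_mappings(sas)))  # 2 mappings extracted
```

The point of structuring the output as objects rather than text is that the same extracted mappings can later be counted, validated against generated code, and rendered for any target platform.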
Here, I used the LLM for what it's genuinely good at: reading a script and telling me "this is a claims processing pipeline in the motor insurance domain." Meanwhile, deterministic analyzers scored convertibility on a 0-100 scale using line-based scoring across 41 predefined categories. The LLM provided the business context. The rules provided the objective measurement.

And then, the generation itself. This is where the LLM earns its place. It generates the semantic structure of the target code: module boundaries, function signatures, processing flow.

The validation layer is what makes the difference. A pipeline of deterministic validators and transformers runs post-generation, checking pattern coverage, verifying value accuracy, and injecting or replacing code where the LLM fell short. How deep this layer needed to go became the defining challenge of the project.

Each layer feeds the next with progressively richer context. By the time the LLM generates code, it has access to structured operations, extracted patterns, and dependency context from the deterministic layers, plus business logic summaries from earlier LLM analysis. Its job is narrowed to what it does best: reasoning about code structure and producing a coherent architecture. The validation layer then verifies every detail the LLM was never asked to get right on its own.

## The Generation Pipeline: Four Pivots

The four-layer architecture was intentional from the start. But the specifics of how each layer needed to work evolved through four pivots, each triggered by something real-world testing revealed.

### Pivot 1: Step-wise Chunking to Semantic Modules

The first generation approach was straightforward.
Split a large SAS script into chunks of roughly 30 operations each, generate PySpark for each chunk sequentially, and concatenate the results.

The numbers told the story.

| Metric | Step-wise | Semantic |
|----|----|----|
| Code completeness | 65% | 85–90% |
| Critical patterns implemented | 0 of 14 | 13 of 14 |
| Generated code (avg) | 320 lines | 1,194 lines |
| File naming | step_01_process.py, step_02_process.py | premium_calculator.py, claims_aggregator.py |
| Manual fix time | 6–9 hours | 2–3 hours |

Step-wise chunking produced code that was structurally fragmented. The LLM had no visibility into what came before or after each chunk, so it couldn't make coherent architectural decisions. Variables defined in chunk 1 were undefined in chunk 3. Processing flows that should have been a single module were scattered across arbitrary boundaries.

The semantic approach flipped the model. Instead of splitting by operation count, the system first asks the LLM to design the module architecture (given the full script's operations and business logic), then generates each module independently with full context about its role in the whole.

The lesson: give the LLM architectural decisions to make, not arbitrary chunks to fill. How you decompose the problem shapes everything downstream.

### Pivot 2: Operations-based to LOC-based Token Estimation

A subtle but costly bug. The system estimated token usage based on operation count (formula: BASE + operations × 75). This worked for average scripts but failed on scripts with high line-to-operation ratios, particularly dense data steps with many assignments per operation.

The discovery was accidental. When I added a second LLM provider with different token limits, I found that 21.6% of client scripts were being unnecessarily chunked, split into multiple passes when they could have been generated in a single call.
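One way to catch this kind of estimation drift is to fit the formula's constants against measured token counts from real runs. The sketch below shows the idea with ordinary least squares; the data points are invented for illustration, and only the method is meant to mirror the approach.

```python
import numpy as np

# (lines_of_code, actual_prompt_tokens) pairs. In the real system these
# would come from logged runs; these numbers are made up for illustration.
sas_loc = np.array([120, 340, 560, 900, 1400, 2100])
tokens  = np.array([6100, 16200, 26900, 42800, 66300, 99100])

# Fit tokens ≈ BASE + LOC * rate with ordinary least squares.
A = np.vstack([np.ones_like(sas_loc), sas_loc]).T
(base, rate), *_ = np.linalg.lstsq(A, tokens, rcond=None)

# How well does a linear LOC model explain token usage?
corr = np.corrcoef(sas_loc, tokens)[0, 1]

def estimate_tokens(loc: int) -> int:
    return int(base + rate * loc)

print(f"rate≈{rate:.1f}, correlation≈{corr:.3f}")
```

A fit like this makes the estimator's error measurable instead of invisible: any script whose actual token count lands far from the fitted line is exactly the kind of outlier that was being silently chunked.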
Those scripts took 2 to 5 minutes instead of 30 to 60 seconds.

I switched to LOC-based estimation (BASE + SAS_LOC × 47), calibrated against actual token counts from real runs. The correlation was 0.942. Single-pass generation rate jumped from 67.6% to 91.9%.

That 21% error rate was invisible until I measured actual token counts against the estimation model. One in five scripts was taking 3 to 5 times longer than necessary, and nothing in the logs suggested a problem.

### Pivot 3: Pure LLM to Hybrid Extraction + Injection

This was the pivotal moment. A production script contained a SELECT/CASE block with 143 WHEN clauses, each mapping a set of numeric product codes to an insurance category. The LLM generated approximately 10% of them.

Not because it couldn't. When asked to generate just the mapping table in isolation, it performed better. But in the context of generating an entire module (imports, class structure, helper functions, processing logic, and the mapping table), it ran out of output budget and silently truncated the table.

Worse, there was no signal that truncation had occurred. The generated code was syntactically valid. It just covered 14 of 143 business rules.

The solution came together in stages.

First, I built rule-based parsers that could extract every SELECT/CASE mapping, every RETAIN variable, and every conditional block from raw SAS into structured objects. No LLM needed. These ran in the semantic parser, so the same extracted patterns could feed any target platform: PySpark, Snowflake, dbt.

Next, I fed those extracted patterns into the LLM prompt with explicit counts: "This script contains 143 WHEN clause mappings across 4 SELECT/CASE blocks. DO NOT OMIT any mappings." That alone pushed coverage from ~10% to 60-70%. Better, but not enough for real-world use.

The real safety net came after generation. A validator counted how many of the extracted patterns actually appeared in the generated code, computing both a coverage percentage (are the patterns present?)
and a value accuracy percentage (are the values correct?). If coverage fell below 95%, the missing patterns were injected directly into the generated code using AST-based insertion: find the right function, find the right insertion point, generate the correct PySpark syntax from the extracted objects.

Results on three client scripts:

| Script | Size | Patterns | Coverage |
|----|----|----|----|
| Script A | 706 lines | 143 WHEN clauses, 72 RETAIN vars | 100.0% |
| Script B | 1,495 lines | 12 SELECT/CASE mappings | 94.5% |
| Script C | 451 lines | 4 RETAIN vars, FIRST./LAST. | 100.0% |

Average: 98.2%, up from ~10% for large pattern sets.

The lesson: the LLM was great at understanding what a mapping table does. It just couldn't copy 143 numbers without scrambling them. Once I accepted that, the design became obvious. Extract the data deterministically, let the LLM handle the structure.

### Pivot 4: Fixing Coverage Wasn't Enough

This was the most surprising failure. After Pivot 3, pattern coverage was near-perfect. But when I validated the actual values in the generated code against the source, accuracy on large pattern sets was 28.6%.

The LLM was generating .isin() calls with the right number of values in roughly the right positions, but the values themselves were wrong. Transposed digits, reused codes from adjacent mappings, plausible-looking numbers that didn't exist in the source.

The most revealing detail was this: the LLM often put the correct values in code comments while generating incorrect values in the executable code. It understood the logic. It just couldn't maintain precision across 143 entries during code generation.

I measured the accuracy threshold across different pattern sizes:

| Pattern Size | LLM Value Accuracy |
|----|----|
| < 20 WHEN clauses | 80–90% |
| 20–50 clauses | 60–70% |
| 50–100 clauses | 30–40% |
| > 100 clauses | 25–30% |

The solution was a template-based transformer that activates when value accuracy drops below 80%.
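At its core, this kind of replacement is pure template formatting over the extracted mapping objects, with no LLM in the loop. A toy sketch of rendering a PySpark when/otherwise chain; the tuple shape and function name are hypothetical, not the system's actual schema:

```python
# Render a PySpark when/otherwise chain purely from extracted mapping
# objects. Each mapping is (list_of_codes, label); "F" refers to
# pyspark.sql.functions in the generated code.
def render_case_mapping(column: str, target: str,
                        mappings: list[tuple[list[int], str]],
                        default: str) -> str:
    parts = []
    for i, (codes, label) in enumerate(mappings):
        head = "F.when" if i == 0 else ".when"
        codes_txt = ", ".join(str(c) for c in codes)
        parts.append(f'{head}(F.col("{column}").isin({codes_txt}), F.lit("{label}"))')
    parts.append(f'.otherwise(F.lit("{default}"))')
    body = "\n    ".join(parts)
    return f"{target} = (\n    {body}\n)"

code = render_case_mapping(
    "product_code", "category",
    [([101, 102], "Motor"), ([210], "Home")],
    "Other",
)
print(code)
```

Because every value in the rendered chain comes straight from the extracted objects, accuracy is guaranteed by construction; splicing the rendered block into the right place in the generated module is the transformer's remaining job.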
It locates the LLM-generated SELECT/CASE sections using AST parsing (with regex fallback), removes them, and replaces them entirely with code generated deterministically from the extracted pattern objects. No LLM involved in the replacement. Pure template formatting.

| Metric | Before | After |
|----|----|----|
| Value accuracy | 28.6% | 100.0% |
| Pattern coverage | 100% | 100% |
| Overhead | — | < 500ms |

The transformer pipeline now has a clean separation. One transformer handles coverage by injecting missing patterns. Another handles accuracy by replacing wrong values. They run in sequence, each solving a distinct failure mode.

Without value-level validation, I would never have known. The patterns were all there. 100% coverage. The code existed. It just wasn't correct.

## Scaling: Parallelizing LLM Pipelines

With generation quality solved, the bottleneck shifted to speed. A batch of 34 scripts took 33 minutes when processed sequentially. For enterprise migrations with 200+ scripts, that meant hours of wall-clock time.

The obvious solution, making everything async, has a non-obvious trap when your bottleneck is an external API with rate limits. I learned this the hard way. My first fully concurrent configuration triggered 72 rate-limit errors in a single batch run. Each error added a 60-second retry delay from the SDK, turning a speed optimization into a slowdown.

The problem was compounding. I had designed a two-level parallelism model. At the outer level, multiple scripts generate concurrently. At the inner level, within each script, multiple modules generate concurrently. These levels multiply: five concurrent scripts times five concurrent modules means up to 25 simultaneous API calls.
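One way to keep that compounding in check is a pair of semaphores: one bounding concurrent scripts, one globally bounding in-flight API calls. This is a sketch under assumed limits, with a sleep standing in for the LLM request; the names and numbers are illustrative, not the system's actual code.

```python
import asyncio

async def generate_module(script: str, m: int, call_limit: asyncio.Semaphore) -> str:
    async with call_limit:          # every API call counts against the global cap
        await asyncio.sleep(0.01)   # stand-in for the LLM request
        return f"{script}:module_{m}"

async def generate_script(script: str, script_limit: asyncio.Semaphore,
                          call_limit: asyncio.Semaphore) -> list[str]:
    async with script_limit:        # outer level: how many scripts run at once
        return await asyncio.gather(
            *(generate_module(script, m, call_limit) for m in range(5))
        )

async def run_batch(scripts: list[str]) -> list[str]:
    script_limit = asyncio.Semaphore(3)   # concurrent scripts
    call_limit = asyncio.Semaphore(15)    # global cap on simultaneous calls
    results = await asyncio.gather(
        *(generate_script(s, script_limit, call_limit) for s in scripts)
    )
    return [m for modules in results for m in modules]

modules = asyncio.run(run_batch([f"script_{i}" for i in range(6)]))
print(len(modules))  # 30 modules generated
```

The key design point is that the inner cap is global rather than per-script, so the peak number of simultaneous calls is bounded regardless of how the two levels multiply.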
Our endpoint had a rate limit of 20 requests per minute.

The production-safe configuration landed at 3 scripts × 5 modules = 15 peak concurrent calls, staying under the 20 RPM limit with a 25% safety buffer.

Batch results:

| Metric | Sequential | Parallel | Improvement |
|----|----|----|----|
| 34-script batch | 33 min | 19 min | 42% faster |
| Metadata extraction (34 scripts) | 10.2 min | 0.9 min | 11x faster |
| Single script, 5 modules | 4.6 min | 2.1 min | 2.2x faster |

But scale also surfaced edge cases. A 49-script batch run hit two problems.

First, a single script with 265,000 tokens that exceeded every handling tier. The system had no graceful degradation path and simply failed. This drove the design of a multi-tier generation strategy: single-pass for small scripts, modular for medium, multi-pass chunking for very large, and graceful rejection with guidance for scripts that exceed all tiers.

Second, 21 rate-limit errors from sequential generation of very large modules. Individual modules of 175,000 to 191,000 tokens sent back-to-back exhausted the token quota even within the concurrency limit. The fix required token-aware throttling, tracking not just the number of concurrent calls but the total tokens in flight.

Parallelizing LLM pipelines, it turned out, was less about async/await and more about resource scheduling: rate limits, token budgets, and the compounding effect of nested parallelism that no load test on toy data would have revealed.

## The Surprises: What I Didn't Expect

Every real-world system has war stories. Here are the ones that changed how I think about LLM-powered engineering.

**The one-line fix that caused a 30x improvement.** The semantic parser had a bug where it matched the last `run;` statement in the file instead of the first when computing operation boundaries. The fix was adding a single break statement to a loop. Average operation span accuracy went from 8 lines to 250 lines, and one script's chunking dropped from 57 segments to 15.
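A simplified reconstruction of that bug (the real parser is regex-based and far larger): the operation should end at the first `run;` after its start, but without the break the loop keeps matching and ends up at the last one.

```python
# Find where a SAS step ends: the FIRST "run;" at or after `start`.
def find_step_end(lines: list[str], start: int) -> int:
    end = len(lines) - 1
    for i in range(start, len(lines)):
        if lines[i].strip().lower() == "run;":
            end = i
            break          # the one-line fix: stop at the first match
    return end

script = [
    "data premiums;",            # 0
    "  set policies;",           # 1
    "run;",                      # 2  <- correct boundary for this step
    "proc sort data=premiums;",  # 3
    "run;",                      # 4  <- what the buggy version returned
]
print(find_step_end(script, 0))  # 2
```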
The entire downstream pipeline improved because of one misplaced loop termination.

**The prompt injection nobody intended.** SAS code uses %variable syntax for macro variables. When I included SAS examples in LLM prompts, Python format strings like {variable_name} inside the examples were interpreted as Python string placeholders, causing KeyError exceptions. The fix was trivial (escape the braces), but the failure mode was silent. The prompt was being sent with mangled examples, and the LLM was still generating something, just with degraded quality. I only caught it because generation quality was worse than expected on scripts with specific macro patterns.

**Temperature zero doesn't mean deterministic.** During dbt generation testing, two identical LLM calls on the same script, same prompt, same model, temperature=0.0, produced different schema names. One correctly extracted the schema from a comment block. The other used a default placeholder. This led to a hard architectural principle: any value that must be consistent across runs must come from the deterministic parser, never from LLM inference. I built a post-generation validation and transformation pipeline specifically to enforce this. Detect placeholder values, replace them with parser-extracted values, deterministically. Prompt improvements got us from 50% to 75% consistency. Post-processing got us to 100%.

**The 6-bug launch.** The semantic architecture, the pivotal redesign that delivered 85-90% completeness, was completely non-functional at launch. Six bugs: it passed a list instead of an object, called three non-existent methods, had Python format string collisions in prompts, and contained an incomplete stub that still generated the old naming format. All six were fixed in a single release cycle. The lesson wasn't "test more" (though yes). It was that the architecture was right even when the implementation was wrong. Every metric improved once the bugs were fixed.
Sometimes you need to ship the design to validate it.

**The complexity scorer that couldn't see complexity.** A 706-line production script containing 64 business rules, hash objects (an advanced data structure), 70+ retained variables, and a 429-line single processing block scored "Low Complexity." The scorer counted keywords, not internals. It saw "2 SELECT statements" instead of "64 WHEN conditions inside 2 SELECT statements." Migration effort was estimated at 2 to 3 days when the actual work took 5 to 8 days. Across a 100-script portfolio, this systematic underestimation affected roughly 25-35% of scripts.

## Design Principles That Emerged

These aren't principles I started with. They're principles the system taught me.

- **Use LLMs for semantics, determinism for data.** If you need exact values, don't ask the LLM. Extract them yourself and inject them into its output.
- **Validate values, not just structure.** "The function exists" and "the function is correct" are different claims.
- **Extract once, use everywhere.** Platform-agnostic pattern extraction feeds every downstream generator. Not premature abstraction, just good separation.
- **Post-process, don't prompt-engineer.** Prompt improvements plateau. Deterministic post-processing reaches 100%.
- **Measure before you parallelize.** Concurrency compounds in non-obvious ways when your bottleneck is an external API.
- **Ship the design, iterate the implementation.** Six bugs at launch, four pivots, ~1,550 lines deleted. Commit to your architecture.

## The 80/20 Trap

LLMs get you to 80% fast. The first time you see a 500-line SAS script translated into clean, well-structured PySpark in 30 seconds, it feels like magic. And for simple scripts with straightforward data transformations, basic aggregations, and linear processing flows, that 80% is genuinely useful.

But the remaining 20% is where the product lives or dies.

The 143rd mapping in a lookup table. The 71st retained variable. The threshold value that determines whether a policyholder gets an 18% or 25% discount.
These are the details that make enterprise code correct, not just plausible. And they're exactly the details that LLMs handle least reliably.

Every accuracy failure I described in this article came from that last 20%. The code always looked right. It would have passed a code review focused on architecture and style. The values were just wrong. And in regulated industries, wrong values aren't a minor bug. They're a compliance incident.

The hybrid approach isn't a workaround for LLM limitations. It's what the architecture looks like when you take the last 20% as seriously as the first 80%. And it generalizes well beyond code migration. Anywhere AI-generated output must be verifiable, the same pattern applies: let the model reason about structure, then use deterministic systems to guarantee the details.

That's the system's half of the story. The other half is human. Every script the system generates goes through an engineer who reviews the code, executes it, compares the output against the original, and signs off before it reaches production. In a regulated industry, that accountability is non-negotiable, and it shouldn't be.

What the system changes isn't whether a human is involved, but what they spend their time on. Without automation, an engineer spends 5 to 8 days per script, most of it mechanical rewriting. With the system, that same engineer spends 2 to 3 hours focused on what actually requires human judgment: business logic correctness, edge case handling, and the confidence to say "this is ready."

If you're building LLM-powered tools for production, the question worth asking isn't "how good is the model?" It's "where exactly does the model's output stop being trustworthy, and what am I going to do about it?"

*The system described in this article has been tested on real migration projects totaling 100+ scripts across multiple enterprise clients. All metrics cited are from runs on real client data, not synthetic benchmarks.*