# I Asked 5 LLMs to Write the Same SQL Query. Here's How Wrong They Got It

So this started because of a Slack argument.

Someone on my team pasted a SQL query from ChatGPT straight into a pull request. No testing. No checking. And the query looked fine. Clean formatting, proper aliases, and even had a comment explaining the logic.

One problem: it was returning wrong numbers. Off by about 15%. The join logic was subtly broken in a way that silently dropped rows instead of throwing an error.

This happens at least once a week now. And every time, the same debate kicks off: "AI-generated SQL is fine, you just need to check it." Cool. But if you need to fully understand and verify every query anyway, what exactly is the AI saving you?

I decided to stop arguing about it and actually test it.

## What I Did

I took 10 SQL problems, real ones, the kind that show up in actual data engineering work, and ran each one through five LLMs. Same prompt. Same schema. Same expected answer. Then I graded every single response against the correct output.

Not "does it look right." Does it return the right numbers?

The models:

- GPT-5.2 Thinking
- Claude Sonnet 4.6
- Gemini 3.1 Pro
- Llama 4 Maverick
- Mistral Large

No fancy system prompts. No few-shot examples. No "act as a senior data engineer" preamble. I just asked the question the way a normal person would at 2 pm when they're stuck on a query.

The schema: a boring e-commerce dataset. Orders, order items, customers, products. 500K orders, 1.2M line items, 80K customers.
The kind of tables every data person has seen a thousand times.

```sql
CREATE TABLE orders (
    order_id        INT PRIMARY KEY,
    customer_id     INT,
    order_date      DATE,
    status          VARCHAR(20),  -- 'completed', 'cancelled', 'refunded', 'pending'
    total_amount    DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    region          VARCHAR(50)
);

CREATE TABLE order_items (
    item_id    INT PRIMARY KEY,
    order_id   INT,
    product_id INT,
    quantity   INT,
    unit_price DECIMAL(10,2),
    returned   BOOLEAN
);

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    signup_date DATE,
    plan_type   VARCHAR(20),  -- 'free', 'basic', 'premium'
    country     VARCHAR(50),
    is_active   BOOLEAN
);

CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    category     VARCHAR(50),
    subcategory  VARCHAR(50),
    product_name VARCHAR(200),
    cost_price   DECIMAL(10,2)
);
```

Every "correct" answer was written by me and verified against the actual data. If a model's query returned different numbers, it failed. Simple as that.

## The 10 Queries (and Where Things Got Ugly)

### Q1: Basic aggregation

Revenue by region for completed orders in 2025.

All five got it right. Honestly, if any model had whiffed on this, I would've just closed my laptop and gone outside. Moving on.

### Q2: Customer retention cohort

For each monthly signup cohort, what % placed at least one order in each of the next 3 months?

2 out of 5.

This is where I started taking notes. GPT-5.2 and Claude nailed it. Gemini's version ran fine, but used `order_date` instead of `signup_date` for the cohort definition. Completely different question. Llama had a syntax error in the `generate_series` call (come on). And Mistral? Mistral gave me a cumulative retention curve instead of month-by-month retention.

Here's what bugged me most about Mistral's answer: the output table looked totally reasonable. Nice clean percentages, declining over time as you'd expect. If you didn't already know the right answer, you'd ship it. That scared me a little.

### Q3: Running total with yearly reset

Running sum of revenue per customer, resetting every January 1st.

4 out of 5.
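For reference, the reset here isn't exotic: it's just a second partition key on the window. A Postgres-flavored sketch of the standard approach (my illustration against the schema above, not the graded answer key or any model's output):

```sql
-- Running revenue per completed order, per customer, restarting each
-- January 1st: partitioning by the order's year is what makes it reset.
SELECT
    customer_id,
    order_date,
    SUM(total_amount) OVER (
        PARTITION BY customer_id, EXTRACT(YEAR FROM order_date)
        ORDER BY order_date
    ) AS running_revenue
FROM orders
WHERE status = 'completed';
```

Drop the year from the `PARTITION BY` and the query still runs and still returns a smooth, plausible running total; it just never resets.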
Window functions are well-represented in training data, I guess. Llama forgot to partition by year, so the total just… never reset. One of those bugs where you'd stare at the output for 10 minutes before you noticed.

### Q4: Gaps and islands

Find consecutive days where daily revenue exceeded $50K. Return the start date, end date, and streak length.

1 out of 5. Yeah.

This is the classic `date - row_number()` trick, and apparently only Claude knows it. GPT-5.2 tried a self-join approach that almost worked but miscounted streaks crossing month boundaries by one day. The other three produced queries that ran, returned data, and were wrong in three completely different ways.

I'm starting to see a pattern here: if a problem requires knowing a specific algorithmic technique, most LLMs just… don't. They generate something that structurally resembles SQL for that type of problem, but it doesn't actually solve it.

### Q5: Sessionization

Group orders into "sessions" where a 30-day gap between orders starts a new session. Return session numbers and order counts.

2 out of 5. GPT-5.2 and Claude got it. Gemini went the recursive CTE route, which was technically correct but timed out on 500K rows. So, no. Llama and Mistral both broke on customers with exactly one order. Edge cases, man.

### Q6: Percentage of total within groups

Each product's revenue as a % of its category total, excluding returned items.

4 out of 5. Straightforward window function stuff. Llama applied the `returned = false` filter in the wrong place, after the percentage calc instead of before, so the denominator was inflated. This is exactly the kind of bug that makes it past code review because the query looks logically structured.

### Q7: Time-constrained funnel

Of Q1 2025 signups, what % ordered within 7 days? 30 days? 90 days?

3 out of 5. GPT-5.2, Claude, and Gemini all got it (different approaches, same answer, love to see it). Llama forgot to find the first order date per customer, so repeat buyers threw off the funnel numbers.
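The crux of this one is collapsing each customer to their first order before bucketing, with cumulative windows. A Postgres-flavored sketch of one approach that gets both right (my illustration; the `FILTER` clause and `date + integer` arithmetic are Postgres-specific):

```sql
-- One row per Q1 2025 signup, with the date of their FIRST order (if any);
-- LEFT JOIN keeps never-buyers in the denominator.
WITH first_orders AS (
    SELECT c.customer_id,
           c.signup_date,
           MIN(o.order_date) AS first_order_date
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.signup_date >= DATE '2025-01-01'
      AND c.signup_date <  DATE '2025-04-01'
    GROUP BY c.customer_id, c.signup_date
)
SELECT
    100.0 * COUNT(*) FILTER (WHERE first_order_date <= signup_date + 7)  / COUNT(*) AS pct_7d,
    100.0 * COUNT(*) FILTER (WHERE first_order_date <= signup_date + 30) / COUNT(*) AS pct_30d,
    100.0 * COUNT(*) FILTER (WHERE first_order_date <= signup_date + 90) / COUNT(*) AS pct_90d
FROM first_orders;
```

The `<=` comparisons are what make the buckets cumulative: a customer who buys on day 5 counts toward all three percentages, not just the first.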
Mistral made the buckets exclusive instead of cumulative.

These are the kind of mistakes a junior analyst would make. And honestly, I've made both of them at various points in my career. But that's the thing. I learned from making them. The model just… makes them again next time.

### Q8: Year-over-year with NULL handling

Monthly revenue 2024 vs 2025, side by side, with YoY growth %. Don't divide by zero.

3 out of 5. The NULL/zero trap got Gemini (used `COALESCE(prior_year, 0)` in the denominator, which gives you a meaningless growth rate) and Llama (jumbled the FULL OUTER JOIN in a way I honestly couldn't follow even after reading it three times).

### Q9: Recursive hierarchical query

Build the full category path for each product using a self-referencing `parent_category_id` column. Like "Electronics > Computers > Laptops."

1 out of 5.

Recursive CTEs are apparently where LLMs go to die. Only GPT-5.2 got it right. Claude walked the tree in the wrong direction and got reversed paths. The other three had various issues: wrong base cases, infinite loops, and missing anchor members. I was genuinely surprised by this one because recursive CTEs aren't that exotic. But I guess they're underrepresented enough in training data that the models can't reliably generate them.

### Q10: The Friday afternoon special

"Find customers who purchased in 3+ categories, spent over $1K total, but whose average order value is declining. Compare their last 3 orders to their first 3. Only include customers active for 6+ months."

0 out of 5.

Zero. None. Not a single model got this fully right.

And look, this is a hard query. Multiple joins, multiple conditions that interact with each other, edge cases around customers with fewer than 6 orders, potential date ties when determining "first 3" and "last 3." I get it.

But this is also exactly the kind of question a stakeholder asks you on a Friday at 4 pm. This is the real world.
And right now, no LLM I tested can handle it.

The most common failure was the "declining AOV" comparison. Models either compared the wrong orders (didn't handle date ties), computed the average wrong (averaged order totals instead of per-order item spend), or just silently ignored the 6-month activity requirement.

## The Scoreboard

| Model | Got It Right | Partial Credit | Wrong | Score |
|----|----|----|----|----|
| GPT-5.2 Thinking | 7 | 2 | 1 | 70% |
| Claude Sonnet 4.6 | 7 | 2 | 1 | 70% |
| Gemini 3.1 Pro | 5 | 2 | 3 | 50% |
| Llama 4 Maverick | 4 | 2 | 4 | 40% |
| Mistral Large | 4 | 2 | 4 | 40% |

"Partial credit" = the approach was right, but edge cases or subtle logic errors produced wrong numbers.

GPT-5.2 and Claude are basically tied. Both are strong on standard patterns, both are weak on the genuinely hard stuff. Gemini is a tier below: solid fundamentals, but trips up more often on logic. Llama and Mistral are fine for simple queries, but I wouldn't trust them on anything with real business logic.

## The Thing That Actually Worries Me

It's not that LLMs get SQL wrong sometimes. Everything gets stuff wrong sometimes. I get stuff wrong sometimes.

What worries me is how they get it wrong.

When I write a bad query, it usually errors out. Syntax error. Type mismatch. Column not found. Something loud and obvious. When an LLM writes a bad query, it runs perfectly, returns a clean result set with nice column names and reasonable-looking numbers, and is just… wrong. Silently. Confidently.

I keep calling this the "looks right" problem, and I think it's the most dangerous thing about using LLMs for data work specifically. In software engineering, bad code tends to crash. In data engineering, bad queries tend to return plausible-looking garbage that nobody questions until someone makes a bad business decision based on it.

## What I Actually Think Now

Look, I'm not anti-LLM for SQL. I use them myself literally every day.
They're great for scaffolding queries when I know the pattern but forgot the syntax. Generating boilerplate CREATE TABLE statements. Explaining someone else's ugly query back to me in plain English. First drafts that I then rewrite.

But after running this experiment, a few things crystallized for me.

**Verify against known output, not against vibes.** Don't just eyeball the result and go "yeah, that looks about right." Run it against data where you already know the correct answer. If you can't do that, you shouldn't be shipping it.

**Complexity kills accuracy.** For simple stuff (Q1, Q3, Q6), LLMs are reliable. For anything with multiple interacting conditions (Q4, Q9, Q10), treat the output like an untested pull request from someone who doesn't understand your data.

**You need to know SQL more now, not less.** I've heard the take that LLMs mean analysts don't need to learn SQL deeply anymore. After this? I think it's the opposite. The only way to catch a confident wrong answer is to actually understand the logic yourself. If you can't read the query and spot the bug, the AI isn't helping you. It's creating a liability.

## Methodology

- Identical prompt to every model, verbatim
- Default chat settings, no API tuning
- First response only, no retries, no "actually can you try again"
- Correct answers written by me, verified against actual data with known output
- Tested over 10 days in Feb 2026. Models change fast. This is a snapshot
- Schema DDL and all 10 prompts in a GitHub gist [link] so you can reproduce it

If you test this on a model I missed, or run it again in 6 months when these models are smarter, I want to see your results. Seriously. Comments or Twitter, either works.

These models are getting better, fast. A year ago, the scores would've been way worse. But right now, in February 2026, the answer to "can I trust LLM-generated SQL?" is: it depends on the query, and you'd better check either way.