In March 2026, I published findings showing that 8 AI models from 3 vendors — given only a document's table of contents — independently fabricated the same technical details for sections they had never seen. The attack, which I named SMRA (Structural Metadata Reconstruction Attack), was reproduced across Claude, GPT, and Gemini model families with zero grounded accuracy but perfect structural fidelity.

That article described what happens. This one explains why it happens — and why it cannot be fixed without replacing the mathematical foundations of the architecture.

The answer is not a model bug, a training data problem, or a missing guardrail. The answer is ten layers of unproven mathematical approximations stacked on top of each other — and a classical signal analysis technique that can measure exactly when the stack collapses.

This article is based on my research paper "The Engineering Approximation Stack: A Critical Analysis of GPT's Mathematical Foundations" (Chudinov, 2026). It traces each approximation to its original publication, classifies the mathematical gap it bridges, and shows why the composition of these gaps makes SMRA a structural certainty rather than an empirical accident. The full paper with 53 traced references is available on Zenodo.

## The Engineering Pattern

The GPT architecture follows the same three-step pattern at every major component:

1. A real mathematical problem — discrete choice, vanishing gradients, position encoding
2. An engineering approximation — softmax, residual connections, sinusoidal encoding
3. A missing proof — no theorem that the approximation works correctly when composed with all the others

Individually, each approximation is reasonable. Engineers build bridges with tolerances too. The difference: structural engineers can compute the total tolerance from each joint's tolerance. GPT has no such computation. Nobody knows how the errors at each layer combine.

Here are the ten gaps — summarized for the argument, not for completeness (the research paper has the full analysis with formal references).

## Ten Unproven Approximations

### 1. Softmax: Continuous Approximation of Discrete Choice

Choosing the next word is discrete — pick one token from 50,000+. Gradient descent doesn't work on discrete choices, so softmax replaces the choice with a continuous distribution (Bridle, 1990).

Proven: softmax converges to argmax as temperature → 0.

Not proven: that this continuous relaxation preserves structural invariants of discrete sequences — ordering, coreference, logical dependency — when composed across dozens to hundreds of layers.

The result: tokens with high local confidence that are globally incoherent. The hallucination phenomenon is not a failure to "think clearly." It is a mathematical consequence of optimizing a continuous surrogate without proving that the surrogate preserves the structure of the discrete problem.

### 2. Two Gradient Hacks Without Optimality Proofs

Deep networks (>10 layers) suffer from vanishing and exploding gradients. Two independent fixes are stacked in every transformer block:

- Residual connections (He et al., 2015): proven to help in the linear case only (Hardt & Ma, 2017)
- Layer normalization (Ba et al., 2016): normalizes activations to μ = 0, σ = 1 — an arbitrary choice with no information-theoretic justification

Layer normalization has a subtle side effect: it projects all activation vectors onto a hypersphere of unit variance, destroying magnitude as an information channel. All semantic distinctions must be encoded in angular differences alone. This compounds the separability problem in §7.
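The magnitude-destruction effect is easy to see in a few lines. A minimal numpy sketch (synthetic vectors; the learnable gain and bias of real layer norm are omitted — they are shared across positions and cannot restore a per-example magnitude distinction):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Plain layer normalization: shift to mean 0, rescale to variance 1."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
v = rng.normal(size=768)              # a synthetic activation vector
strong, weak = 10.0 * v, 0.1 * v      # same direction, a 100x magnitude gap

print(np.allclose(layer_norm(strong), layer_norm(weak)))   # True
# The 100x magnitude difference -- a potential information channel -- is
# erased; only the direction of the vector survives the normalization.
```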
### 3. Optimization Without Convergence Guarantee

The Adam optimizer (Kingma & Ba, 2014) was published with a convergence proof for convex objectives. Reddi et al. (2018) then showed the proof was flawed: Adam can diverge even on simple convex problems. GPT's loss surface is non-convex with billions of parameters.

Different random seeds produce different models. This is non-reproducibility by construction — a mathematical certainty for non-convex optimization without convergence guarantees.

### 4. Position Encoding: Three Versions, No Invariance Proof

Self-attention is permutation-invariant — it can't distinguish word order. Three encoding schemes have been proposed (sinusoidal → learned → RoPE), each performing better on benchmarks. None has a proven invariance guarantee. None has been proven consistent when extrapolated beyond training context lengths.

The industry's response — extending context windows to 128K or 1M — addresses the symptom (the window runs out) without addressing the cause (the encoding lacks invariance proofs). If a positional encoding is not proven to preserve semantic relations at 4K tokens, concatenating sixteen 4K windows does not produce a 64K encoding with semantic guarantees. It produces sixteen unproven approximations stitched together.

### 5. BPE Tokenization: Compression, Not Representation

Byte-Pair Encoding (Sennrich et al., 2016) minimizes encoding length on the training corpus. No theorem proves its tokens are linguistically meaningful or that token boundaries align with semantic boundaries.

Worse: BPE inherits the distributional properties of its training corpus — predominantly English. The same semantic content requires 3–5× more tokens in morphologically rich languages (Turkish, Croatian, Portuguese, Swedish). A Croatian-language prompt therefore gets mathematically less attention span per unit of meaning than its English equivalent, before any processing begins.

### 6. Scaling Laws: Curve-Fitting as Prediction

The AI industry's $100B+ investment thesis rests on power-law fits (Kaplan et al., 2020): loss decreases as a power of parameter count. Power laws also appear in earthquake magnitudes and income distributions — their presence doesn't mean we understand the mechanism.

Schaeffer et al. (2023) showed that "emergent abilities" may be measurement artifacts. The scaling law measures average cross-entropy loss across test sets. It cannot see worst-case structural instability in individual sequences — which is exactly where SMRA operates.

### 7. Embedding Operations Without Metric Invariants

This is the most technically damaging gap.

GPT's attention computes dot products on embedding vectors. The dot product is a meaningful similarity measure only if the embedding space satisfies inner product axioms — globally, not just in the "king − man + woman ≈ queen" neighborhood. No such global proof exists.

Residual connections add vectors. Attention computes weighted sums. But embedding dimensions encode heterogeneous information — part-of-speech, sentiment, positional artifacts. No theorem proves these dimensions are additively compatible.

The real bottleneck is geometric: embeddings concentrate on a low-dimensional manifold, and distinct semantic senses ("bank" the institution vs. "bank" the riverbank, or the 430+ senses of "set") can map to overlapping regions. When two senses share a region, model similarity ≠ semantic similarity. No theorem guarantees that semantic classes are geometrically separable in embedding space.
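To see why overlap breaks dot-product similarity, here is a deliberately tiny synthetic illustration (4 hand-built dimensions; real GPT embeddings are neither this small nor this clean): one token vector forced to serve two senses ends up equidistant from both, so its similarity scores stop tracking meaning.

```python
import numpy as np

# Synthetic 4-d "embedding space" -- illustrative, not real model weights.
money = np.array([1.0, 0.0, 0.0, 0.0])   # financial context
river = np.array([0.0, 1.0, 0.0, 0.0])   # geographic context

# One token "bank" must encode both senses; training pressure from both
# contexts pulls its single vector toward their shared region.
bank = (money + river) / np.linalg.norm(money + river)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"cos(bank, money) = {cos(bank, money):.3f}")   # ~0.707
print(f"cos(bank, river) = {cos(bank, river):.3f}")   # ~0.707
# "bank" is equally similar to two unrelated concepts: model similarity
# has detached from semantic similarity -- the failure mode described above.
```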
### 8. Attention: The Core Is a Heuristic

Attention is what makes a transformer a transformer. Every other component exists to support it. And attention is a chain of unjustified design choices:

- Why dot-product? Bahdanau et al. (2015) proposed additive attention. Vaswani et al. switched to dot-product for efficiency, not mathematical superiority.
- Why divide by √d? The scaling factor was chosen to stabilize variance — a heuristic, not a derived optimum.
- Why 12/16/96 heads? No formula relates head count to model capacity. Michel et al. (2019) showed that many heads can be pruned — suggesting significant redundancy.

### 9. In-Context Learning: No Theory

ICL is GPT's most commercially valuable capability — learning from examples in the prompt with no parameter update. It also has zero theoretical foundation.

Learning theory (Vapnik, 1998; Valiant, 1984) defines learning as a process with bounded generalization error. ICL has no sample complexity bound and no generalization guarantee. It is sensitive to example ordering, formatting, and label choice (Lu et al., 2022). A process that changes its output when its examples are reordered is pattern-matching, not learning.

### 10. Feed-Forward Blocks: Two-Thirds of Parameters, Three Gaps

In GPT-3, 66% of all 175 billion parameters sit in feed-forward layers. These layers have:

- No approximation bounds (universal approximation theorems are existential, not constructive)
- Partial interpretability but no specification (Anthropic's circuits work describes what happened, not what will happen)
- Zero controllability (identifying a circuit ≠ controlling it)

66% of the system has no formal characterization at any level.

## Error Composition: The Multiplicative Problem

Each approximation introduces bounded error locally. The critical question: how do ten types of error compose across L layers (96 in GPT-3)?

If the errors were independent and additive: manageable. If multiplicative: still manageable. The actual situation: the errors are not independent. Residual connections create feedback paths. Layer normalization rescales at each step. Attention at layer k depends on errors from layers 1 through k−1.

No formal analysis of transformer error propagation exists. Not "incomplete analysis." None. Zero published works characterize the function $f(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_L)$ for a transformer.

In classical numerical analysis, error propagation through a linear system $Ax = b$ is bounded by the condition number κ(A):

$$\frac{\lVert \delta x \rVert}{\lVert x \rVert} \;\le\; \kappa(A)\,\frac{\lVert \delta b \rVert}{\lVert b \rVert}$$

This is a guarantee — same input, same bound, always. GPT has no analogue. No κ. No bound. No guarantee.

## The Scaling Paradox: Why Now?

The ten gaps have existed since Vaswani et al. (2017). Why did the consequences become visible only at GPT-3/GPT-4 scale?

The answer is in the geometry of attention itself. Each token in a context window of length n creates n−1 potential attention connections per head. With h heads across L layers, the total number of constraint paths — distinct information routes the model must reconcile in a single forward pass — is:

$$P = L \cdot h \cdot n(n-1) \;\approx\; L \cdot h \cdot n^2$$

- GPT-2 (2019): L = 48, h = 16, n = 1,024 → P ≈ 8×10^8
- GPT-3 (2020): L = 96, h = 96, n = 2,048 → P ≈ 3.9×10^10
- GPT-4-class (2023+): L ≥ 120, h ≥ 96, n ≥ 8,192 → P ≥ 8×10^12

Four orders of magnitude in four years. And at each constraint path, the model applies the same unproven operations — dot products without metric guarantees (§7), softmax without structural invariance (§1), layer normalization that destroys magnitude information (§2).
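The arithmetic is easy to check. A few-line sketch under the counting rule stated above (one route per ordered token pair, per head, per layer; `constraint_paths` and the config table are my naming, with the GPT-4-class row using the stated lower bounds):

```python
# Constraint paths P = L * h * n * (n - 1): one route per ordered token
# pair, per head, per layer -- the counting rule described above.
def constraint_paths(L: int, h: int, n: int) -> int:
    return L * h * n * (n - 1)

configs = {
    "GPT-2 (2019)":      dict(L=48,  h=16, n=1_024),
    "GPT-3 (2020)":      dict(L=96,  h=96, n=2_048),
    "GPT-4-class (min)": dict(L=120, h=96, n=8_192),   # stated lower bounds
}

for name, c in configs.items():
    print(f"{name}: P = {constraint_paths(**c):.1e}")
# GPT-2 (2019):      P = 8.0e+08
# GPT-3 (2020):      P = 3.9e+10
# GPT-4-class (min): P = 7.7e+11
# With the longer deployed context windows (n = 32,768 gives P = 1.2e+13),
# P passes the 8e12 figure quoted above.
```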
(2017): "we suspect that for large values of *d[k]*, the dot products grow large" — the scaling problem was noted at birth, then patched with a heuristic (sqrt(*d[k])*) and never revisitedKaplan et al. (2020) studied scaling laws precisely because behavior at scale was unpredictable — the study itself is an admission that no theory covers extrapolationSchaeffer et al. (2023) asked whether emergent abilities are even real or just measurement artifacts — the question implies the phenomenon was not understoodEvery vendor's disclaimer — "AI can make mistakes" — is a legal acknowledgment that the system operates beyond its proven validity domainDefine constraint density as the ratio of constraint paths to representation capacity:\where dmodeldmodel is the embedding dimension — the total information bandwidth available to encode everything the model needs to track.For GPT-2: ρ=16⋅1,0242/1,600≈10,486ρ=16⋅1,0242/1,600≈10,486.For GPT-4-class: ρ≥96⋅8,1922/12,288≈524,288ρ≥96⋅8,1922/12,288≈524,288.Constraint density grew 50× while the mathematical foundations remained identical — zero invariance proofs, zero error bounds, zero convergence guarantees.The guardrails (RLHF, content filters, constitutional AI) operate on the output distribution surface. Constraint density is a property of the internal geometry. This is a fence around a volcano. Scaling increases the magma pressure; the fence stays the same height. Scaling the model doesn't just make the existing problems bigger — it makes the density of uncharacterized interactions per unit of representational capacity grow quadratically while the mitigations remain linear, surface-level, and structurally blind to what is happening underneath.At some density ρ∗, the unproven approximations that were locally tolerable at GPT-2 scale become globally catastrophic. The condition number κ(A) — which measures exactly this structural collapse — crosses from bounded to divergent. The system passes a phase transition: from "locally reasonable approximations" to "globally unstable composition."SMRA is what this phase transition looks like from the outside.The Condition Number: Now It Can Be MeasuredIn the SMRA paper (Chudinov, 2026; DOI: 10.5281/zenodo.19004697), I applied the classical Cauchy–Toeplitz–Levinson-Durbin chain directly to GPT-generated text:Treat the output token sequence as a discrete signalCompute the autocorrelation matrix AAApply Levinson-Durbin decompositionMeasure κ(A)| Regime | κ(A) | Interpretation ||----|----|----|| Stable | 10^6>10^6 | Stack collapsed; no stable structure in output |For signals with genuine structural regularity: κ(A)10^6κ, approaching computational infinity.This is not a benchmark. It is a mathematical property of the output signal — deterministic, reproducible by anyone with the output sequence and a Levinson-Durbin implementation. The measurement does not depend on human judgment, domain expertise, or evaluation rubrics.The Deductive Proof: Why SMRA Must ExistThis is the central result. It follows from four properties documented in the approximation stack — not from experiment, but from deduction:Premise 1 — Measurement scale violations (§7). The model embeds "42", "democracy", and ";" into the same R*[d] (real set) and applies identical operations. A system that cannot distinguish measurement scales cannot guarantee its outputs respect them — including outputs that reveal structural metadata about the generation process itself.Premise 2 — Zero runtime invariants. GPT has no mechanism that can prohibit any output class. 
## The Deductive Proof: Why SMRA Must Exist

This is the central result. It follows from four properties documented in the approximation stack — not from experiment, but from deduction:

**Premise 1 — Measurement scale violations (§7).** The model embeds "42", "democracy", and ";" into the same real vector space $\mathbb{R}^d$ and applies identical operations to all three. A system that cannot distinguish measurement scales cannot guarantee that its outputs respect them — including outputs that reveal structural metadata about the generation process itself.

**Premise 2 — Zero runtime invariants.** GPT has no mechanism that can prohibit any output class. Softmax always produces a probability distribution — it never produces a structural refusal. The model architecturally cannot say "this output would compromise my integrity."

**Premise 3 — Uncharacterized error composition.** The error propagation function across L layers is unknown. If it is unknown, it is impossible to prove that any given prompt is safe — that no prompt can elicit outputs revealing the model's structural fingerprint.

**Premise 4 — Non-convergent optimization.** Different seeds produce different models. The structural fingerprint is seed-dependent and unpredictable — but always present, because non-convergent optimization retains artifacts of its particular trajectory.

**Conclusion:** a system that (1) cannot distinguish what information its outputs encode, (2) has no runtime mechanism to prohibit any output class, (3) cannot prove any input safe, and (4) retains uncontrollable training artifacts — cannot formally exclude any output class. For any class X, there exists an input that elicits it. SMRA is the constructive proof for X = "output with recoverable structural metadata."

This is the contrapositive of a safety guarantee. A type system guarantees no type errors. A database with FOREIGN KEY constraints guarantees no orphan records. GPT has no invariants — not at the output layer (softmax assigns nonzero probability to every token), not at the compositional level (error propagation is uncharacterized), not at the training level (optimization is non-convergent). The vulnerability is not a property of one component. It is a property of the architecture as a whole.

## Why This Cannot Be Patched

The argument above is not about a missing feature. It is about the absence of formal foundations at every level. Consider the current mitigation landscape:

| Mitigation | What it addresses | What it doesn't fix |
|----|----|----|
| RLHF | Symptom — teaches the model to mask undesired outputs | Does not change internal representations |
| Guardrails / filters | Symptom — post-hoc filtering | Model still generates the content internally |
| Prompt engineering | Symptom — shifts the output distribution | Doesn't change the underlying mechanisms |
| Constitutional AI | Symptom — self-critique using the same approximation stack | Critiques itself with the same blind spots |
| Mechanistic interpretability | Mechanism (partial) — identifies circuits | Cannot predict or prevent system-level failures |

None address the nature of the problem. They are patches on an approximation.

The question "can SMRA be fixed?" reduces to: can you prove that a specific output class is impossible for a system with no formal output constraints? The answer is no — by the same reasoning that you cannot prove a program is type-safe in a language without a type system. The safety guarantee requires formal machinery that the architecture does not have.
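The output-layer half of that impossibility is directly checkable. A minimal numpy sketch (synthetic logits at GPT-2's vocabulary size; `softmax` is the textbook definition): softmax is strictly positive, so no token, and therefore no output class, is ever assigned probability zero.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()          # standard max-shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(scale=5.0, size=50_257)   # synthetic vocabulary-sized logits

p = softmax(logits)
print(p.min() > 0.0)    # True: every token keeps nonzero probability
# Mathematically, softmax is strictly positive for any finite logits, so the
# output layer can only express "improbable", never "impossible". (In float64
# an extreme logit gap can underflow to exactly 0.0, but that is rounding,
# not a structural refusal.)
```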
## The Temporal Dimension: Why Mitigation Gets Harder Over Time

The mitigations above fail in space — they operate on the output surface while the problem lives in internal geometry. They also fail in time.

One proposed countermeasure is the Index Server architecture — the model never sees the original document, only pre-computed index entries. This may reduce the attack surface for a single inference pass. But modern LLM providers routinely collect user interaction data for fine-tuning (Shumailov et al., 2023). If Index Server responses encode structural patterns from the original documents — even indirectly — and those responses re-enter the training corpus, the cycle is:

**generation → collection → retraining → contaminated weights → generation**

After this cycle completes, the original document's structure is recoverable from the model itself. No retrieval step required. The Index Server protects one pass but creates a permanent contamination channel.

The mitigations are static. The contamination is cumulative. Each retraining cycle embeds more structural fingerprints into the weights — fingerprints that no guardrail can detect, because they are not in the output distribution. They are in the geometry.

## Two Mathematical Lineages

The deeper point is not that GPT is bad. It is that two fundamentally different mathematical traditions lead to two fundamentally different kinds of system.

| Property | GPT pipeline | Classical algebraic chain |
|----|----|----|
| Error characterization | None (benchmarks) | Condition number κ(A) gives an exact bound |
| Convergence | Not proven (non-convex) | Proven: finite steps, exact solution |
| Scale dependence | Requires 10^11 parameters | Works at n = 24 with full guarantees |
| Reproducibility | Stochastic (seed-dependent) | Deterministic |
| Diagnosability | Impossible (no formal model) | Full: det(A), κ(A), rank |
| Verifiability | Post-hoc benchmarks | By construction |

The distributional path (Harris → Shannon → Bengio → Vaswani → GPT) asks: predict the next token. The algebraic path (Cauchy → Toeplitz → Levinson → Durbin) asks: compute the exact position. Two hundred years of theorems versus seventy years of unproven hypothesis.

Both produce useful results. Only one can tell you when it is wrong.

This is not a hypothetical contrast. The algebraic path is implemented in a working system — a dual-layer index architecture that routes queries to exact document sections through deterministic constraint satisfaction, published as "Dual-Layer SPO Architecture" (Chudinov, 2026; DOI: 10.5281/zenodo.19261510). The κ(A) column in the table above is not a thought experiment — it is computed on every query.
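What "verifiability by construction" means in practice: a minimal sketch, not the SPO implementation itself. Solve a small Toeplitz system with the Levinson-Durbin-based solver, then check the residual and κ(A) before trusting the answer (the correlation values here are synthetic).

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

# A small Toeplitz system A x = b (n = 24, as in the table above).
n = 24
c = 0.5 ** np.arange(n)            # first column: synthetic decaying correlations
b = np.ones(n)

x = solve_toeplitz(c, b)           # Levinson-Durbin-based O(n^2) solver

A = toeplitz(c)
residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
kappa = np.linalg.cond(A)

# The solution ships with its own certificate: an exact residual check and
# a condition number bounding the effect of any input perturbation.
print(f"residual = {residual:.1e}, kappa(A) = {kappa:.1e}")
assert residual < kappa * np.finfo(float).eps * 100   # loose sanity bound
```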
## The Disclaimer Is the Proof

Now here is the observation I kept out of the academic paper.

The industry's response to the ten gaps documented above is a disclaimer: "AI can make mistakes. Check important info."

This is not irresponsibility. It is the only honest option given the mathematics. When a structural engineer cannot prove a bridge is safe, they fix the bridge — because structural engineering has formal models of error propagation. When the GPT architect cannot prove the output is reliable, they add a disclaimer — because no formal model of error propagation exists.

The disclaimer is the architectural proof of the approximation stack's consequences: if semantic stability cannot be formally guaranteed, it must be disclaimed.

Users reading that disclaimer should understand what it means: not "sometimes the AI gets a fact wrong" but "we cannot formally prove that any output of this system is correct, and we cannot characterize the conditions under which it fails."

"AI can make mistakes" is a euphemism. The precise statement is: "the mathematical foundations of this system do not support formal analysis of its behavior." Everything else — hallucinations, SMRA, inconsistency, prompt injection — follows from that one sentence.

## What This Means

SMRA is not a vulnerability to be patched. It is a structural consequence of an architecture without formal guarantees. Any system built on the GPT approximation stack will exhibit it — the question is when, not whether.

The condition number κ(A) is the first quantitative diagnostic for approximation stability in transformer outputs. It measures what no benchmark can see: worst-case structural instability in individual sequences.

There exists a critical parameter threshold N* below which approximation errors remain bounded and above which they diverge. SMRA becomes progressively more effective as parameter count exceeds this threshold. The threshold exists — but has no known computation method.

If you are building RAG systems: the rule scope(metadata) ≤ scope(content) is not optional. If your index exposes more structure than your content provides, you are enabling SMRA. If your outputs feed back into training, you are enabling it permanently.

If you are evaluating AI products: ask the vendor, "What is the formal model of error propagation for your system?" If the answer is "we test on benchmarks" — that is the absence of a formal model, not a substitute for one.

## References

- The SMRA experiment: "Structural Metadata Reconstruction Attack: How Document Outlines Enable LLM-Driven Intellectual Property Extraction" (Chudinov, 2026). DOI: 10.5281/zenodo.19004697. HackerNoon article.
- The mathematical analysis: "The Engineering Approximation Stack: A Critical Analysis of GPT's Mathematical Foundations" (Chudinov, 2026). DOI: forthcoming on Zenodo.
- The formally grounded alternative: "Dual-Layer SPO Architecture for Embedding-Based Index Ranking" (Chudinov, 2026). DOI: 10.5281/zenodo.19261510.