Nobody Reviewed the Model. They Just Reviewed the Code Around It

Before signing a contract, a customer's security team ran a vendor review, and one question on their list left our ML lead speechless: “Which exact model weights are running in production, and who reviewed the code that loads them?” We could only partly answer the first half: we knew the model's name on Hugging Face, but we had no idea which revision we were actually running, because our Dockerfile pulled “latest” from that repo every time we rebuilt the image. The second half of the question was worse: our loading code passed trust_remote_code=True, which meant a Python file written by a stranger, hosted on someone else's repo, ran automatically inside our container every time it started. Nobody on the team could tell the auditor what was in that file, because nobody had ever opened it.

That conversation didn't end the deal, but it should never have been necessary: we should have caught the problem ourselves long before an auditor did. We'd spent real effort locking down our Dockerfiles (pinned base images, no baked-in secrets, non-root users) and then quietly let an entirely different supply chain walk in through the model-loading code without a second glance.

The Part of the Supply Chain Nobody Was Watching

Traditional container supply chain advice is mostly about code dependencies: pin your packages, scan your base image, and don't pull untrusted Docker Hub images blindly. That advice still holds, but the AI era adds an artifact type that doesn't fit cleanly into any of it: the model itself. A checkpoint downloaded from a model hub isn't just data the way a CSV or a config file is. Depending on the format, loading it can execute arbitrary code. Pickle-based weight files can invoke arbitrary Python callables during deserialization, and even when a model ships in the safer safetensors format, library authors increasingly support trust_remote_code, which means the model repo can bundle its own Python modeling code that runs automatically at load time.
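
To make the pickle claim concrete, here is a minimal sketch using only the standard library. FakeCheckpoint and NoGlobalsUnpickler are hypothetical names invented for this illustration, and the harmless eval("6 * 7") call stands in for whatever a hostile file might run. Real weight files need some globals to load, so the restrictive unpickler is only there to show where the risk lives, not as a recommended loader.

```python
import io
import pickle


class FakeCheckpoint:
    """Hypothetical stand-in for a pickle-based weights file."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A hostile file could name any importable callable here instead.
        return (eval, ("6 * 7",))


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses every global lookup, so no callable can be resolved on load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


payload = pickle.dumps(FakeCheckpoint())

# Plain loading runs the embedded call; the result is eval's return value,
# not a FakeCheckpoint instance.
loaded = pickle.loads(payload)

# The restrictive unpickler rejects the same bytes before anything is called.
try:
    NoGlobalsUnpickler(io.BytesIO(payload)).load()
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(loaded, blocked)  # 42 True
```

Nothing in the loading code asked for eval to run; the file did. That is the property that makes a pickle-based checkpoint closer to an executable than to a data file.
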
We'd been treating trust_remote_code the way people used to treat curl | bash: a thing everyone does because it's convenient and that almost nobody stops to read first.

How We'd Gotten There

It started months earlier, ahead of a demo. We needed a model architecture that wasn't yet merged into mainline transformers, and the only path to it was the model author's own custom code, loaded via trust_remote_code=True. The demo went well, and that line of code stayed in the Dockerfile build step long after the deadline pressure had subsided. Nobody scheduled time to revisit it, because revisiting it didn't seem important until a customer's audit made it important.

Two more habits compounded the problem. First, we referenced the model by its repo name with no pinned revision, so the exact weights behind that name could change upstream between builds without triggering anything resembling a code review on our side. That is the same risk class as an unpinned, floating package version, except there's no CVE database tracking model repos the way there is for PyPI packages. Second, we baked the downloaded weights directly into the image during the build, which meant the only record of what we'd actually shipped was whatever happened to be on the hub the day CI last ran.

What We Changed

The fix wasn't exotic, just overdue. We started pinning every model reference to a specific commit revision rather than a branch name, the same way you'd pin a package to an exact version instead of a floating tag:

```python
from transformers import AutoModel

# Before: floating reference, trusts remote code blindly.
model = AutoModel.from_pretrained(
    "some-org/custom-extractor",
    trust_remote_code=True,
)

# After: pinned revision, no implicit remote execution.
# The revision below is a placeholder; substitute the real commit hash.
model = AutoModel.from_pretrained(
    "some-org/custom-extractor",
    revision="a1b2c3d4e5f6",
    trust_remote_code=False,
)
```

Where the custom architecture code was genuinely necessary, we stopped pulling it live at container start.
We vendored the specific file into our repository, gave it an actual code review the way we would any other dependency addition, and imported it locally instead of letting the hub serve it to us fresh on every build. That has a tangible cost: we no longer receive upstream fixes to that file automatically, and someone is now responsible for re-pulling and re-reviewing it on our schedule. I believe that's the right trade. A dependency that needs your attention to update is safer than one that updates itself without your knowledge, even if it is sometimes more inconvenient to maintain.

We also pinned our base image by digest instead of tag and moved off a community CUDA image we'd picked years earlier because it “just worked,” with no idea who maintained it or how often it got patched:

```dockerfile
FROM pytorch/pytorch@sha256:3fa1b2c...d9
```

And we started writing a small manifest alongside every image build. Nothing fancy, just a JSON file recording the base image digest, the pinned model revision, and a SHA-256 of the weight files actually shipped. It's a long way from a full provenance and attestation pipeline, and that was deliberate. We considered signing images with Cosign and generating proper SBOMs and in-toto attestations, but we decided that the complexity wasn't justified at our current team size. The manifest gives us most of the real value, an honest answer to “what's running in prod?”, for a fraction of the overhead. I would rather ship the basic version today than spend a quarter building the more complex pipeline while having no record at all in the meantime.

Where I'd Push Back on Myself

These efforts are only worth doing for production systems or widely used tools. Pinning revisions and vendoring remote code is friction, and friction has a cost: it slows down exactly the kind of rapid iteration that research work needs.
The line I'd draw is whether the artifact touches production data, a customer, or a deployment boundary. Below that line, we can move quickly and accept looser practices. Above it, the few hours of review this process takes are cheap compared to explaining to an auditor that nobody read the code that's been running in your containers for eight months.

Key Takeaways

- Treat trust_remote_code=True as you would curl | bash: convenient, common, and almost never actually reviewed by the people relying on it.
- Pin model references to a specific revision hash, not a branch or repo name that can change underneath you.
- Vendor and review any custom remote model code you genuinely need, rather than trusting a live pull for every build.
- Pin base images by digest, not tag, and know who actually maintains the image you're building on.
- A lightweight manifest recording exact artifact versions beats no record at all; don't let the ideal of full provenance tooling stop you from doing the basic version now.

Closing Thought

We spent a lot of energy locking down Dockerfiles and almost none on the model-loading code sitting right next to them, because one felt like infrastructure and the other felt like a research detail. I don't think that split holds up anymore. If a model checkpoint can execute code on load the same way a package can, it probably needs the same review gate a new dependency gets, which raises an awkward question most orgs haven't answered yet: whose job is that review, the security team's or the ML team's, and does either one currently think it's theirs?
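
Appendix: a manifest sketch

For readers who want a starting point for the lightweight manifest described under What We Changed, here is a minimal sketch. It is not the actual build script: the function names, the JSON field names, and the idea of passing the base image digest in as a string are all assumptions made for illustration. It records the three facts named earlier (base image digest, pinned model revision, SHA-256 of each shipped weight file).

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(base_image_digest: str, model_repo: str,
                   model_revision: str, weights_dir: Path) -> dict:
    """Collect the build facts into one dict; field names are hypothetical."""
    files = sorted(p for p in weights_dir.rglob("*") if p.is_file())
    return {
        "base_image_digest": base_image_digest,
        "model_repo": model_repo,
        "model_revision": model_revision,
        "weights_sha256": {
            p.relative_to(weights_dir).as_posix(): sha256_file(p) for p in files
        },
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    # sort_keys keeps the file stable across builds, so diffs show real changes.
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

A CI step could call build_manifest after the weights are downloaded and store the resulting JSON next to the image. Because the output is sorted and deterministic, a changed hash in a diff means the shipped weights changed, which is exactly the signal that was missing when the only record was whatever the hub served that day.
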