Physical AI, also known as embodied AI, is purported to be the next evolution in the quest to build autonomous systems that behave in the real world. This AI paradigm is poised to interact more closely with physical environments, going far beyond chatbots and floor-cleaning robots.

Physical AI has entered the hype machine and was the star of CES 2026. Robotics is on the move — quite literally — from the rollout of self-driving vehicles across the US to the deployment of autonomous equipment and cobots on construction sites, in warehouses, and on manufacturing floors.

Akhil Docca, senior robotics product marketing manager at Nvidia, tells The New Stack that these developments require more than your typical large language models — they require frontier architecture designed explicitly for physical AI.

“Enterprises need AI factories to train and refine large, multimodal foundation models for perception, world modeling, and robotics, supported by end-to-end software for data processing, orchestration, safety, and model lifecycle management,” says Docca.

But it doesn’t end there — the next stage of physical AI is pushing advanced sensory perception and physics-aware behaviors into wearables, advanced healthcare, and humanoids. As evidenced in China’s largest Lunar New Year broadcast, robotic acrobatics now exceed human capabilities.

Take robotics company Boston Dynamics’ Spot: what first went viral on Reddit as a dog-like robot is now actively deployed in physical spaces, and Boston Dynamics’ work has matured into developing commercial bipedal humanoid robots through its Atlas program. These robots with generalist training now perform a myriad of real-world industrial tasks, supported by an ever-improving core software architecture.
But is physical AI ready for the average enterprise?

As Matt Malchano, vice president of software at Boston Dynamics, tells The New Stack, “Scaling physical AI requires a robust ecosystem of hardware and software integration. When robots integrate diverse sensors, like 360-degree stereo cameras, lidar, and thermal or acoustic payloads, they can perceive environments beyond human capabilities.”

While hardware typically gets the limelight in robotics, the software brain is what makes human-like perception and reasoning possible. And recent advances in foundation models explicitly designed for the constraints of the physical world are making physical AI development more accessible.

In short, the real breakthrough in robotics isn’t hardware — it’s in the new class of foundation models physical AI requires. And real enterprise adoption will require maturity in model architecture.

Below, we’ll zoom in on these advances in physical AI models and compare them. We’ll consider the constraints and lessons learned building physical AI, and pick apart the unique differences in applications across industries.

The models underpinning physical AI

Reality can get messy, Kevin Peterson, CTO of Bedrock Robotics, an advanced autonomy company for the built world, tells The New Stack. “Robotics is all about discovering the messy reality of the world. It’s really about learning how the world works, and it’s usually not how you expect.”

He shares his experience from previously working at Waymo, training the machine learning models that power self-driving cars: “At Waymo, early on, I was surprised to see a children’s birthday party in a median between two highways. You don’t expect to see children running around near the highway, but they were there. One of our trucks was hit by lightning.”

For this reason, physical AI models must account for the tactile and often unpredictable nature of physical reality.
Several model classes, particularly local or edge-capable models, are proving effective for these purposes today. These models span behavioral control, sensory reasoning, world modeling, and beyond for robotics.

Behavioral models

(Image: The Boston Dynamics Spot robot.)

“Large behavior models (LBMs) are the state of the art at the frontier of physical AI today,” says Boston Dynamics’ Malchano, citing Atlas’s diffusion transformer behavior model as an example of the type. Rather than being programmed for a specific task, LBMs learn from a massive library of human-led demonstrations. This makes them effective for complex, tactile tasks that a traditional robot might struggle with.

“LBMs excel at whole-body coordination, stepping, avoiding obstacles, and maintaining balance while performing delicate manual work,” says Malchano. LBMs use action chunking to predict movements, he adds, making them responsive to disturbances or surprises, often exceeding human response speed in the process.

Vision language action models

Other models specialize in reasoning from sensory input. “The most effective models for physical AI at the edge are reasoning vision language action (VLA) robotics foundation models,” says Nvidia’s Docca. These models can run on onboard devices and translate sensor input and language-based commands into goals.

For example, Nvidia’s Isaac GR00T N is an open-reasoning VLA model that equips humanoid robots with generalized skills in perception, reasoning, and control. VLAs like this generalize well across tasks after fine-tuning, says Docca, yet they hinge on openness: “Models need to be open so developers can post-train these models for specific use cases using their own data.”

Bedrock Robotics’ Peterson also sees language and vision-language models emerging to conduct higher-level planning.
“These approaches leverage all the large-model improvements, like reasoning, and then translate those to robot actions by encoding a task embedding,” he says.

VLAs designed for reasoning deliver far fewer decisions per second than the low-level models that handle real-time motor control for humanoids. As such, the key will be combining the two types to maximize results. “Over the next year or two, the robotics community will get really good at combining these approaches and merging the higher- and lower-level reasoning,” says Peterson.

Open world models

Another family of models for physical AI is open-world models, which learn representations of environment dynamics to support planning and simulation. These are often necessary in automotive settings, for instance, to ingest multimodal sensors such as cameras, radar, lidar, and ultrasonics.

For example, Nvidia’s Alpamayo 1 falls into the open-world model category. Docca describes it as “a ‘thinking’ world model that can be fine-tuned and distilled into in-vehicle stacks to help cars perceive, reason, and act.” Nvidia Cosmos is another example of an open-world foundation model for world generation.

However, there are roadblocks to using open-world models, especially when applied at the edge. “There is a lot of buzz around world model style approaches like jointly predicting video with action or a trajectory,” says Peterson. “These techniques improve performance quite a bit, but can be very expensive.”

More production use will likely bring more performance enhancements. Peterson says that optimizations — such as skip-training in diffusion or performing prediction in a latent space, rather than a pixel space — will be necessary to make this class of models more feasible in production.

Denoising algorithms

One issue with physical AI models is that robots face near-endless possibilities for simple actions, such as picking things up or responding to an obstruction in their path.
This can amount to a noisy mix of signals for physical AI, leading to confusion.

“Among the most effective edge-capable models today are generative policies,” says Peterson. These include diffusion-style policies, he adds: algorithms that help a robot denoise its environment and its decision-making path. “Generative policies handle real-world uncertainty and shifting dynamics more robustly than single-shot predictors,” adds Peterson.

Specialized versus general models

Then there is the issue of size: should models be highly task-specific, or kept generalist and usable across multiple use cases? Experts diverge on this topic.

Brian Moore, co-founder and CEO of Voxel51, a visual AI platform, tells The New Stack that running at the edge is intrinsic to the problem in physical AI. “You have tight control loops, and it’s not viable to send massive amounts of sensor data to the cloud in real time.”

When you bring robotics into the real world, you expose it to rich data and high-dimensional features, including potentially irrelevant ones. That’s what makes specialized models work best, says Moore: “What’s working today are specialized, domain-specific models that can actually be deployed at the edge.”

He adds, “There’s excitement around world foundation models and vision language action models, but by definition they’re very large models that are infeasible to operate on edge hardware for many use cases.”

However, others foresee the industry moving away from highly specialized models toward more comprehensive ones, such as those powering generalist humanoids.
“Physical AI models are moving from single-task perception toward generalist specialists that can perceive, reason, plan, and act across different environments,” says Docca.

Constraints facing physical AI

Beyond the unique models for physical AI, other software and systems components are necessary to make physical AI viable at scale.

“Physical AI at scale requires a full-stack accelerated computing platform that connects AI supercomputing in the data center with real-time AI inference at the edge,” says Nvidia’s Docca. For him, physical AI at scale will require a combination of foundation models, simulation environments before production, and high-performance edge computing.

There will also be notable differences across industries. “Physical AI development varies by industry because differences in latency, mechanics, tasks, materials, and operating environments shape priorities end to end,” says Peterson. “Ultimately, success comes from aligning the model with the system dynamics and validating at scale.”

Failure tolerance differs across domains. Some areas, such as automotive, will require more intensive pre-production evaluations to address safety concerns. “Even 99.9% success is wholly unacceptable in some domains,” says Peterson. “Fixing the long tail of issues in learned systems is also somewhat new territory for the industry.”

One industry where physical AI is set to play a significant role is healthcare, says Docca, citing its use in surgical robots, medical imaging, and real-time clinical decision support. Such use cases will require low latency and high reliability to meet regulatory requirements, he adds. Factories and warehouses are actively developing physical AI alongside digital twins, too.

Regardless of industry, a big lesson learned is to separate what belongs in cloud infrastructure (best for pre-training) from what stays on the edge (actual inference).
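The stakes of that split come down to simple arithmetic. The sketch below shows how far a machine travels while waiting on a remote inference result; the speeds and latencies are illustrative assumptions, not figures from any vendor.

```python
# Back-of-the-envelope latency budget for edge versus cloud inference.
# All speeds and latencies here are illustrative assumptions.

def travel_during_decision(speed_m_s: float, latency_s: float) -> float:
    """Distance (in meters) a machine covers while waiting on an inference result."""
    return speed_m_s * latency_s

if __name__ == "__main__":
    highway_speed = 30.0   # m/s, roughly 108 km/h (assumed)
    cloud_rtt = 0.200      # assumed cloud round trip: 200 ms
    onboard_loop = 0.010   # assumed onboard inference cycle: 10 ms

    print(f"cloud:   {travel_during_decision(highway_speed, cloud_rtt):.2f} m")
    print(f"onboard: {travel_during_decision(highway_speed, onboard_loop):.2f} m")
```

Under these assumptions, a 200 ms round trip to the cloud means roughly six meters of travel before the answer arrives, versus a fraction of a meter for an onboard loop.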
As Peterson says, when it comes to large autonomous machinery, safety-first AI models that can run on the edge matter most. “You can’t make a collision-critical decision with internet latency,” he says.

The immediate future for physical AI

(Image: The Sharpa North robot that debuted at CES 2026.)

Recent advances in vision-language-action models and world foundation models have many excited. Yet there are technical constraints on placing these frontier physical AI models at the edge. System-level constraints, such as access to power, data center infrastructure, and relevant training and fine-tuning data, also remain limiting factors.

As such, the journey from physical AI experiments to production will likely follow a deliberate trajectory, says Docca: “Teams start by building high-fidelity digital twins and simulation pipelines, scale synthetic data to meet training and evaluation goals, then graduate into human-supervised deployments before scaling to fleets.”

He adds that data and simulated testing will be critical: “As AI factories scale out on next-generation hardware and AI infrastructure, enterprises will be able to continuously train and update fleets of robots, vehicles, and smart spaces using synthetic data and large-scale simulation.”

All in all, experts are optimistic about the future of physical AI and anticipate that builders will extend its possibilities in short order. As Malchano says, “Over the next two to three years, I think physical AI models will continue to expand in capability, adaptability, and real-world usefulness, opening up an increasingly broad set of applications across industries.”

The post The real breakthrough in robotics is foundation models — not hardware appeared first on The New Stack.