Human Cognition Can’t Keep Up With Modern Networks. What’s Next?

IBM, the venerable tech company, has been on an acquisition binge in the last few years, buying Red Hat and HashiCorp, and it recently announced plans to buy Confluent. There’s a method to this shopping spree, according to Sanil Nambiar, client engagement lead, AI for networks, at IBM: assembling the infrastructure organizations will need for AI.

“The strategy, obviously, is hybrid cloud, data and AI and automation working together as an architecture,” Nambiar told me in this episode of The New Stack Makers.

IBM has invested in what he calls “three foundational platforms” because each offers capabilities essential to AI infrastructure. Red Hat, a hybrid cloud platform, is needed “for that consistent runtime across on-prem and cloud,” he said. HashiCorp offers “life cycle control and policy-driven automation.” And Confluent is for “real-time, contextual, trustworthy data access for AI.”

All of these platforms are needed, Nambiar said, because “AI does not sit on top of chaos and magically fix it. You really need environments which are consistent, infrastructure that is programmable, data that moves in real time.”

The Core Challenges of Modern Network Operations

The new complexity AI introduces has added to the challenges network operations teams face, Nambiar said. He cited conversations he and his IBM colleagues have had with customers over the last couple of years. One alarming takeaway: “Modern, distributed and software-defined networks have outstripped human cognition because of their increasing complexity.”

The tsunami of data is a big issue, he added, leaving teams to decipher “all of these data silos and the data fragmentation that comes with this network complexity.”

And then there’s the ever-present skills gap, worsened by the need to learn new, AI-native tools. “Monitoring tools themselves have become so sophisticated that they themselves require deep knowledge,” Nambiar said. Veteran ops engineers “have this intuitive knowledge, which is really hard to replicate in systems today. They have this tribal knowledge. It’s really not transferable to new hires.” In short, he said, “You can’t throw people at this problem.”

Why Trust Is the Biggest Hurdle for AI in Networking

The biggest issue, he said, is trust: being able to trust AI tools to do the job for which they’re built. Data silos, data fragmentation and a lack of real-time data lead to an erosion of trust, Nambiar said.

“In production environments, the cost of being wrong is really, really high,” he said. “A bad recommendation can cause an outage, an SLA breach, or trigger irreversible change. So when customers ask, ‘What can your AI do for me?’ that’s not the right question. They should actually ask, ‘Can I actually trust [it] when it matters?’”

How Agentic AI Can Proactively Prevent Network Outages

Nambiar has spent 20 years working in networking and network operations. He’s learned that “major outages don’t just start suddenly. They are preceded by very subtle operational patterns.

“You see a gradual increase in drops or retransmits. There are these rising latency variances. There is a queue buildup here. There is a resource pressure there. There are repeated micro failures that never cross those static thresholds.”
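To make that concrete, here is a minimal sketch, assuming illustrative threshold and slope values and no particular vendor tooling, of how a rolling-trend check can catch the kind of gradual retransmit creep Nambiar describes even though a fixed alert threshold never fires.

```python
# Minimal sketch (illustrative values, no vendor APIs): a static threshold
# misses a slow climb in retransmits, while a rolling-slope check flags it.
from statistics import mean

STATIC_THRESHOLD = 500   # assumed alert threshold, retransmits per minute
WINDOW = 30              # minutes of history to inspect
SLOPE_ALERT = 2.0        # assumed: sustained rise > 2 retransmits/min per minute

def rolling_slope(samples):
    """Least-squares slope of the most recent per-minute counts."""
    xs = range(len(samples))
    x_bar, y_bar = mean(xs), mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs) or 1
    return num / den

def evaluate(retransmits_per_min):
    recent = retransmits_per_min[-WINDOW:]
    static_alert = max(recent) > STATIC_THRESHOLD      # classic threshold check
    trend_alert = rolling_slope(recent) > SLOPE_ALERT  # subtle-pattern check
    return static_alert, trend_alert

# Counts creep from 100 to roughly 190 over 30 minutes: nowhere near the
# static threshold, but exactly the gradual pattern that precedes trouble.
print(evaluate([100 + 3 * i for i in range(30)]))   # -> (False, True)
```

The same idea applies to latency variance or queue depth: the signal lives in the trend across a window, not in any single sample crossing a line.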
And when an incident does occur, he noted, “most of the time is actually lost before anyone starts fixing anything. Teams spend a lot of time gathering the context, reconstructing the timelines, figuring out what changed, determining who needs to be involved. An AI agent can actually collapse all of these hours of work into seconds.”

IBM Network Intelligence: A Network-Native AI Solution

But it’s not enough to just speed up existing processes with the help of an AI agent, he said. In September, IBM released IBM Network Intelligence, which the company calls a “network-native AI solution.”

“We’ve built IBM Network Intelligence around those principles of proper trust, [and large language model] scaffolding, making sure that we have the right AI tool for the right task,” Nambiar said.

The solution pairs LLM reasoning with time-series foundation models. “For instance, LLMs are really not very good at understanding time and any temporal structures or multifamily causalities that exist in networks — and networks are all about time.

“So if we can decouple the architecture to use a time-series foundation model, which can really understand time, you can pretrain it to a particular domain — say, an MPLS domain or a data center domain or a radio access domain — and then the accurate observations that come out of it can be fed into a reasoning LLM with access to agents and context and grounding documents and dynamic tool calling, etc.

“Then you have an architecture which is more accurate. It’s scaffolded with trust, and we can start to use that in operation.” That decoupled flow is sketched below.

Check out the full episode to delve deeper into the challenges of network operations in the AI era and how junior engineers might fit into the agentic future.
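To illustrate that decoupling, here is a conceptual sketch under stated assumptions: both stages below are stand-in stubs, not IBM Network Intelligence APIs or real foundation models. The time-series stage turns raw counters into temporally grounded observations, and the reasoning stage works only from those observations plus grounding documents and tool calls.

```python
# Conceptual sketch only: the two stages below are stubs standing in for a
# time-series foundation model and a reasoning LLM, not real product APIs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Observation:
    metric: str    # e.g. "retransmits" on a monitored interface
    window: str    # time range in which the pattern was seen
    finding: str   # what the time-series stage detected

def time_series_stage(telemetry: dict, domain: str) -> list:
    """Stand-in for a foundation model pretrained on one domain (MPLS,
    data center, radio access): it emits findings, not raw counters."""
    return [Observation(metric, "last 30 min", f"sustained upward drift in {metric}")
            for metric, series in telemetry.items()
            if series[-1] > 1.5 * series[0]]          # toy drift heuristic

def reasoning_stage(observations: list,
                    grounding_docs: list,
                    tools: dict) -> str:
    """Stand-in for the reasoning LLM with tool calling: it sees only the
    observations, the grounding documents and whatever the tools return."""
    findings = "; ".join(o.finding for o in observations)
    changes = tools["recent_changes"]()               # dynamic tool call
    return f"Findings: {findings}. Recent changes: {changes}. Docs consulted: {len(grounding_docs)}."

telemetry = {"retransmits": [100, 130, 170], "latency_ms": [5, 5, 6]}
tools = {"recent_changes": lambda: "routing policy updated on edge-1"}
print(reasoning_stage(time_series_stage(telemetry, "data_center"),
                      ["congestion runbook"], tools))
```

The design point the sketch tries to capture is the separation of concerns: the temporal model is judged on the accuracy of its observations, and the language model is judged on how well it reasons over them, which is what makes the pipeline easier to trust in operations.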