AI On: 3 Ways to Bring Agentic AI to Computer Vision Applications

Editor’s note: This post is part of the AI On blog series, which explores the latest techniques and real-world applications of agentic AI, chatbots and copilots. The series also highlights the NVIDIA software and hardware powering advanced AI agents, which form the foundation of AI query engines that gather insights and perform tasks to transform everyday experiences and reshape industries.

Today’s computer vision systems excel at identifying what happens in physical spaces and processes, but lack the abilities to explain the details of a scene and why they matter, as well as reason about what might happen next.

Agentic intelligence powered by vision language models (VLMs) can help bridge this gap, giving teams quick, easy access to key insights and analyses that connect text descriptors with spatial-temporal information and billions of visual data points captured by their systems every day.

Three approaches organizations can use to boost their legacy computer vision systems with agentic intelligence are to apply dense captioning for searchable visual content, augment system alerts with detailed context, and use AI reasoning to summarize information from complex scenarios and answer questions.

Making Visual Content Searchable With Dense Captions

Traditional convolutional neural network (CNN)-powered video search tools are constrained by limited training, context and semantics, making gleaning insights manual, tedious and time-consuming. CNNs are tuned to perform specific visual tasks, like spotting an anomaly, and lack the multimodal ability to translate what they see into text.

Businesses can embed VLMs directly into their existing applications to generate highly detailed captions of images and videos.
These captions turn unstructured content into rich, searchable metadata, enabling visual search that’s far more flexible and not constrained by file names or basic tags.

For example, automated vehicle-inspection system UVeye processes over 700 million high-resolution images each month to build one of the world’s largest vehicle and component datasets. By applying VLMs, UVeye converts this visual data into structured condition reports, detecting subtle defects, modifications or foreign objects with exceptional accuracy and reliability for search.

VLM-powered visual understanding adds essential context, ensuring transparent, consistent insights for compliance, safety and quality control. UVeye detects 96% of defects compared with 24% using manual methods, enabling early intervention to reduce downtime and control maintenance costs.

https://blogs.nvidia.com/wp-content/uploads/2025/11/UVeye-video-1.mp4

Relo Metrics, a provider of AI-powered sports marketing measurement, helps brands quantify the value of their media investments and optimize their spending. By combining VLMs with computer vision, Relo Metrics moves beyond basic logo detection to capture context, like a courtside banner shown during a game-winning shot, and translate it into real-time monetary value.

This contextual-insight capability highlights when and how logos appear, especially in high-impact moments, giving marketers a clearer view of return on investment and ways to optimize strategy. For example, Stanley Black & Decker, including its Dewalt brand, previously relied on end-of-season reports to evaluate sponsor asset performance, limiting timely decision-making. Using Relo Metrics for real-time insights, Stanley Black & Decker adjusted signage positioning and saved $1.3 million in potentially lost sponsor media value.

Augmenting Computer Vision System Alerts With VLM Reasoning

CNN-based computer vision systems often generate binary detection alerts such as yes or no, and true or false.
Without the reasoning power of VLMs, that can mean false positives and missed details, leading to costly mistakes in safety and security, as well as lost business intelligence.

Rather than replacing these CNN-based computer vision systems entirely, VLMs can easily augment these systems as an intelligent add-on. With a VLM layered on top of CNN-based computer vision systems, detection alerts are not only flagged but reviewed with contextual understanding, explaining where, how and why the incident occurred.

For smarter city traffic management, Linker Vision uses VLMs to verify critical city alerts, such as traffic accidents, flooding, or falling poles and trees from storms. This reduces false positives and adds vital context to each event to improve real-time municipal response.

https://blogs.nvidia.com/wp-content/uploads/2025/11/Updated-VLM-1-1.mp4

Linker Vision’s architecture for agentic AI involves automating event analysis from over 50,000 diverse smart city camera streams to enable cross-department remediation, coordinating actions across teams like traffic control, utilities and first responders when incidents occur. The ability to query across all camera streams simultaneously enables systems to quickly and automatically turn observations into insights and trigger recommendations for next best actions.

Automatic Analysis of Complex Scenarios With Agentic AI

Agentic AI systems can process, reason and answer complex queries across video streams and modalities, such as audio, text, video and sensor data. This is possible by combining VLMs with reasoning models, large language models (LLMs), retrieval-augmented generation (RAG), computer vision and speech transcription.

Basic integration of a VLM into an existing computer vision pipeline is helpful in verifying short video clips of key moments.
However, this approach is limited by how many visual tokens a single model can process at once, resulting in surface-level answers without context over longer time periods and external knowledge.

In contrast, whole architectures built on agentic AI enable scalable, accurate processing of lengthy and multichannel video archives. This leads to deeper, more accurate and more reliable insights that go beyond surface-level understanding. Agentic systems can be used for root-cause analysis or analysis of long inspection videos to generate reports with timestamped insights.

Levatas develops visual-inspection solutions that use mobile robots and autonomous systems to enhance safety, reliability and performance of critical infrastructure assets such as electric utility substations, fuel terminals, rail yards and logistics hubs. Using VLMs, Levatas built a video analytics AI agent to automatically review inspection footage and draft detailed inspection reports, dramatically accelerating a traditionally manual and slow process.

For customers like American Electric Power (AEP), Levatas AI integrates with Skydio X10 devices to streamline inspection of electric infrastructure. Levatas enables AEP to autonomously inspect power poles, identify thermal issues and detect equipment damage.
Alerts are sent instantly to the AEP team upon issue detection, enabling swift response and resolution, and ensuring reliable, clean and affordable energy delivery.

https://blogs.nvidia.com/wp-content/uploads/2025/11/Levatas-Compressed.mp4

AI gaming highlight tools like Eklipse use VLM-powered agents to enrich livestreams of video games with captions and index metadata for rapid querying, summarization and creation of polished highlight reels in minutes, 10x faster than legacy solutions, leading to improved content consumption experiences.

https://blogs.nvidia.com/wp-content/uploads/2025/11/Eklipse-Compressed.mp4

Powering Agentic Video Intelligence With NVIDIA Technologies

For advanced search and reasoning, developers can use multimodal VLMs such as NVCLIP, NVIDIA Cosmos Reason and Nemotron Nano V2 to build metadata-rich indexes for search.

To integrate VLMs into computer vision applications, developers can use the event reviewer feature in the NVIDIA Blueprint for video search and summarization (VSS), part of the NVIDIA Metropolis platform.

For more complex queries and summarization tasks, the VSS blueprint can be customized so developers can build AI agents that access VLMs directly or use VLMs in conjunction with LLMs, RAG and computer vision models. This enables smarter operations, richer video analytics and real-time process compliance that scale with organizational needs.

Learn more about NVIDIA-powered agentic video analytics.

Stay up to date by subscribing to NVIDIA’s vision AI newsletter, joining the community and following NVIDIA AI on LinkedIn, Instagram, X and Facebook. Explore the VLM tech blogs, and self-paced video tutorials and livestreams.
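
For readers who want a concrete picture of how dense captions can become searchable metadata, here is a minimal Python sketch. It is an illustration under stated assumptions, not the VSS blueprint or any NVIDIA API: the `Caption` and `CaptionIndex` names, the camera IDs and the captions themselves are all hypothetical, the captions are hand-written stand-ins for VLM output, and plain keyword matching stands in for the embedding-based search a production system would more likely use.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    """One dense caption for a time segment (all fields are illustrative)."""
    video_id: str   # hypothetical camera or clip identifier
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    text: str       # caption text; in practice this would come from a VLM


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class CaptionIndex:
    """A tiny inverted index mapping each word to the captions containing it."""

    def __init__(self) -> None:
        self._captions: list[Caption] = []
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, caption: Caption) -> None:
        position = len(self._captions)
        self._captions.append(caption)
        for word in tokenize(caption.text):
            self._postings[word].add(position)

    def search(self, query: str) -> list[Caption]:
        """Return captions containing every query word, in insertion order."""
        words = tokenize(query)
        if not words:
            return []
        # .get avoids creating empty entries for unseen words.
        matches = [self._postings.get(word, set()) for word in words]
        hits = set.intersection(*matches)
        return [self._captions[i] for i in sorted(hits)]


def demo_index() -> CaptionIndex:
    """Build an index from hand-written captions standing in for VLM output."""
    index = CaptionIndex()
    index.add(Caption("cam_12", 0.0, 8.0,
                      "A white van is parked in the loading bay next to a forklift"))
    index.add(Caption("cam_12", 8.0, 16.0,
                      "A worker without a helmet walks past the forklift"))
    index.add(Caption("cam_31", 40.0, 48.0,
                      "A fallen tree blocks the right lane of a flooded street"))
    return index


if __name__ == "__main__":
    for hit in demo_index().search("forklift worker"):
        print(f"{hit.video_id} [{hit.start_s:.0f}s-{hit.end_s:.0f}s]: {hit.text}")
```

Because each caption keeps its video ID and time range, a text query returns moments in the footage rather than file names, which is the flexibility dense captioning is meant to provide. Swapping the keyword index for vector embeddings would let queries match on meaning instead of exact words.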