When news broke that an AI agent named XBOW was leading the HackerOne bug bounty leaderboards, it quickly raised several concerning questions for the cybersecurity industry.

Have large language models evolved enough to partially or fully replace human bug hunting? How precisely does XBOW — built by a startup of the same name — work? Were humans involved in producing the output, and if so, to what extent? And ultimately: what does this mean for the future of cybersecurity and the humans who have traditionally performed these jobs?

In interviews with CyberScoop, experts from XBOW, HackerOne and the cybersecurity industry said the rapid evolution of large language models is evident in tools like XBOW. These models have quickly become highly effective at core tasks like vulnerability research, threat hunting and adversarial red-teaming. Unlike humans, they can work continuously — though at significant cost — and find bugs at much faster rates.

At the same time, they stressed that managing an AI bug hunter or red-teaming program still requires a certain amount of human input and intervention. Others said that XBOW’s success, while impressive, appears to come from racking up wins on low-level, low-impact bugs, and that the model would likely struggle with more complex vulnerabilities.
While most said XBOW’s capabilities fall short of an existential crisis for human bug hunters and red-team leaders, they also acknowledged that the balance between humans and automation in cybersecurity is shifting rapidly underneath the industry’s feet.

More machine than man

In a June 24 blog, XBOW head of security Nico Waisman claimed that the tool operates with “no human input” but also acknowledged that, given the hundreds of thousands of potential targets on HackerOne’s platform, the startup “built infrastructure on top of XBOW to help us identify high-value targets and prioritize those that would maximize our return on investment.”

Guiding XBOW and its resources also involved manual curation of bug bounty scopes and policies, a custom scoring system for the agent to follow, SimHash fingerprinting techniques and a headless browser.

XBOW founder Oege de Moor, who previously led GitHub Next, the company’s software research and development division, told CyberScoop that his startup is staffed primarily by researchers and experts in three fields: security, artificial intelligence and scalable systems. He described human involvement mainly at the start of the process, to guide and prompt, and at the end, to validate the tool’s findings, a HackerOne requirement for AI bug bounty reports.

“XBOW is a completely autonomous system, but you need to decide what you point it at, so you have to give it a URL to start with, possibly you might want to give it some additional information like credentials … at the very beginning,” de Moor said. “From that point, you select the target, you might give it some optional configuration but that’s it. Off it goes and it reports a bunch of exploits.”

HackerOne tracks top-performing bug hunters in a variety of ways, including whether they focus on vulnerability disclosure programs or bug bounty programs, and the number of bugs they’ve discovered and validated.
The leaderboards also award “reputation points” based on the quantity and complexity of bugs resolved, and assign each bug an “impact score” between 1 and 50 to convey its severity and reach.

Michiel Prins, co-founder and senior director of product management at HackerOne, told CyberScoop that some hackers and bug hunters earn a living finding lots of small bugs, while others focus on fewer, critical flaws that offer higher payouts and reputation rewards. XBOW’s output thus far, Prins said, resembles the former group: a high volume of bugs with an average impact score of around 17, reflecting a focus on lower- to medium-severity issues.

Speaking generally about tools like XBOW, Prins said “what we see is that they excel in volume … [but] it does not yet excel in business impact.”

“It’s a workflow, and there’s a loophole in the workflow so that an adversary can accomplish something that is unintended,” he continued. “That is very hard for an AI to find, because the AI needs to really understand the intent of the application, the business context it operates in, and the whole environment around it.”

That sentiment was shared by other cybersecurity practitioners. Amélie Koran, who has worked on cybersecurity for Walmart, Electronic Arts and the federal government, said the tool’s record doesn’t suggest that it can replace humans on more difficult cybersecurity problems.

“Looking at their profile on HackerOne, their badges are some of the more basic things you can find with automation: data leaks, XML exposure, cross-site scripting, command injection and access control,” she told CyberScoop. “I wouldn’t be so mean as to say these are rudimentary finds, but all of this is much more ‘surface material’ as opposed to more in-depth campaigns.”

For his part, de Moor disagrees with this characterization, saying the company intends to release examples of higher-complexity bugs XBOW has found in the coming weeks.

While XBOW sits atop the U.S.
HackerOne leaderboards, multiple sources critiqued the idea of comparing the work of a tool managed by a company with the output of individual bug hunters. Even HackerOne has had to grapple with this problem: the company recently altered its leaderboards to split bounty rankings for individuals from those for companies like XBOW, in an effort to address such complaints.

“XBOW is a company, [and] there’s multiple people working behind it,” Prins said, explaining the decision. “There’s venture funding behind the company, there’s AI involved — that’s not unique, lots of hackers have AI in their toolkit — but it’s a company, it’s not just one person.”

Right now, XBOW is operating in the red. While many projects and payouts remain in the pipeline, de Moor said the earnings it has generated hunting bugs thus far are less than the cost of running the tool, which is “quite compute intensive and not cheap.”

For this reason, the program is given a “time budget” for solving certain tasks, and if a task takes more than 100 attempts, that’s a sign the model needs some tweaking by engineers — what de Moor calls “AI magic” — to make it more efficient. Like others, he believes this will change as improved data center infrastructure makes AI tools like XBOW more affordable and practical.

Capture the bag

So how did XBOW get to the top of the leaderboards to begin with? It stems from improvements in LLMs’ ability to solve cybersecurity-specific problems.

Most cyber professionals have taken part in “Capture the Flag” (CTF) challenges, in which they are given a series of security-related puzzles and exploit vulnerabilities to “capture” a piece of data. XBOW was originally trained on CTFs, and de Moor and others told CyberScoop that LLM technology has come a long way in solving these types of challenges.
He estimated that a year ago, cutting-edge LLM programs were capable of solving only around 16% of the CTF challenges they were given, and “only really quite simple ones.”

But that has rapidly changed over the past year, and some AI cybersecurity experts said they believe CTF-like challenges provide excellent foundational training for cybersecurity models.

A recent study from DreadNode, an offensive security machine-learning platform, exemplifies this progress. The study found that some frontier LLMs, like Anthropic’s Claude, can now solve complex CTF challenges “with remarkable efficiency — completing in minutes what typically takes humans hours or days.”

Many of the capabilities demonstrated across these challenges translate to different functions in cybersecurity, including AI red-teaming and penetration testing, bug bounty hunting, vulnerability management and more effective monitoring of LLM-driven security threats. The models are far from dominant — Claude was only able to solve 43 of the 70 measured challenges — but their success rate has steadily improved in ways that make these tools more useful across different cybersecurity tasks.

Will Pearce, DreadNode’s founder, told CyberScoop that the findings reflect how automation and AI tools are becoming commonplace in many cybersecurity jobs and functions, converging around a process that is “still human directed” but at a higher level of abstraction.

“Whether it’s red-teaming or whether it’s bug hunting, whether it’s network ops or vulnerability discovery … anything you might want to do in cyber, you really just have this slow march towards an outcome that you want,” Pearce said.

Notably, all the models tested flunked the two tasks that are the most time-consuming for humans to solve, suggesting that some aspects of security still require human ingenuity.

De Moor said XBOW was also trained on CTF challenges, and the company developed a custom scoring system that allowed it to port that overall process to XBOW’s
broader work hunting for vulnerabilities.

Because CTF challenges tend to have binary results — you either obtain the flagged data or you don’t — they help cut down on one of the biggest problems LLMs bring to the table: hallucinations.

But they don’t eliminate them. De Moor said XBOW’s false-positive rate now fluctuates between 0% and 10% depending on the type of vulnerability it’s working on, but stressed that every bug reported to HackerOne has been validated.

The future of cybersecurity?

Tools like XBOW represent a notable milestone for the cybersecurity industry, demonstrating substantive capabilities that could offer real business value — provided compute costs come down — in the near future.

But veteran bug hunters aren’t stressing or rushing to branch out into other fields.

Casey Ellis, founder and adviser at Bugcrowd, another major bug bounty platform, told CyberScoop that XBOW appears to have been designed primarily as a web application penetration testing tool, with a workflow that is “autonomous within the scope you set for it.”

“In general, the kinds of vulnerabilities it [and other semi-autonomous hacking agents] can find vary pretty wildly in impact, but they share a common attribute: They are relatively easy to test for, and easy to programmatically confirm,” Ellis said. “AI-driven hacking tools are naturally inclined to being effective at this broad characteristic within vulnerabilities, mostly because LLMs are very good at working with firm instructions and clear feedback loops.”

Ellis doesn’t downplay the value of that kind of work. He noted that the internet is full of errors that allow for cross-site scripting, server-side request forgeries, exposed secrets and other programmatically predictable bugs.
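Ellis’s point about bugs that are “easy to programmatically confirm” comes down to the same binary reproduce-or-discard loop that makes CTF training useful: a candidate finding only counts if its exploit can be re-run and shown to work, which is what filters out hallucinated reports before they reach a platform like HackerOne. Here is a minimal, hypothetical sketch of that filtering idea (the `Finding` and `validated` names are illustrative and not XBOW’s actual code):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    description: str
    # Binary reproduction check: returns True only if the exploit
    # actually fires against the target -- the "flag captured?" signal.
    reproduce: Callable[[], bool]


def validated(findings: List[Finding]) -> List[Finding]:
    """Keep only findings whose exploit reproduces; hallucinated
    findings fail the binary check and are dropped."""
    return [f for f in findings if f.reproduce()]


# Example: one finding that reproduces, one hallucinated one that doesn't.
candidates = [
    Finding("reflected XSS on /search", lambda: True),
    Finding("imagined SQL injection on /login", lambda: False),
]
confirmed = validated(candidates)
```

The clear pass/fail feedback loop is what makes this class of bug well suited to LLM-driven tooling, per Ellis’s comments above; vulnerabilities whose exploitation depends on business context rarely admit such a crisp check.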
These programs do best when placed in “target rich environments for systems that can run 24/7 without sleep, and that are 100% optimized for their discovery.”

Ellis believes that systems like XBOW will create more competition for human bug hunters at the initial discovery phase, comparing it to the emergence of external attack surface management platforms a decade ago, which made it easier for practitioners to automate attack surface monitoring.

But he doesn’t see AI bug hunting completely replacing humans anytime soon, noting that the discovery phase of bug bounty work “isn’t the hard part” and that the internet and software will remain riddled with enough security vulnerabilities to keep both man and machine occupied.

“There’s plenty left behind and new vulnerabilities being introduced daily,” he said, “and the role for bounty hunters and researchers is to learn and understand what these systems are good at, what they aren’t, and where there’s an opportunity to complement human with machine.”

The post Is XBOW’s success the beginning of the end of human-led bug hunting? Not yet. appeared first on CyberScoop.