I tried Claude Sonnet 5 with prompts that ask it to finish the job, not just answer the question — and that's where the AI war is going


Anthropic has just released Claude Sonnet 5 for all users, and I wanted to test what it was good at. But the game has changed now. Sonnet 5 doesn't feel dramatically different from Gemini or ChatGPT if you ask it ordinary chatbot questions. Instead, the difference should show up when you stop asking for answers and start asking for completed work.

Anthropic says Sonnet 5 is built for "multi-step software engineering work," sustained coding, tool use, debugging, and "messy technical contexts." It also says it can make plans, use browsers and terminals, and run more autonomously than smaller, cheaper models previously could.

I'm not using Sonnet 5 for coding, but that doesn't mean I can't take advantage of its new abilities, just like you can. So I stopped asking Claude for answers and started asking it to finish jobs, beginning with planning a trip to Bath, UK, for my family: my wife, me, and two teens.

A trip to Bath

When I tested it, Claude Sonnet 5 defaulted to its Medium level of effort, so that's what I used. Here's the first prompt I tried:

"I want to test whether you can act more like an agent than a chatbot.

My task is: Plan a weekend trip to Bath for two adults and two teenagers, including travel, lunch, one activity, estimated costs, and what still needs booking.

Don't just give me advice. First, make a brief plan. Then identify which parts of the task you can complete yourself right now, which parts require tools or information you don't have, and which parts need human judgment.

Then complete as much of the task as possible without stopping after the first obvious answer.

At the end, give me:

- What you completed
- What still needs human action
- Any assumptions you made
- A short checklist I can use to verify the result
- The next best step"

What I really liked was that, as Claude tackled this task, it gave me the option to be notified when it had finished.
In reality, it only took a few seconds to come back with a plan, which included travel options, an itinerary, and a suggestion for lunch and something to do: a trip to The Roman Baths.

To my delight, Claude gave me an interactive map showing where all the places it recommended were. It also gave me a useful list of what it had completed, what required human action, the assumptions it had made, a verification checklist, and a "next best step" action point. It felt ready to keep working with me as more details came in, rather than treating its first answer as final.

In fact, when I gave it more details, such as which day I was going to go, it gave me a visual weather report for the day. That was a really nice touch.

Claude Sonnet 5 produced a handy map showing where to go. (Image credit: Anthropic)

Claude vs ChatGPT

I also tried this prompt with ChatGPT-5.5 Medium and got a similar result. It acted as an agent, just like Claude did, and notified me when it had finished its tasks. It just didn't look as nice. There was no map, or any visual elements at all, and it felt more like I had been given a finished report than the start of a two-way conversation where it asked me for more details.

Both chatbots recommended lunch and a trip to The Roman Baths. Interestingly, ChatGPT assumed I’d get the train, while Claude assumed I’d drive. They also recommended different places to eat, but the core information they both provided was solid.

What was most impressive was that both models could adapt when I reframed the inputs. For example, when I gave them the ages of the kids, student status, a different mode of transport, or changed the day of the trip, both models could cope. Both also identified that since the oldest was a university student, he could get free entry to The Roman Baths.

This part of the test was probably the most meaningful, as it felt much more "multi-step" than simply providing one answer.

Overall, I’d give this test to Claude.
You can clearly see that Sonnet 5 is set up for agentic actions. Neither Claude nor ChatGPT could actually do any of the booking for me at the moment, so we're still a long way from true personal-assistant-level autonomy. But for this kind of task, Claude currently has the edge.

A different domain

I wanted to test the models in a different domain that would let Claude show me it had genuinely improved, and that the Bath trip result was not just a fluke of the travel-planning use case. So I asked them both to:

"Build me a simple household budget tracker as a spreadsheet or small tool."

Both models thought for a while about this task, and churned through various options before opting to make a spreadsheet. ChatGPT produced a spreadsheet with a bar chart that tracked how much I’d spent on various household expenses against a budget. Claude, however, went for something simpler: dispensing with a budget, it just tracked actual expenses and created a pie chart showing where my money was going.

Claude’s initial approach was simpler, and easier to understand. Both models provided a .xlsx file, but only Claude provided a button to upload it straight to Google Drive so I could open it in Sheets.

I told ChatGPT, "I wanted the graph to be a pie chart," and it responded: "Absolutely — I’ll update the spreadsheet itself so the dashboard uses a pie chart for spending by category, rather than the current graph style."

It ran into a few problems because it was trying to show both the budget and actual values in the same pie chart, but eventually it worked out that it could show only one and produced a new spreadsheet that did exactly what I asked for.

I then asked Claude to change its spreadsheet to provide a budget section too, and to change the graph into a bar chart. Again, it showed me its workings and added a budget section and bar charts perfectly.

I can’t really separate the two AI models on this task.
Both proved they can handle multi-step tasks well, and both were happy to revise the result when I changed the brief.

That, really, is the point. The most interesting AI tests now are not "which chatbot gives the best answer?" They are "which assistant keeps working until the job is actually done?"

On that front, Claude Sonnet 5 feels extremely capable. ChatGPT was close behind, and in some ways just as effective, but Claude felt more naturally organized around the idea of completing work rather than simply responding to prompts. It asked fewer invisible questions, presented its output more helpfully, and made the whole process feel more like collaborating with an assistant than interrogating a chatbot.

For now, neither model is ready to fully take over the job. I still had to check the details, make the decisions, and do the actual booking or uploading myself. But the direction of travel is obvious. The AI war is no longer just about who has the smartest chatbot. It’s about who can build the assistant that gets you closest to a finished task.