The London Agentic AI Meetup drew over 1,200 registrations for 150 seats. The demand wasn't driven by hype — it was driven by a question most teams are getting wrong. Here's what the evening revealed about agentic coding in practice.
Stop Asking "Which Agent Tool Should We Use?"
On February 5th, London Agentic AI hosted its fourth meetup at Google's campus on Bonhill Street. The topic: Agentic Coding Architectures — Tools, Models, and Harnesses. The speakers were practitioners from Google DeepMind. The audience was engineers building agentic systems in production.
I traveled to London for it. It was worth the trip.
Not because the event was flashy. It wasn't. It was worth it because it exposed how much operational rigor sits underneath the demos we see on social media — and how far most teams still are from internalizing that rigor.
The Wrong Question
Most teams entering the agentic coding space start with the same question: "Which agent tool should we use?"
It feels like the right question. The market is crowded. Google alone offers AI Studio, Jules, Gemini CLI, Gemini Code Assist, Firebase Studio, and the newly announced Antigravity platform. Beyond Google, there's Claude Code, GitHub Copilot, Cursor, Windsurf, Augment Code, Amp, and more. New tools drop every few weeks.
But "which tool?" is a question about inputs. The better question is about outcomes: "What workflow bottleneck are we trying to solve, and what level of human oversight does it require?"
That reframe shaped the entire evening at DeepMind.
Google's Ecosystem: Many Tools, No Single Winner
Ian Ballantyne walked through Google's agentic coding ecosystem, and the most striking takeaway was the intentional fragmentation. Google doesn't have one coding agent. It has several, each designed for a different pattern of human-AI collaboration.
Google itself frames this as a spectrum. On one end sits supervised collaboration — tools like Gemini Code Assist inside your IDE, where the AI acts like a team member you're actively directing. In the middle sits the Gemini CLI, an open-source terminal agent for interactive coding, debugging, and task management. On the other end sits Jules, an autonomous agent that works asynchronously in a virtual machine — cloning repos, installing dependencies, modifying files, and delivering pull requests while you do something else.
The key insight: these aren't competing products. They're designed for different trust thresholds and different costs of being wrong. An IDE assistant is appropriate when you need tight control and immediate feedback. An autonomous agent like Jules makes sense when the task is well-scoped, clearly articulated, and the blast radius of a mistake is contained.
No single tool wins across all scenarios. The architecture decision — which pattern of human-AI collaboration to apply — matters more than the model powering it.
Spec Clarity Beats Prompt Cleverness
Ricardo Sueiras' talk on Spec-Driven Development reinforced a principle that sounds obvious but is widely ignored: if you can't define your idea clearly, the agent can't build the system.
Spec-Driven Development has emerged as one of the most important practices in agentic coding. The core idea is that well-crafted specifications — not casual prompts — should be the source of truth that guides AI code generation. The workflow follows a deliberate sequence: specify, plan, decompose into tasks, implement.
This is the inverse of "vibe coding," where you describe a goal loosely and hope the agent figures it out. Vibe coding works for quick prototypes. It breaks down for production systems, complex codebases, and anything where the cost of rework is high.
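The specify → plan → decompose → implement sequence can be sketched as a minimal data model. Everything below (`Spec`, `Task`, `plan_from_spec`) is a hypothetical illustration invented for this sketch, not the format of Spec Kit or any real toolkit:

```python
from dataclasses import dataclass

# Illustrative sketch of spec-driven development. The point is the shape:
# intent, constraints, and scope are stated up front, then decomposed into
# small, independently verifiable tasks before any code is generated.

@dataclass
class Task:
    description: str
    acceptance_criteria: list[str]
    done: bool = False

@dataclass
class Spec:
    goal: str               # what the system must do, stated precisely
    constraints: list[str]  # non-negotiables: stack, performance, security
    out_of_scope: list[str] # explicitly excluded, so the agent doesn't guess

def plan_from_spec(spec: Spec) -> list[Task]:
    """Decompose a spec into verifiable tasks. In practice an agent
    proposes this plan and a human reviews it before implementation."""
    return [
        Task(
            description=f"Implement: {spec.goal}",
            acceptance_criteria=spec.constraints,
        )
    ]

spec = Spec(
    goal="Expose a /health endpoint returning the build version",
    constraints=["respond in under 50ms", "no auth required"],
    out_of_scope=["metrics aggregation"],
)
tasks = plan_from_spec(spec)
```

Note what the structure forces: the `out_of_scope` field exists precisely because agents fill ambiguity with plausible guesses. Saying what *not* to build is as load-bearing as saying what to build.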
GitHub released Spec Kit, an open-source toolkit for spec-driven workflows, noting that the problem with most agent interactions isn't the coding ability — it's the approach. As Thoughtworks observed in their analysis, the practice is still emerging, but the direction is clear: as context windows grow and agents become more capable, the quality of what you feed them determines the quality of what comes out.
For product managers, this is a familiar principle wearing new clothes. Requirements have always mattered. The difference now is that vague requirements don't just slow down engineers — they generate plausible-looking code that compiles, passes superficial checks, and quietly misses the actual intent. The cost of ambiguity has gone up, not down.
Orchestration and Harnesses Over Model IQ
One theme ran through every talk and into the panel: the infrastructure around the model matters more than the model itself.
Orchestration — how you chain agent actions, manage context, handle errors, and maintain state across multi-step workflows — is where production systems succeed or fail. Harnesses — the evaluation frameworks, guardrails, and feedback loops that wrap agent execution — are what separate a demo from a deployable system.
This aligns with what Anthropic has documented in their research on building agentic systems: there are fundamental patterns (prompt chaining, routing, orchestrator-workers, evaluator-optimizer loops) and the right pattern depends on the task, not the model. The temptation is to throw a more powerful model at a problem. The more reliable approach is to improve the harness.
The panel discussion with KP Murphy-Sawhney and Matthew Mauger, both senior software engineers at Google DeepMind, went deep into the real-world constraints of deploying agentic coding systems. The conversation was grounded in production realities: latency, error recovery, context management, security.
Evaluation Is the Real Bottleneck
If there was a single uncomfortable truth that the evening made explicit, it was this: evaluation, not generation speed, is the bottleneck in agentic coding.
Agents can generate code fast. The hard part is knowing whether that code is correct, secure, maintainable, and aligned with intent. Current benchmarks are catching up, but the gap is significant. Research from FeatureBench shows that frontier models achieving over 70% on standard bug-fixing benchmarks like SWE-bench succeed on only 11% of feature-level development tasks — the kind of complex, multi-file work that real engineering involves.
For product managers, this has direct implications. The velocity gains from agentic coding are real, but they shift the bottleneck from writing code to reviewing it. Every failed quality check requires human intervention. When combined with the volume of agent-produced code, that verification step can overwhelm teams that haven't planned for it.
The practical takeaway: if you're introducing agentic coding tools, invest at least as much in your evaluation and review process as you do in the tools themselves. Code generation without reliable evaluation is just faster technical debt.
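What such an evaluation step might look like, in miniature: a gate that runs every check over agent-produced code and collects all failures. This is a sketch assuming checks can be expressed as functions; a real gate would invoke your test suite, linters, and security scanners:

```python
# Minimal review gate for agent-generated code. Checks are plain functions
# returning (ok, reason); the gate collects every failure rather than
# stopping at the first, so a reviewer sees the full picture in one pass.

def gate(code: str, checks) -> tuple[bool, list[str]]:
    failures = []
    for check in checks:
        ok, reason = check(code)
        if not ok:
            failures.append(reason)
    return not failures, failures

def parses(code: str):
    # Does the code even parse? The cheapest possible check.
    try:
        compile(code, "<agent-output>", "exec")
        return True, ""
    except SyntaxError as e:
        return False, f"syntax error: {e.msg}"

def no_bare_except(code: str):
    # A toy style rule standing in for a real linter.
    ok = "except:" not in code
    return ok, "" if ok else "bare except clause"

accepted, reasons = gate("def f():\n    return 1\n", [parses, no_bare_except])
```

Even this toy version makes the economics visible: every check you automate is a review cycle a human doesn't spend on agent output.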
Diversity Improved the Conversation
One observation that stuck with me: the toughest reliability and security questions during the Q&A came from perspectives we don't see enough on stage at AI events. Intentional diversity in the room didn't just check a box — it materially improved the quality of the discussion.
Shashi Jagtap, who organizes London Agentic AI, has built something rare: a builder-first, technically deep community that has grown to over 1,400 members in its first months. The demand for this event — 1,200 registrations for 150 seats — signals that practitioners are hungry for substance over spectacle.
What This Means for Product Managers
Agentic coding is still early. There are no stable best practices. It comes down to experimenting and figuring out what works for your specific workflow, constraints, and risk tolerance.
But the signal from that evening at DeepMind was clear, and it maps directly to how product managers should think about adoption:
Don't chase tools. The market will keep shifting. What matters is understanding the patterns of human-AI collaboration — supervised, interactive, autonomous — and matching them to your team's trust threshold and the cost of being wrong.
Spec clarity is a product skill. Spec-Driven Development isn't just an engineering practice. It's a requirement-writing discipline. Product managers who can define intent precisely will get dramatically better results from agentic tools than those who rely on loose prompts.
Orchestration matters more than model selection. The harness around the agent — evaluation, error handling, context management — determines whether you ship quality or ship debt. Advocate for investment in infrastructure, not just tooling.
Evaluation is your responsibility. If your team adopts agentic coding without a clear plan for reviewing and validating output, you're trading speed for risk. Build the review process before you scale the generation.
Pick one bottleneck. Run a tight experiment. Measure impact. Don't try to transform your entire SDLC. Identify a single workflow friction point, apply an agentic tool to it, and measure whether it actually improves outcomes — not just output volume.
Start with the Bottleneck, Not the Buzzword
The uncomfortable truth is that coding with AI is still early. The tooling is evolving weekly. The evaluation landscape is fragmented. The best practices are being written in real time by teams willing to experiment, fail, and document what they learn.
What I took away from that evening wasn't a recommendation for a specific tool or framework. It was a framework for thinking about adoption: understand the collaboration pattern, invest in spec quality, build the evaluation infrastructure, and measure outcomes rigorously.
If you're building with agents, the first question isn't "which tool?" It's "what decision are we validating first?"
Start there.