From the Bag to the Cloud: How We Built a Real-Time AI Boxing Coach
This article explains how we built Cornerman AI, a real-time AI boxing coach that delivers live voice guidance and basic visual feedback during solo training. It shares key technical lessons about using multimodal AI, cloud architecture, and dedicated vision models for fast sports movement. Ultimately, it shows that successful live AI products depend less on raw model power and more on smart system orchestration.

Note: This project was built during the Gemini Live Agent Challenge. Details about the scope of the submitted project can be found here. The article below also discusses concepts that were not presented during the hackathon or fully implemented at that time.
There’s a particular frustration every fighter knows.
You’re standing in front of the bag. You throw a jab. Then a cross. Maybe a hook. You move, reset, fire again. And then, somewhere in the middle of the round, momentum breaks. Not because you’re exhausted, but because you’ve run out of direction. There’s no coach in your ear, no correction between combinations, no one to tell you what to throw next or whether your right hand is coming back to guard.
That moment — small, familiar, easy to overlook — was the starting point for Cornerman AI.
Combat sports are practiced by millions of people recreationally, while access to consistent one-on-one coaching remains limited. Research on combat sports participation reflects that gap between the scale of participation and the availability of expert supervision. For most athletes, especially outside elite environments, a large share of training happens alone. That means repetition without feedback, effort without correction, and improvement that depends as much on luck as on structure.
We wanted to see whether recent advances in real-time multimodal AI could narrow that gap.
Not in theory. In a gym.
The idea: what if the coach could stay live?
At the time, we had been watching the evolution of live multimodal models closely. Google’s Gemini Live API had reached a point where real-time, stateful interaction over WebSockets (including voice, vision, and tool calling) was finally practical enough to build against. The API is explicitly designed for low-latency, continuous sessions rather than one-shot prompt/response exchanges.
That distinction mattered immediately.
A boxing round is not a chatbot session. It doesn’t pause politely while the user waits for a paragraph. It is fast, messy, physical, and cognitively overloaded. Any product that wanted to belong in that environment had to work hands-free, react in real time, survive interruptions, and still feel natural under pressure.
So we asked a simple question:
What happens if you put a live multimodal AI agent in the middle of a boxing session?
That question became Cornerman AI.
What we built
Cornerman AI is a real-time AI boxing coach designed to work during live training.
The experience is intentionally simple. The fighter opens the app, starts a session, sets up the round, and begins working. From there, the interaction becomes hands-free.
The coach calls out combinations, pacing cues, and motivation through live voice. The athlete can interrupt mid-round with something like “break it down”, and the system pauses, explains the sequence, then resumes without losing context. Meanwhile, the camera watches for slower visual signals, such as a dropped guard, inactivity, or obvious stance drift, and triggers corrections when confidence is high enough.
The ambition was never to make a model “understand boxing” in the abstract. The ambition was to build something that behaved usefully inside the tempo of a round.
That turned out to be a very different problem.
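To make the pause, explain, resume behaviour concrete, here is a minimal sketch of round state that survives an interruption. It is an illustration under assumptions, not the Cornerman AI implementation: the names `RoundSession`, `Phase`, `next_call`, `interrupt`, and `resume` are hypothetical, and the combination strings are placeholders.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Phase(Enum):
    ROUND = auto()       # coach is calling combinations
    EXPLAINING = auto()  # round is paused for a breakdown


@dataclass
class RoundSession:
    """Hypothetical round state; the index is what keeps context across a pause."""
    combos: List[str]
    index: int = 0
    phase: Phase = Phase.ROUND

    def next_call(self) -> str:
        """Return the next combination to call out; only valid mid-round."""
        if self.phase is not Phase.ROUND:
            raise RuntimeError("round is paused")
        combo = self.combos[self.index % len(self.combos)]
        self.index += 1
        return combo

    def interrupt(self) -> str:
        """Pause the round and return the most recently called combination."""
        self.phase = Phase.EXPLAINING
        return self.combos[max(self.index - 1, 0) % len(self.combos)]

    def resume(self) -> None:
        """Continue from where the round stopped; the index is untouched."""
        self.phase = Phase.ROUND


if __name__ == "__main__":
    session = RoundSession(["jab-cross", "jab-cross-hook"])
    print(session.next_call())   # called mid-round
    print(session.interrupt())   # combination to break down
    session.resume()
    print(session.next_call())   # round continues with the next combination
```

Keeping the position in plain application state, rather than relying on the model to remember it, is one way to make "resumes without losing context" a property of the system instead of a hope.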
The architecture looked straightforward — until the gym got involved
On the backend, we ran the system on Cloud Run, with credentials managed through Secret Manager. Cloud Run was a strong fit because it supports WebSockets and autoscaling, which made it practical to handle live bidirectional sessions without standing up heavier infrastructure too early.
The live interaction layer sat on top of Gemini Live API. That gave us stateful audio sessions, multimodal input, interruption handling, and the ability to pass structured events into the conversation through tools and function calling.
On paper, it looked elegant.
In reality, the gym broke our first assumptions almost immediately.
Challenge one: the gym is an adversarial audio environment
The first version failed because we built for a clean room and deployed into chaos.
A boxing gym is full of hostile audio: music, coaches shouting, bags swinging, gloves cracking, shoes scraping, people talking, sparring, breathing. None of that is background noise in the usual product sense. It is the environment itself.
This is where native audio became more than a feature checkbox.
Google describes Gemini’s native audio capabilities as processing spoken interaction natively, rather than flattening it into a plain transcript. Tone, pace, prosody, and nonverbal cues are part of the signal, which makes the interaction better suited to real-time spoken conversation.
That mattered because fighters don’t talk like office users. Mid-round speech is clipped, breathless, urgent, distracted. A tired “break it down” and a calm “break it down” are not the same request, even if the words are identical. In a traditional speech-to-text-to-LLM-to-speech chain, a lot of that context disappears at the first boundary. Native audio preserved more of it.
That alone didn’t solve the problem, but it changed the feel of the interaction. The coach began responding to the athlete as a person under effort, not as a transcript.
We also had to re-engineer around latency. Google’s own Live API guidance recommends sending small chunks of audio rather than buffering large ones, precisely because low latency is central to the experience. In a boxing session, that recommendation is not an optimization. It is the product.
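The small-chunk idea can be sketched in a few lines. This is a hedged example, not the project's code: the 16 kHz, 16-bit mono format and the 40 ms chunk duration are assumptions chosen for illustration, and `chunk_pcm` is a hypothetical helper. How each chunk is actually sent over the live session is deliberately left out.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumed capture rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
CHUNK_MS = 40            # illustrative chunk duration, not a documented value


def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM so each can be sent immediately."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1 s of silence
    chunks = list(chunk_pcm(one_second))
    print(len(chunks), len(chunks[0]))  # 25 chunks of 1280 bytes each
```

With these assumed numbers, the first audio leaves the device after 40 ms instead of after a full second of buffering, which is the difference the athlete hears as responsiveness.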
Challenge two: vision is useful — but only inside its limits
The second big lesson was about vision.
At first glance, “AI coach with a camera” sounds like one capability. In practice, it splits into two very different tasks.
One is static or slow-changing observation: is the guard low, is the athlete idle, is the stance visibly off, is the fighter still in frame?
The other is fast movement analysis: did that jab fully extend, did the rear hand recover, was the weight transfer right, what combination was actually thrown?
Those are not the same problem.
Google’s documentation for Gemini Live is clear that live video is processed at up to 1 frame per second, which makes it unsuitable for use cases involving fast-changing motion such as high-speed sports.
That single constraint forced a major product decision.
A boxing combination often happens in well under a second. At 1 FPS, you are not observing the combination. You are sampling around it. So we stopped pretending that live multimodal vision could do frame-accurate boxing analysis and narrowed the role of vision to what it could do credibly:
notice a dropped guard,
detect inactivity,
confirm fighter presence,
flag obvious static posture issues.
Once we respected that boundary, the system became much more reliable.
This was one of the most important decisions we made. A lot of AI products become less useful because they refuse to admit where the model stops being trustworthy. In our case, usefulness came from reducing the scope of live vision, not expanding it.
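The sampling argument is simple arithmetic, sketched below. The punch and combination durations are illustrative assumptions, not measurements from the project, and the 30 FPS comparison rate is likewise just an example of a dedicated camera pipeline.

```python
def expected_frames(duration_s: float, fps: float) -> float:
    """Average number of frames that fall inside a motion of this length."""
    return duration_s * fps


def chance_of_any_frame(duration_s: float, fps: float) -> float:
    """Probability that at least one frame lands inside the motion,
    assuming the motion starts at a random point between frames."""
    return min(1.0, duration_s * fps)


if __name__ == "__main__":
    punch_s = 0.25  # assumed duration of a single punch
    combo_s = 0.8   # assumed duration of a short combination
    for fps in (1.0, 30.0):
        print(
            f"{fps:>4} FPS: punch frames={expected_frames(punch_s, fps)}, "
            f"punch seen={chance_of_any_frame(punch_s, fps):.0%}, "
            f"combo frames={expected_frames(combo_s, fps)}"
        )
```

Under these assumptions, a single punch is caught by any frame at all only about a quarter of the time at 1 FPS, and even then by one frame, which says nothing about extension or recovery. A guard that stays low for several seconds, by contrast, shows up in several consecutive frames.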
The real breakthrough was not more intelligence — it was better orchestration
The biggest improvement to the product came when we stopped asking the model to infer everything from raw media.
Instead, we began experimenting with passing structured triggers into the live session.
Gemini Live supports tool use and function calling inside ongoing sessions, which means an application can inject external state and events into the model’s context while the conversation is still unfolding.
That gave us a much cleaner architecture.
Rather than asking the model to “figure out” the whole situation from live audio and vision alone, we could tell it what mattered:
the fighter requested an explanation,
the round is paused,
the combo is complete,
the guard dropped,
the athlete has gone idle,
the session should resume.
Once those signals were explicit, the model could do what it is genuinely good at: timing, tone, phrasing, pacing, and adapting speech naturally.
That was also the moment when the coach stopped feeling scripted.
Originally, we tried to control the system by writing too much of what it should say. Over time, we learned that the stronger approach was to define how the coach should think instead of scripting exact responses. Brief during the round. Explanatory on interruption. Corrective only when confidence is high. Motivational, but never theatrical. Always in service of rhythm.
That shift mattered more than any individual prompt tweak.
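The structured triggers listed earlier in this section could be modelled roughly as follows. This is a sketch under assumptions: the `Trigger` and `CoachEvent` names, the field layout, and the JSON shape are all hypothetical, and the step that hands the payload to the live session (for example as a tool response) is not shown because it depends on the client library in use.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from enum import Enum


class Trigger(str, Enum):
    """The six signals named in the article, as explicit event types."""
    EXPLANATION_REQUESTED = "explanation_requested"
    ROUND_PAUSED = "round_paused"
    COMBO_COMPLETE = "combo_complete"
    GUARD_DROPPED = "guard_dropped"
    ATHLETE_IDLE = "athlete_idle"
    SESSION_RESUME = "session_resume"


@dataclass
class CoachEvent:
    """Hypothetical event record passed from app logic to the live model."""
    trigger: Trigger
    confidence: float = 1.0
    detail: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

    def to_payload(self) -> str:
        """Serialize to a compact JSON string the model can read as context."""
        body = asdict(self)
        body["trigger"] = self.trigger.value
        return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    event = CoachEvent(Trigger.GUARD_DROPPED, confidence=0.9, detail={"side": "right"})
    print(event.to_payload())
```

The point of the explicit type is that the model never has to guess whether the guard dropped. It only has to decide how to say so.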
Punch detection exposed the limits of a single-model approach
One of the hardest technical gaps was punch detection.
Humans can hear the rhythm of a round immediately. They can distinguish clean impact from weak contact, combinations from single shots, work rate from drift. The model could respond conversationally, but it did not natively interpret punch sounds with the level of specificity the product needed.
So we explored a separate waveform-based approach to detect impact patterns and convert them into structured events. The goal was not perfect semantic understanding of every strike. The goal was to give the system enough deterministic signal to know when work was happening, when a burst ended, or when a silence after effort meant something.
That experiment taught us a larger lesson.
The future production version of this kind of product should not be built around one model doing everything. It should be hybrid by design.
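One simple way to turn a waveform into impact events is an energy threshold with a refractory period, sketched below. This is a stand-in technique chosen for illustration, not necessarily the approach explored in the project. The function names, the 10 ms frame size, the 0.3 threshold, and the 150 ms refractory window are all assumptions, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive, non-overlapping frames."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def detect_impacts(
    samples: Sequence[float],
    sample_rate: int = 16_000,
    frame_ms: int = 10,
    threshold: float = 0.3,
    refractory_ms: int = 150,
) -> List[float]:
    """Return times (seconds) of frames whose RMS reaches the threshold,
    ignoring further hits until the refractory window has passed."""
    frame_len = sample_rate * frame_ms // 1000
    refractory_frames = refractory_ms // frame_ms
    onsets: List[float] = []
    last = -refractory_frames  # allows a hit in the very first frame
    for i, rms in enumerate(frame_rms(samples, frame_len)):
        if rms >= threshold and i - last >= refractory_frames:
            onsets.append(i * frame_ms / 1000)
            last = i
    return onsets


if __name__ == "__main__":
    signal = [0.0] * 16_000              # 1 s of silence
    for start in (3_200, 9_600):         # two 20 ms bursts at 0.2 s and 0.6 s
        for i in range(320):
            signal[start + i] = 0.8
    print(detect_impacts(signal))        # [0.2, 0.6]
```

A fixed threshold like this would re-trigger on sustained loud sound such as music, so a real gym would likely need an adaptive noise floor or band filtering on top. The sketch only shows how raw audio becomes a list of timestamps that deterministic logic can reason about.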
Why YOLO became the obvious next step
The moment we accepted that 1 FPS live vision would never be enough for punch-level analysis, the next step became clearer: use a dedicated real-time vision stack for fast movement.
That is where YOLO-style models started to make much more sense. Ultralytics positions YOLO as a real-time computer vision family built for object detection and related tasks such as tracking and pose estimation, with an emphasis on speed suitable for video applications.
For boxing, that matters.
A dedicated detector or pose-based model can run at frame rates that actually capture athletic movement. It can observe glove trajectories, shoulder motion, recovery patterns, stance geometry, and combination flow over adjacent frames. That is the right substrate for analyzing punching mechanics in real time.
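As a rough illustration of what "observing over adjacent frames" can mean, the sketch below takes per-frame arm keypoints and reports the frames where an arm goes from retracted to extended. It assumes the keypoints already come from some pose model; no specific library API is used here. `ArmPose`, `extension`, `detect_extensions`, and both thresholds are hypothetical, and an extension event alone does not distinguish a jab from any other straight punch.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class ArmPose:
    """2D keypoints for one arm in one frame (assumed pose-model output)."""
    shoulder: Point
    elbow: Point
    wrist: Point


def extension(arm: ArmPose) -> float:
    """Ratio of shoulder-to-wrist reach to total arm length.
    1.0 means a fully straight arm; lower values mean a bent arm."""
    reach = math.dist(arm.shoulder, arm.wrist)
    limb = math.dist(arm.shoulder, arm.elbow) + math.dist(arm.elbow, arm.wrist)
    return reach / limb if limb else 0.0


def detect_extensions(
    frames: List[ArmPose], extend_at: float = 0.9, retract_at: float = 0.6
) -> List[int]:
    """Frame indices where the arm reaches extension after being retracted.
    The two thresholds form a hysteresis band so jitter does not double count."""
    events: List[int] = []
    armed = False
    for i, arm in enumerate(frames):
        ratio = extension(arm)
        if ratio <= retract_at:
            armed = True
        elif armed and ratio >= extend_at:
            events.append(i)
            armed = False
    return events


if __name__ == "__main__":
    bent = ArmPose((0.0, 0.0), (1.0, 0.0), (0.2, 0.0))
    straight = ArmPose((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
    print(detect_extensions([bent, bent, straight, straight, bent, straight]))  # [2, 5]
```

At 30 frames per second, consecutive indices are about 33 ms apart, so the gap between an extension event and the next retracted frame gives a first approximation of how quickly the hand comes back.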
And once that subsystem produces structured events — jab detected, rear hand late on return, combination length three, tempo drop — a live conversational model can turn those signals into something useful and human: a timely correction, a concise cue, a motivational intervention, an explanation between bursts.
That architecture is simply more honest.
Gemini Live handles voice, interruption, natural dialogue, and coaching delivery.
Dedicated vision models handle sub-second motion analysis.
Deterministic logic decides when an event is important enough to surface.
Humans remain the final authority for deep technical critique.
That is not a compromise. It is the product maturing.
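The deterministic layer in that split can be very small. Below is a minimal sketch of a gate that surfaces an event only when confidence is high enough and the same kind of correction was not given too recently. The `EventGate` name, the 0.8 confidence floor, and the 10 second cooldown are assumptions for illustration, not values from the project.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EventGate:
    """Hypothetical filter between detectors and the live coaching voice."""
    min_confidence: float = 0.8
    cooldown_s: float = 10.0
    _last_surfaced: Dict[str, float] = field(default_factory=dict)

    def should_surface(self, kind: str, confidence: float, now: float) -> bool:
        """Return True if this event should be passed on to the coach."""
        if confidence < self.min_confidence:
            return False  # not credible enough to interrupt the round
        last = self._last_surfaced.get(kind)
        if last is not None and now - last < self.cooldown_s:
            return False  # same correction was given too recently
        self._last_surfaced[kind] = now
        return True


if __name__ == "__main__":
    gate = EventGate()
    print(gate.should_surface("guard_dropped", 0.9, now=0.0))   # True
    print(gate.should_surface("guard_dropped", 0.95, now=5.0))  # False, in cooldown
    print(gate.should_surface("guard_dropped", 0.9, now=10.0))  # True
```

Passing `now` in explicitly keeps the gate free of clock calls, which makes the timing rules easy to test without waiting in real time.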
Post-round analysis turned out to be a better fit for multimodal AI
Interestingly, while live vision had hard limits, post-round analysis looked much more promising.
Once the time pressure disappears, the model can contribute in a different way. Reviewing clips after the round, Gemini was useful for simple observations: rough punch counts, repeated guard drops, visible stance patterns, and basic geometric relationships in posture. Where we became cautious was in higher-order technique judgment. The line we ended up trusting was straightforward:
simple counts and static measurements, yes; deep technical critique, not without human review.
That distinction suggests a much stronger commercial direction than “AI replaces coaching.”
A better product story is this: AI handles triage, summarization, and pattern surfacing. Human coaches handle expertise, interpretation, and individualized correction. In practice, that could become the basis for a trainer marketplace or remote analysis workflow — AI flags the moments, humans deliver the real coaching judgment.
That is a far more credible future than pretending the model can do all of it alone.
The biggest product lesson: the magic is in the allocation
The most important thing we learned is that the core challenge was never just model capability.
It was allocation.
What belongs to the live model?
What belongs to deterministic software?
What belongs to a specialized vision system?
What still belongs to a human coach?
Once we started answering those questions honestly, the product improved quickly.
The live model is excellent at spoken interaction, interruption, rhythm, and adaptive delivery. Google’s own documentation supports exactly that reading of the technology: low-latency live sessions, native audio, expressive interaction, and tool-mediated context.
The live model is not the right place for frame-accurate analysis of fast boxing exchanges, and Google’s video limits make that explicit.
Fast physical analysis belongs to dedicated vision systems.
Judgment still belongs to humans.
And the thing that makes all of it feel like one product is the orchestration layer in between.
That is where the real work lives.
Why this mattered beyond boxing
What made Cornerman AI compelling to us was not just that it worked in a niche athletic setting. It was that boxing exposed something true about real-time AI products more generally.
If the environment is noisy, physical, and time-sensitive, then the model alone is never the product.
The product is the system around it:
the transport,
the session design,
the event layer,
the safety boundaries,
the timing logic,
the division of responsibility between probabilistic and deterministic components.
Boxing just made those requirements impossible to ignore.
There is nowhere to hide in a gym. Either the intervention lands at the right time, or it doesn’t. Either the voice feels alive, or it doesn’t. Either the correction is credible, or it isn’t.
That is why this project was so clarifying.
From the bag to the cloud
Cornerman AI started with a simple training frustration: the moment when solo work loses shape because no one is there to guide it.
What we ended up building was not an “AI that knows boxing” in some broad, abstract sense. It was something more practical: a live coaching system assembled from the right parts, each used for what it actually does well.
Cloud Run gave us a scalable real-time backend.
Gemini Live gave us expressive voice, interruption, and live multimodal interaction.
Dedicated vision models pointed toward the next step for movement analysis.
But the deeper lesson had less to do with tools than with product design.
The system became real when we stopped trying to make one model do everything.
That was the shift.
Voice where voice matters.
Vision where vision is fast enough.
Triggers where timing is non-negotiable.
Humans where expertise still matters most.
For us, that is what building with modern AI looks like when the domain is physical, messy, and live.
Not magic.
Just good orchestration.