Beyond Chatbots: Unlocking Business Value with Multimodal LLMs
Most product teams start and stop at text-based AI. But the real world isn't text-only — people talk, point, watch, and interpret visuals. Here's how multimodal LLMs open up capabilities that chatbots alone can't touch, and how product managers can put them to work.
Presented at ProductHive Warsaw | June 11, 2025
The World Isn't Text-Only
When we think about generative AI in products, most of us default to text: chatbots, summarization, search. That makes sense — text is where large language models started, and it's where most teams have the most experience.
But users don't operate in a text-only world. They speak, they scan images, they watch video, they interpret tone. If we want AI to integrate naturally into real workflows, it needs to handle more than words on a screen.
Modern multimodal LLMs can now process and generate across four core modalities — text, image, audio, and video — and the tooling has matured enough to build with them today. This isn't a research preview anymore. It's a product opportunity.
What "Multimodal" Actually Means
In the context of LLMs, a modality is a type of input or output data the model can process. Traditional LLMs handled text only. Multimodal models can "see" images, "hear" audio, and "watch" video in addition to reading and generating text.
Text acts as the universal hub — most modality combinations pass through it. Some combinations, like audio-to-audio or image-to-video, unlock capabilities you can't achieve through text alone. But not every pairing is practical. Audio-to-image, for instance, almost always routes through a text intermediary. The principle is straightforward: multimodality works best when each link in the chain brings something genuinely new.
A useful framework for product thinking: start with a modality, explore a key combination, identify the capability it unlocks, then connect it to a concrete use case.
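To make that framework tangible, here is a minimal sketch in Python. The catalog entries come from examples in this article; the data structure and function names are my own illustration, not a standard tool:

```python
from dataclasses import dataclass

@dataclass
class Combination:
    modalities: tuple[str, str]  # (input modality, output modality)
    capability: str              # what the combination unlocks
    use_case: str                # a concrete product application

# Illustrative entries, drawn from the examples discussed in this article.
CATALOG = [
    Combination(("text", "audio"), "natural speech from written content", "narrated audiobooks"),
    Combination(("audio", "text"), "structure from spoken content", "searchable meeting transcripts"),
    Combination(("image", "text"), "machine-readable visuals", "invoice OCR"),
    Combination(("text", "video"), "video from a script", "onboarding content without filming"),
]

def unlocks(input_mod: str, output_mod: str) -> list[Combination]:
    """Return the capabilities a given modality pairing unlocks."""
    return [c for c in CATALOG if c.modalities == (input_mod, output_mod)]

for c in unlocks("audio", "text"):
    print(f"{c.capability} -> {c.use_case}")
```

The point isn't the code itself: forcing each idea into the modality/combination/capability/use-case shape quickly exposes pairings with no real product payoff.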
Audio: The Dynamic Interface
Audio is quickly becoming essential to how users interact with AI. Three combinations matter most.
Text-to-audio converts written content into natural speech. Publishers generate narrated audiobooks in minutes. Customer service systems deliver real-time support across languages. Text-to-music is gaining traction in marketing and media production. The throughline is scalable, expressive audio for education, entertainment, and global communication.
Audio-to-text brings structure to spoken content. Transcription tools turn meetings, interviews, and podcasts into searchable text. Real-time captioning makes live events accessible. Voice commands power hands-free tools for frontline workers who can update systems without touching a screen. This modality bridges the gap between what's said and what systems can act on.
Audio-to-audio handles transformations within sound itself. Voice cloning is used in games, podcasts, and virtual assistants. Voice enhancement improves clarity in noisy environments — critical for remote teams and call centers. Accent modification helps global teams communicate more smoothly in real time. Google recently introduced live translation in Google Meet for English-Spanish, which is a meaningful step toward truly multilingual communication.
The audio AI landscape is vast, with hundreds of specialized tools across B2C (education, productivity, assistants) and B2B (training, finance, customer service). Horizontal platforms let you assemble components for your specific use case.
Building a Voice Bot: Easier Than You Think, Harder Than It Looks
To make this concrete, I built a voice-based conference concierge called HiveMind. The architecture is simple: transcribe voice to text, pass it to an LLM augmented with conference-specific data, then convert the response back to natural speech. I built it using ElevenLabs — set a system prompt, chose an LLM, fed in session data and speaker bios, picked a voice model, and deployed it.
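That three-stage flow can be sketched in a few lines. The vendor calls are replaced here with deterministic stubs so the pipeline logic stands on its own; in the real build, `transcribe` and `synthesize` were ElevenLabs calls and `answer` was an LLM grounded in the conference data. All function names and session facts below are illustrative:

```python
# Hypothetical conference data the LLM would be grounded in.
SESSIONS = {
    "keynote": "The keynote starts at 9:00 in the main hall.",
    "multimodal": "The multimodal LLMs talk is at 11:30 in room B.",
}

def transcribe(audio: bytes) -> str:
    """Stub for a speech-to-text call (e.g. an ElevenLabs or Whisper client)."""
    return audio.decode("utf-8")  # stand-in: treat the 'audio' as ready text

def answer(question: str) -> str:
    """Stub for the LLM step: look up the conference data it would be fed."""
    for keyword, fact in SESSIONS.items():
        if keyword in question.lower():
            return fact
    return "Sorry, I don't have that session in my data."

def synthesize(text: str) -> bytes:
    """Stub for a text-to-speech call; a real client returns audio bytes."""
    return text.encode("utf-8")

def voice_bot(audio: bytes) -> bytes:
    """The whole pipeline: speech in, grounded answer, speech out."""
    return synthesize(answer(transcribe(audio)))

print(voice_bot(b"When is the keynote?").decode())
```

Notice that everything interesting happens in `answer`: swap the stubs for real services and the architecture doesn't change, which is exactly why the data feeding that middle step matters more than the voice wrapper around it.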
The prototype was straightforward. But here's the reality check: shipping a production-grade voice assistant is a different challenge entirely. Real-time performance, user interruptions, audio data privacy, and backend integration quality all demand serious planning. Your voice solution is only as good as its ability to extract real user intent — and that depends on the quality of your data pipeline, not the novelty of the interface.
Image: From Generation to Transformation
AI unlocks three distinct ways to work with images: generating them from text, understanding their content, and transforming them based on other images.
Text-to-image turns language into visuals. Marketing teams generate ad creatives from prompts. Educators produce conceptual illustrations that make abstract ideas tangible. Custom avatars let users generate digital identities for gaming, profiles, or virtual events.
Image-to-text makes visual content machine-readable. Retail apps use visual search so users can upload a photo and find similar products instantly. OCR tools extract text from invoices and scanned documents — essential for digitization in finance and logistics. Object recognition supports warehouse inventory tracking. Scene understanding interprets full context — people, settings, activities — enabling better image classification.
Image-to-image transforms and upgrades existing visuals. Virtual try-ons let fashion retailers show products on customers without a fitting room. Style transfer enables marketing teams to re-skin visuals for seasonal or regional branding without a new photoshoot. Image enhancement sharpens medical scans in telehealth and improves clarity in low-light security footage. Designers use AI-powered editing to remove backgrounds, fix lighting, and retouch photos in seconds.
The global AI image generator market is projected to reach $420 million in 2024, growing at 18% annually. Industry leaders like OpenAI (DALL·E), Google (Imagen), Adobe (Firefly), Midjourney, and Stability AI (Stable Diffusion) are pushing quality boundaries and integrating these capabilities into mainstream design tools.
A Real-World Test: Professional Photography vs. AI Enhancement
My wife is a professional photographer. For two years, I kept telling her about AI image generation — she wasn't impressed. Then she got a new project: ad images for a restaurant. I suggested she could take phone shots and enhance them with AI instead of using professional gear.
She ran a blind test. She showed the client two sets: images shot with professional equipment and manually retouched, alongside photos taken on a phone and enhanced with AI. The client chose the AI-enhanced version. That shifted her perspective entirely. She saw a real niche: fast, affordable culinary photography where customers are willing to pay for the speed and quality AI enables.
Video: The Most Complex Modality
Video is essentially moving images over time, often combined with audio. AI is transforming it across four directions: generating, understanding, animating, and enhancing visual content.
Text-to-video turns scripts into dynamic visuals. Marketing teams create promotional videos tailored to different audiences. Training teams generate onboarding content without filming. Tools like Synthesia produce talking avatars from plain text for internal communications, HR, or news updates. This modality puts video production in the hands of any team, regardless of budget.
Video-to-text extracts meaning from video, making it searchable and analyzable. Meeting platforms use live transcription for notes and action items. Compliance teams rely on automated scene summaries to review security footage. Media editors use highlight summarization to clip key moments from long events. The goal is making video content as usable as a document.
Image-to-video brings static images to life. Real estate platforms turn property photos into immersive video tours. Social media tools animate selfies and product shots into engaging clips. Marketing teams convert product images into promotional videos for dynamic ads.
Video-to-video improves existing footage by enhancing resolution, smoothing motion, changing styles, or swapping backgrounds. Personalized digital avatars have improved dramatically — earlier versions generated from static photos felt flat. Now, tools like D-ID create hyper-realistic avatars from short video clips that move, blink, and feel natural.
The global AI video generator market is projected to reach $706.6 million in 2024, growing at 20.6% annually. Key innovators include OpenAI (Sora), Google (Veo), Runway, Pika, and Synthesia.
How I Built a Presentation Teaser with AI
To demonstrate multimodal workflows in practice, I created a video teaser for this presentation using a chain of AI tools. I prompted ChatGPT to write the script based on my presentation content. I generated custom images for each scene using OpenAI's gpt-image-1 model. I animated those images with Sora. I cloned my voice and generated narration with ElevenLabs. I created a digital avatar with D-ID. And I assembled everything in Canva.
Each tool handled a different modality. Together, they produced something none could have built alone. That's the power of multimodal thinking applied end-to-end.
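One practical habit when chaining tools like this: check that each stage's output modality actually feeds the next stage's input before wiring real APIs together. A toy sanity check (the tool names are from the workflow above; the validation logic is my own sketch, and narration runs as a parallel branch rather than part of this chain):

```python
# Each stage: (tool, input modality, output modality).
VISUAL_CHAIN = [
    ("ChatGPT", "text", "text"),        # presentation content -> script
    ("gpt-image-1", "text", "image"),   # script scenes -> still images
    ("Sora", "image", "video"),         # still images -> animated clips
]

def chain_is_valid(stages: list[tuple[str, str, str]]) -> bool:
    """True if every stage's output modality matches the next stage's input."""
    return all(
        out == nxt_in
        for (_, _, out), (_, nxt_in, _) in zip(stages, stages[1:])
    )

print(chain_is_valid(VISUAL_CHAIN))
```

Trivial as it looks, this kind of check mirrors the real planning work: the ElevenLabs narration and D-ID avatar branched off the script in parallel, and mapping those branches explicitly is what kept the assembly step in Canva manageable.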
But video AI still has notable limitations. Most models support only short clips, quality may not match traditional production, and outputs can be unpredictable. Legal and ethical risks are heightened with video, especially around deepfakes and content rights — strong safeguards are non-negotiable before launching.
Prototyping: Where PMs Build Intuition
You might reach for platforms like Replit, Lovable, or v0 — and they work well. But for AI and data product proofs of concept, frameworks like Streamlit are often good enough.
Here's what I've found most valuable about hands-on prototyping: it forces clarity. When you use an LLM to generate code for a working prototype, you immediately see what you've missed or underspecified in your requirements. The model we used in one demo, gpt-image-1, had just launched — there's no way ChatGPT would have picked it on its own. I had to share documentation and guide it.
I don't believe PMs need to be technical. But having a solid understanding of what these systems can and can't do makes a measurable difference. Being hands-on with prototypes helps you write better specs, faster. That's one of the most valuable skills a PM can build today.
Challenges Across Modalities
Each modality comes with its own set of challenges.
Audio requires robust infrastructure for real-time performance and careful handling of privacy and compliance around voice data. Image generation can get expensive at scale, carries IP and copyright risks, and demands ongoing prompt tuning plus human review to maintain brand-safe quality. Video is the most complex — short clip limitations, unpredictable outputs, and the highest legal and ethical exposure, particularly around deepfakes.
The common thread: cool technology alone isn't enough. Every modality needs a clear answer to the question "what's the real value for the user?"
Takeaways for Product Managers
Think beyond chat. Text is the starting point, not the finish line. Multimodal LLMs open capabilities that text alone can't deliver.
Prioritize business value over novelty. Focus on modality combinations that solve real problems, not demonstrations that impress in a meeting but stall in production.
Prototype fast, scale with discipline. Rapid prototyping tools make it easy to validate ideas. But the transition from proof of concept to production for AI products is harder than for traditional software. Plan for it.
Stay close to users. The most valuable multimodal AI is deeply integrated into real workflows, not bolted on as a feature.
Invest in responsible AI. As you deploy across modalities, keep privacy, intellectual property, and ethics front and center. The stakes increase as you move from text to audio to image to video.
Start Small, Start Now
Don't let multimodal AI remain a buzzword. This week, pick one workflow in your product and run a quick prototype using text, image, or voice. Share your results with your team. The real value of generative AI is unlocked through iteration, not speculation.