Beyond Text: Exploring the World of Multimodal Generative AI
Generative AI is moving beyond text. Discover multimodal AI, how it combines text, images, audio, and video, and its transformative potential across industries.
2025 is becoming a key year for AI in business, and multimodal learning is the main reason why. Unlike older models that only handle text, multimodal AI works with images, video, and audio, too. This allows it to understand context much more deeply, leading to smarter, more accurate, and more natural-feeling results.
For businesses, this opens the door to analyzing complex, mixed-media data, simplifying workflows, and making AI-driven insights easier for everyone to use. It’s quickly becoming an essential tool for modern operations.
Here are a few key benefits of this approach:
- Enhanced Contextual Understanding: By processing diverse data types (images, video, audio, text), these models achieve a deeper contextual grasp, akin to human intuition. This leads to more accurate interpretations and relevant responses.
- Rich, Personalized Outputs: Learning from multiple modalities allows for outputs that are not just precise but also highly customized, fostering more natural and intuitive user interactions.
- Advanced Data Applications: This technology empowers businesses to analyze complex mixed-media data, streamline multi-format workflows, and broaden access to AI-driven insights.
- Creative Empowerment: Artists, writers, and creators can use these tools to bring their ideas to life in new ways, from instantly illustrating stories to animating concepts.
- Improved Accessibility: Multimodal models can make technology more inclusive by describing images for users with visual impairments or translating sign language in real time.
Multimodal Input: The Next Leap in AI Interaction
It’s not just about what AI can create—it’s also about how we interact with it. The latest systems can take in multiple types of input at once, making conversations with AI feel more like talking to another person.
Understanding Multimodal Input Capabilities
Modern multimodal LLMs like GPT-4V, Claude 3 Opus, and Gemini Pro can now process:
- Images alongside text queries: “What’s in this image?” or “Can you describe what’s wrong with this circuit diagram?”
- Audio alongside text: “Translate this speech recording” or “What emotions do you detect in this audio clip?”
- Video understanding: “Summarize what happens in this video clip” or “Identify the key events in this recording”
This enables a much more flexible and powerful way to interact with AI systems, one that better matches how humans naturally communicate. The following Python snippet demonstrates how a multimodal model (CLIP, in this case) can be used to analyze an image.
# pip install transformers torch pillow requests
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import requests

# Load a CLIP model (this can be finetuned on medical data)
model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Example image URL
image_url = "https://example.com/chest_xray.jpg"

try:
    response = requests.get(image_url, stream=True)
    response.raise_for_status()
    image = Image.open(response.raw)
except Exception as e:
    print(f"Error loading image: {e}")
    image = Image.new('RGB', (224, 224), color='gray')

# Define possible medical findings
medical_texts = [
    "a normal chest X-ray with clear lungs",
    "chest X-ray showing pneumonia",
    "chest X-ray with pleural effusion",
    "chest X-ray showing pneumothorax",
    "chest X-ray with enlarged heart",
    "chest X-ray showing pulmonary edema",
    "chest X-ray with lung collapse",
    "chest X-ray showing lung nodules",
    "chest X-ray with lung consolidation",
    "chest X-ray showing emphysema"
]

inputs = processor(
    text=medical_texts,
    images=image,
    return_tensors="pt",
    padding=True
)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probabilities = torch.softmax(logits_per_image, dim=-1)

# Get top predictions
top_indices = torch.argsort(probabilities, descending=True)[0]

print("Medical findings analysis for chest X-ray:")
print("-" * 50)
for i in range(min(5, len(medical_texts))):
    idx = top_indices[i]
    finding = medical_texts[idx]
    confidence = probabilities[0][idx].item()
    print(f"{finding}: {confidence:.3f} ({confidence*100:.1f}%)")

print(f"\nMost likely finding: {medical_texts[top_indices[0]]}")
print(f"Confidence: {probabilities[0][top_indices[0]].item()*100:.1f}%")
Real-Time Multimodal Communication
One of the most exciting developments is the ability to create real-time multimodal communication systems. Libraries like FastRTC are making it easier to build applications that can process audio, video, and text in real-time:
# Example of a voice chat system using FastRTC with Gemini
from fastrtc import Stream, ReplyOnPause, get_stt_model, get_tts_model
from google import genai

stt_model = get_stt_model()
tts_model = get_tts_model()
client = genai.Client(api_key="GEMINI_API_KEY")

def voice_llm_chat(audio):
    # Convert speech to text
    user_message = stt_model.stt(audio)

    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=user_message,
        config={"max_output_tokens": 200},
    )

    # Convert response to speech
    ai_response = response.text
    for audio_chunk in tts_model.stream_tts_sync(ai_response):
        yield audio_chunk

# Create and launch the stream
stream = Stream(ReplyOnPause(voice_llm_chat), modality="audio", mode="send-receive")
stream.ui.launch()
This example demonstrates how easily modern tools can create voice-based AI assistants that understand speech, process it with an LLM, and respond with synthesized speech—all in real-time.
Multimodal Context Windows and Chain-of-Thought
The latest multimodal models can maintain context across different modalities. For example, you can share an image, ask a question about it in text, and then follow up with questions that refer back to both the image and the model's earlier answers. This creates a much more dynamic interaction than was previously possible:
# Conceptual example of a multimodal conversation
conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "Can you explain what's happening in this circuit?"},
        {"type": "image", "image_url": "https://example.com/circuit_diagram.jpg"}
    ]},
    {"role": "assistant", "content": "This appears to be an amplifier circuit using a transistor. The input signal comes in through the capacitor C1 on the left."},
    {"role": "user", "content": "What would happen if I increased R2's resistance?"},
    {"role": "assistant", "content": "If you increase R2 (the resistor connected to the collector), you would increase the gain of the amplifier but potentially reduce the maximum output swing due to the voltage drop across R2."}
]
Video Understanding and Processing
Recent advancements have also enabled AI systems to understand video content, which combines spatial understanding (images) with temporal understanding (how things change over time):
import os
import cv2
import numpy as np
import torch
from transformers import AutoProcessor, AutoModelForVideoClassification

model_name = "MCG-NJU/videomae-base-finetuned-ssv2"

# Load model for video classification
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVideoClassification.from_pretrained(model_name)

# Ensure model is on GPU if available
if torch.cuda.is_available():
    model.to("cuda")

# Function to process video
def analyze_video(video_path, sample_frames=16):
    if not os.path.exists(video_path):
        print(f"Error: Video file not found at {video_path}")
        return []

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Error: Could not open video file at {video_path}")
        return []

    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if frame_count == 0:
        print(f"Error: No frames found in video at {video_path}")
        cap.release()
        return []

    actual_sample_count = min(sample_frames, frame_count)
    indices = np.linspace(0, frame_count - 1, actual_sample_count, dtype=int)

    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            # Convert BGR (OpenCV default) to RGB
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame)
    cap.release()

    if not frames:
        print(f"Warning: No frames extracted from {video_path}.")
        return []

    if len(frames) < sample_frames:
        while len(frames) < sample_frames:
            frames.append(frames[-1])
    elif len(frames) > sample_frames:
        frames = frames[:sample_frames]

    inputs = processor(images=frames, return_tensors="pt")

    # Move inputs to GPU if available
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    k = min(3, probs.shape[-1])
    top_prob, top_label_indices = torch.topk(probs, k=k)

    predicted_labels = []
    for prob, label_idx in zip(top_prob[0], top_label_indices[0]):
        predicted_labels.append((model.config.id2label[label_idx.item()], prob.item()))
    return predicted_labels

results = analyze_video("painting.mp4")

if results:
    print("\n--- Video Classification Results ---")
    for label, confidence in results:
        print(f"{label}: {confidence:.4f}")
else:
    print("Could not analyze video or no results obtained.")
With models like Gemini Pro, SmolVLM2, and others, developers can now create applications that understand video content, answer questions about it, and generate descriptions—expanding the frontier of AI capabilities beyond static images.
Key Modalities and Examples
Text & Image: The most mature area, with powerful text-to-image generation and image captioning models. Here’s a simplified example using the Hugging Face diffusers library to generate an image from a text prompt with Stable Diffusion:

from diffusers import StableDiffusionPipeline
import torch

# Load model
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # Move to GPU

# Generate image from text
prompt = "A city after an apocalypse where trees are growing on buildings, most buildings are destroyed, futuristic architecture, " \
         "overgrown nature, vibrant colors"
image = pipe(prompt).images[0]

# Save the image
image.save("futuristic_city.png")
And here’s a simplified example of image captioning using the BLIP model:
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

# Load model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load image (from URL or local file)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # Example: an image of cats

try:
    # For URL
    image = Image.open(requests.get(url, stream=True).raw).convert('RGB')

    # For local file:
    # image = Image.open("path/to/your/image.jpg").convert('RGB')

    # Prepare inputs
    inputs = processor(image, return_tensors="pt")

    # Generate caption
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)

    print(f"Generated caption: {caption}")
except Exception as e:
    print(f"Error: {e}")
    print("Could not generate caption.")
Text & Audio: Generating speech from text (Text-to-Speech), generating music or sound effects from text (Text-to-Audio).
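As a minimal sketch of the text-to-speech side, the Hugging Face transformers text-to-speech pipeline can be used with an off-the-shelf checkpoint (suno/bark-small is assumed here; any compatible model would work similarly):

# pip install transformers scipy
# Minimal text-to-speech sketch; "suno/bark-small" is one example checkpoint
from transformers import pipeline
import scipy.io.wavfile as wavfile
import numpy as np

tts = pipeline("text-to-speech", model="suno/bark-small")
speech = tts("Multimodal AI combines text, images, audio, and video.")

# The pipeline returns a dict with the waveform and its sampling rate
audio = np.squeeze(speech["audio"])
wavfile.write("speech.wav", rate=speech["sampling_rate"], data=audio)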
Text & Video: Generating short video clips from text prompts (Text-to-Video) - a rapidly advancing area.
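As a rough sketch, the diffusers library can drive an open text-to-video checkpoint; the damo-vilab/text-to-video-ms-1.7b model is assumed here, and the exact output format of the pipeline may vary across library versions:

# pip install diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a text-to-video pipeline (checkpoint name is an example)
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a short clip from a text prompt
prompt = "a panda surfing a wave at sunset"
result = pipe(prompt, num_inference_steps=25)
frames = result.frames[0]  # first (and only) video in the batch

# Write the frames out as an .mp4 file
export_to_video(frames, "panda_surfing.mp4")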
Image & Audio: Generating sound effects appropriate for an image or vice-versa.
Combined Modalities & Vision-RAG: Models like GPT-4V(ision) or Google’s Gemini can accept and reason about both text and image inputs simultaneously. This enables advanced applications like Vision-RAG (Retrieval-Augmented Generation), where relevant images are first retrieved based on a text query and then analyzed by a VLM to answer the query.
Here’s a conceptual overview using Cohere Embed v4 for retrieval and Google Gemini for answering:
import cohere
from google import genai
import numpy as np
from PIL import Image
import requests
from io import BytesIO
import base64

co = cohere.Client("api_key")
genai_client = genai.Client(api_key="api_key")

class SimpleImageDB:
    def __init__(self):
        self.images = []  # List of (image_url, embedding) tuples

    def add_image(self, url, embedding):
        self.images.append((url, embedding))

    def search(self, query_embedding, top_k=3):
        similarities = []
        for url, emb in self.images:
            # Cosine similarity
            similarity = np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
            similarities.append((url, similarity))

        # Sort by similarity (descending)
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]

image_db = SimpleImageDB()

example_images = [
    "https://example.com/1.jpg",
    "https://example.com/2.jpg",
]

def url_to_data_uri(image_url):
    """Convert image URL to data URI format required by Cohere API."""
    try:
        response = requests.get(image_url, timeout=10)
        response.raise_for_status()

        content_type = response.headers.get('content-type', '')
        if 'jpeg' in content_type or 'jpg' in content_type:
            mime_type = 'image/jpeg'
        elif 'png' in content_type:
            mime_type = 'image/png'
        else:
            mime_type = 'image/jpeg'

        image_b64 = base64.b64encode(response.content).decode('utf-8')

        data_uri = f"data:{mime_type};base64,{image_b64}"

        # Check size (5MB limit)
        if len(response.content) > 5 * 1024 * 1024:
            raise ValueError("Image size exceeds 5MB limit")

        return data_uri

    except Exception as e:
        raise Exception(f"Failed to process image URL: {e}")

for img_url in example_images:
    response = co.embed(
        texts=None,
        images=[url_to_data_uri(img_url)],
        model="embed-v4.0",  # Use appropriate model
        input_type="image"
    )
    image_db.add_image(img_url, response.embeddings[0])

def answer_visual_query(text_query):
    query_response = co.embed(
        texts=[text_query],
        model="embed-v4.0",
        input_type="text"
    )
    query_embedding = query_response.embeddings[0]

    relevant_images = image_db.search(query_embedding, top_k=2)

    image_objects = []
    for img_url, similarity in relevant_images:
        try:
            response = requests.get(img_url)
            img = Image.open(BytesIO(response.content))
            image_objects.append(img)
        except Exception as e:
            print(f"Error loading image {img_url}: {e}")

    if not image_objects:
        return "No relevant images found to answer your query."

    prompt = f"""
    I need you to analyze these images to answer the following question: {text_query}

    Provide a detailed answer based on what you see in the images.
    """

    response = genai_client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt] + image_objects
    )

    return response.text

# Example usage
answer = answer_visual_query("What breed of dog is in these images?")
print(answer)
Applications Across Industries
Multimodal AI is transforming various sectors by enabling new and powerful applications:
- Creative Arts & Design: Generating unique artwork, music, video storyboards, fashion designs.
- Marketing & Advertising: Creating ad copy with corresponding images/videos, personalized marketing content.
- Education & Training: Generating interactive learning materials combining text, visuals, and audio.
- Entertainment & Gaming: Creating dynamic game assets, virtual worlds, and interactive narratives.
- Accessibility: Tools for image description, real-time translation including sign language.
- Healthcare: Assisting in analyzing medical images (like X-rays) alongside patient notes.
- Robotics: Enabling robots to understand and interact with the world through vision and language.
Challenges in Multimodal AI
Despite its potential, developing and deploying multimodal AI comes with significant challenges:
- Data Requirements: Requires massive, high-quality datasets aligning different modalities, which can be difficult and expensive to create.
- Computational Cost: Training large multimodal models is even more computationally intensive than training LLMs.
- Evaluation Metrics: Defining and measuring the quality and coherence of generated multimodal content is challenging.
- Alignment and Coherence: Ensuring the generated content across different modalities is consistent and logically connected.
- Ethical Concerns: Similar to LLMs, but amplified; concerns about bias, deepfakes, misinformation spread through manipulated images/videos.
Advanced Multimodal Input Techniques
As multimodal AI capabilities continue to evolve, more sophisticated techniques for processing and responding to multimodal inputs are emerging. Here are some of the most promising developments:
Multimodal Chain-of-Thought Reasoning
Multimodal models can now perform complex reasoning across different input types, analyzing relationships between text, images, and other modalities to arrive at conclusions that would be impossible with single-modality models. For example:
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained("google/pix2struct-textcaps-large")
model = AutoModelForVision2Seq.from_pretrained("google/pix2struct-textcaps-large")

# Example showing multimodal reasoning
def reason_about_image(image_url, question):
    # Load image
    image = Image.open(requests.get(image_url, stream=True).raw)

    # Process with model, prompting for step-by-step reasoning in the text input
    inputs = processor(
        images=image,
        text=f"{question} Let me think through this step-by-step:",
        return_tensors="pt"
    )

    # Generate step-by-step reasoning
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=True,
        num_return_sequences=1,
        length_penalty=1.0
    )

    # Decode and return answer
    return processor.decode(outputs[0], skip_special_tokens=True)

# Example call
reasoning = reason_about_image(
    "https://example.com/complex_chart.jpg",
    "What's the relationship between the red and blue lines in this graph, and what conclusions can we draw?"
)
print(reasoning)
This approach allows models to show their thought process, making their conclusions more transparent and interpretable.
Zero-Shot Multimodal Learning
Zero-shot classification is the task of predicting a class that the model never saw during training. Because it leverages a large pre-trained model, it can be viewed as a form of transfer learning: a model trained for one task is applied to a different task than the one it was originally trained for. This is particularly useful when labeled data is scarce.
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import torch.nn.functional as F

def zero_shot_multimodal_learning_example():
    print("--- Zero-Shot Multimodal Learning Example (CLIP) ---")

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    example_image_path = "example-dog.jpg"
    example_image = Image.open(example_image_path).convert("RGB")

    candidate_labels = [
        "a photo of a cat",
        "a photo of a dog",
        "a photo of a car",
        "a photo of an airplane",
        "a photo of a flower"
    ]
    for label in candidate_labels:
        print(f"- {label}")

    inputs = processor(text=candidate_labels, images=example_image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)
        image_features = outputs.image_embeds
        text_features = outputs.text_embeds

    image_features = F.normalize(image_features, p=2, dim=-1)
    text_features = F.normalize(text_features, p=2, dim=-1)

    similarity_scores = (image_features @ text_features.T).squeeze(0)
    print("\nSimilarity scores (raw):")
    print(similarity_scores)

    probabilities = F.softmax(similarity_scores, dim=-1)
    print("\nProbabilities:")
    print(probabilities)

    predicted_index = torch.argmax(probabilities).item()
    predicted_label = candidate_labels[predicted_index]
    predicted_probability = probabilities[predicted_index].item()

    print(f"The image is predicted to be: '{predicted_label}'")
    print(f"With probability: {predicted_probability:.4f}")

if __name__ == "__main__":
    zero_shot_multimodal_learning_example()
This flexibility allows models to adapt to novel tasks and domains without needing to be retrained.
Foundational Multimodal Models
The most advanced multimodal models now serve as generalist AI systems that can handle virtually any combination of input modalities and perform a wide range of tasks:
from google import genai
from PIL import Image
import requests
from io import BytesIO

# Configure the API (you would need your API key)
client = genai.Client(api_key="GEMINI_API_KEY")

def process_multimodal_query(text_query, image_urls=None, audio_url=None):
    content_parts = [text_query]

    if image_urls:
        for url in image_urls:
            response = requests.get(url)
            image = Image.open(BytesIO(response.content))
            content_parts.append(image)

    if audio_url:
        # In a real implementation, you would process the audio file
        # and add it to content_parts in the format required by the model
        pass

    response = client.models.generate_content(model='gemini-2.0-flash', contents=content_parts)
    return response.text

response = process_multimodal_query(
    "Compare these two drawings. What period are they from, and what are the key stylistic differences?",
    image_urls=[
        "https://example.com/1.jpg",
        "https://example.com/2.jpg"
    ]
)
print(response)
These foundation models can seamlessly handle multiple images, text, and in some cases audio or video, providing comprehensive analysis across all provided content.
The Future of Multimodal AI
As we look toward the horizon of multimodal AI development, several key trends are emerging:
Even More Input Modalities: Future models will be able to process additional modalities like touch, smell (via chemical composition), and 3D spatial data.
Cross-Modal Transfer Learning: Models will become increasingly adept at transferring knowledge learned in one modality to improve performance in others.
Multimodal AI Agents: We’ll see the emergence of autonomous AI agents that can perceive the world through multiple senses and take actions across different modalities.
Environmental Understanding: Multimodal models will develop a more holistic understanding of the physical world and its rules, allowing for better reasoning about real-world scenarios.
Personalized Multimodal Interfaces: AI systems will adapt to individual users’ preferred communication modes, shifting seamlessly between text, voice, and visual interfaces.
Conclusion: Embracing the Multimodal Future
Multimodal AI represents a significant leap forward in our ability to create AI systems that perceive and interact with the world more like humans do. As these technologies mature, they will enable new forms of creative expression, more intuitive human-computer interaction, and powerful tools across industries. Organizations that begin exploring and integrating multimodal AI capabilities now will be well-positioned to leverage these advances as they continue to evolve.
At Quabyt, we’re excited about the potential of multimodal AI and are actively exploring ways to help our clients harness these capabilities to solve complex problems and create innovative experiences. Whether you’re looking to enhance customer engagement, streamline workflows, or develop entirely new products, the multimodal AI revolution offers unprecedented opportunities for innovation.