Beyond Text: Exploring the World of Multimodal Generative AI

Generative AI is moving beyond text. Discover multimodal AI, how it combines text, images, audio, and video, and its transformative potential across industries.

Jun 15, 2025
15 min read

2025 is becoming a key year for AI in business, and multimodal learning is the main reason why. Unlike older models that only handle text, multimodal AI works with images, video, and audio, too. This allows it to understand context much more deeply, leading to smarter, more accurate, and more natural-feeling results.

For businesses, this opens the door to analyzing complex, mixed-media data, simplifying workflows, and making AI-driven insights easier for everyone to use. It’s quickly becoming an essential tool for modern operations.

Here are a few key benefits of this approach:

  • Enhanced Contextual Understanding: By processing diverse data types (images, video, audio, text), these models achieve a deeper contextual grasp, akin to human intuition. This leads to more accurate interpretations and relevant responses.
  • Rich, Personalized Outputs: Learning from multiple modalities allows for outputs that are not just precise but also highly customized, fostering more natural and intuitive user interactions.
  • Advanced Data Applications: This technology empowers businesses to analyze complex mixed-media data, streamline multi-format workflows, and broaden access to AI-driven insights.
  • Creative Empowerment: Artists, writers, and creators can use these tools to bring their ideas to life in new ways, from instantly illustrating stories to animating concepts.
  • Improved Accessibility: Multimodal models can make technology more inclusive by describing images for users with visual impairments or translating sign language in real time.

Multimodal Input: The Next Leap in AI Interaction

It’s not just about what AI can create—it’s also about how we interact with it. The latest systems can take in multiple types of input at once, making conversations with AI feel more like talking to another person.

Understanding Multimodal Input Capabilities

Modern multimodal LLMs like GPT-4V, Claude 3 Opus, and Gemini Pro can now process:

  • Images alongside text queries: “What’s in this image?” or “Can you describe what’s wrong with this circuit diagram?”
  • Audio alongside text: “Translate this speech recording” or “What emotions do you detect in this audio clip?”
  • Video understanding: “Summarize what happens in this video clip” or “Identify the key events in this recording”

This enables a much more flexible and powerful way to interact with AI systems, one that better matches how humans naturally communicate. Here is a Python snippet demonstrating how a multimodal model like CLIP can be used to analyze an image against a set of candidate text descriptions.

# pip install transformers torch pillow requests
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import requests

# Load a CLIP model (this can be fine-tuned on medical data)
model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Example image URL
image_url = "https://example.com/chest_xray.jpg"
try:
    response = requests.get(image_url, stream=True)
    response.raise_for_status()
    image = Image.open(response.raw)
except Exception as e:
    print(f"Error loading image: {e}")
    image = Image.new('RGB', (224, 224), color='gray')

# Define possible medical findings
medical_texts = [
    "a normal chest X-ray with clear lungs",
    "chest X-ray showing pneumonia",
    "chest X-ray with pleural effusion",
    "chest X-ray showing pneumothorax",
    "chest X-ray with enlarged heart",
    "chest X-ray showing pulmonary edema",
    "chest X-ray with lung collapse",
    "chest X-ray showing lung nodules",
    "chest X-ray with lung consolidation",
    "chest X-ray showing emphysema"
]

inputs = processor(
    text=medical_texts,
    images=image,
    return_tensors="pt",
    padding=True
)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probabilities = torch.softmax(logits_per_image, dim=-1)

# Get top predictions
top_indices = torch.argsort(probabilities, descending=True)[0]

print("Medical findings analysis for chest X-ray:")
print("-" * 50)
for i in range(min(5, len(medical_texts))):
    idx = top_indices[i]
    finding = medical_texts[idx]
    confidence = probabilities[0][idx].item()
    print(f"{finding}: {confidence:.3f} ({confidence*100:.1f}%)")

print(f"\nMost likely finding: {medical_texts[top_indices[0]]}")
print(f"Confidence: {probabilities[0][top_indices[0]].item()*100:.1f}%")

Real-Time Multimodal Communication

One of the most exciting developments is the ability to create real-time multimodal communication systems. Libraries like FastRTC are making it easier to build applications that can process audio, video, and text in real time:

# Example of a voice chat system using FastRTC with Gemini
from fastrtc import Stream, ReplyOnPause, get_stt_model, get_tts_model
from google import genai

stt_model = get_stt_model()
tts_model = get_tts_model()
client = genai.Client(api_key="GEMINI_API_KEY")

def voice_llm_chat(audio):
    # Convert speech to text
    user_message = stt_model.stt(audio)
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=user_message,
        config={"max_output_tokens": 200},
    )
    # Convert response to speech
    ai_response = response.text
    for audio_chunk in tts_model.stream_tts_sync(ai_response):
        yield audio_chunk

# Create and launch the stream
stream = Stream(ReplyOnPause(voice_llm_chat), modality="audio", mode="send-receive")
stream.ui.launch()

This example demonstrates how easily modern tools can create voice-based AI assistants that understand speech, process it with an LLM, and respond with synthesized speech, all in real time.

Multimodal Context Windows and Chain-of-Thought

The latest multimodal models can maintain context across different modalities. For example:

A user uploads a complex diagram and asks clarifying questions; the AI provides text responses, and the user can continue the conversation while still referencing the original diagram.

This creates a much more dynamic interaction than was previously possible:

# Conceptual example of a multimodal conversation
conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "Can you explain what's happening in this circuit?"},
        {"type": "image", "image_url": "https://example.com/circuit_diagram.jpg"}
    ]},
    {"role": "assistant", "content": "This appears to be an amplifier circuit using a transistor. The input signal comes in through the capacitor C1 on the left."},
    {"role": "user", "content": "What would happen if I increased R2's resistance?"},
    {"role": "assistant", "content": "If you increase R2 (the resistor connected to the collector), you would increase the gain of the amplifier but potentially reduce the maximum output swing due to the voltage drop across R2."}
]

Video Understanding and Processing

Recent advancements have also enabled AI systems to understand video content, which combines spatial understanding (images) with temporal understanding (how things change over time):

import os
import cv2
import numpy as np
import torch
from transformers import AutoProcessor, AutoModelForVideoClassification

model_name = "MCG-NJU/videomae-base-finetuned-ssv2"
# Load model for video classification
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVideoClassification.from_pretrained(model_name)

# Ensure model is on GPU if available
if torch.cuda.is_available():
    model.to("cuda")

# Function to process video
def analyze_video(video_path, sample_frames=16):
    if not os.path.exists(video_path):
        print(f"Error: Video file not found at {video_path}")
        return []
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Error: Could not open video file at {video_path}")
        return []
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if frame_count == 0:
        print(f"Error: No frames found in video at {video_path}")
        cap.release()
        return []

    actual_sample_count = min(sample_frames, frame_count)
    indices = np.linspace(0, frame_count - 1, actual_sample_count, dtype=int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            # Convert BGR (OpenCV default) to RGB
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame)
    cap.release()

    if not frames:
        print(f"Warning: No frames extracted from {video_path}.")
        return []

    if len(frames) < sample_frames:
        while len(frames) < sample_frames:
            frames.append(frames[-1])
    elif len(frames) > sample_frames:
        frames = frames[:sample_frames]

    inputs = processor(images=frames, return_tensors="pt")
    # Move inputs to GPU if available
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    k = min(3, probs.shape[-1])
    top_prob, top_label_indices = torch.topk(probs, k=k)
    predicted_labels = []
    for prob, label_idx in zip(top_prob[0], top_label_indices[0]):
        predicted_labels.append((model.config.id2label[label_idx.item()], prob.item()))
    return predicted_labels

results = analyze_video("painting.mp4")
if results:
    print("\n--- Video Classification Results ---")
    for label, confidence in results:
        print(f"{label}: {confidence:.4f}")
else:
    print("Could not analyze video or no results obtained.")

With models like Gemini Pro, SmolVLM2, and others, developers can now create applications that understand video content, answer questions about it, and generate descriptions—expanding the frontier of AI capabilities beyond static images.
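
For example, with the Gemini API you can upload a video file and ask questions about it directly. The sketch below is a rough illustration using the google-genai SDK's Files API; the upload parameters and processing-wait step can vary between SDK versions, and painting.mp4 is just a placeholder file name.

# Rough sketch: asking Gemini a question about a local video via the Files API.
# Parameter names (e.g., file=) may differ slightly across google-genai SDK versions.
from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

# Upload the video (placeholder path). Longer videos are processed asynchronously,
# so you may need to poll client.files.get(name=...) until the file is ready.
video_file = client.files.upload(file="painting.mp4")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[video_file, "Summarize what happens in this video and list the key events in order."]
)
print(response.text)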

Key Modalities and Examples

  • Text & Image: The most mature area, with powerful text-to-image generation and image captioning models. Here’s a simplified example using the Hugging Face diffusers library to generate an image from a text prompt with Stable Diffusion:

    from diffusers import StableDiffusionPipeline
    import torch
    # Load model
    model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda") # Move to GPU
    # Generate image from text
    prompt = "A city after an apocalypse where trees are growing on buildings most buildings are destroyed, futuristic architecture, " \
    "overgrown nature, vibrant colors"
    image = pipe(prompt).images[0]
    # Save the image
    image.save("futuristic_city.png")

And here’s a simplified example of image captioning using the BLIP model:

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

# Load model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load image (from URL or local file)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # Example: an image of cats
try:
    # For URL
    image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    # For local file:
    # image = Image.open("path/to/your/image.jpg").convert('RGB')

    # Prepare inputs
    inputs = processor(image, return_tensors="pt")

    # Generate caption
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)
    print(f"Generated caption: {caption}")
except Exception as e:
    print(f"Error: {e}")
    print("Could not generate caption.")

  • Text & Audio: Generating speech from text (Text-to-Speech), generating music or sound effects from text (Text-to-Audio).

  • Text & Video: Generating short video clips from text prompts (Text-to-Video) - a rapidly advancing area.

  • Image & Audio: Generating sound effects appropriate for an image or vice-versa.

  • Combined Modalities & Vision-RAG: Models like GPT-4V(ision) or Google’s Gemini can accept and reason about both text and image inputs simultaneously. This enables advanced applications like Vision-RAG (Retrieval-Augmented Generation), where relevant images are first retrieved based on a text query and then analyzed by a VLM to answer the query.

    Here’s a conceptual overview using Cohere Embed v4 for retrieval and Google Gemini for answering:

import cohere
from google import genai
import numpy as np
from PIL import Image
import requests
from io import BytesIO
import base64

co = cohere.Client("api_key")
genai_client = genai.Client(api_key="api_key")

class SimpleImageDB:
    def __init__(self):
        self.images = []  # List of (image_url, embedding) tuples

    def add_image(self, url, embedding):
        self.images.append((url, embedding))

    def search(self, query_embedding, top_k=3):
        similarities = []
        for url, emb in self.images:
            # Cosine similarity
            similarity = np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
            similarities.append((url, similarity))
        # Sort by similarity (descending)
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]

image_db = SimpleImageDB()
example_images = [
    "https://example.com/1.jpg",
    "https://example.com/2.jpg",
]

def url_to_data_uri(image_url):
    """Convert image URL to data URI format required by Cohere API."""
    try:
        response = requests.get(image_url, timeout=10)
        response.raise_for_status()
        content_type = response.headers.get('content-type', '')
        if 'jpeg' in content_type or 'jpg' in content_type:
            mime_type = 'image/jpeg'
        elif 'png' in content_type:
            mime_type = 'image/png'
        else:
            mime_type = 'image/jpeg'
        image_b64 = base64.b64encode(response.content).decode('utf-8')
        data_uri = f"data:{mime_type};base64,{image_b64}"
        # Check size (5MB limit)
        if len(response.content) > 5 * 1024 * 1024:
            raise ValueError("Image size exceeds 5MB limit")
        return data_uri
    except Exception as e:
        raise Exception(f"Failed to process image URL: {e}")

for img_url in example_images:
    response = co.embed(
        texts=None,
        images=[url_to_data_uri(img_url)],
        model="embed-v4.0",  # Use appropriate model
        input_type="image"
    )
    image_db.add_image(img_url, response.embeddings[0])

def answer_visual_query(text_query):
    query_response = co.embed(
        texts=[text_query],
        model="embed-v4.0",
        input_type="text"
    )
    query_embedding = query_response.embeddings[0]
    relevant_images = image_db.search(query_embedding, top_k=2)

    image_objects = []
    for img_url, similarity in relevant_images:
        try:
            response = requests.get(img_url)
            img = Image.open(BytesIO(response.content))
            image_objects.append(img)
        except Exception as e:
            print(f"Error loading image {img_url}: {e}")

    if not image_objects:
        return "No relevant images found to answer your query."

    prompt = f"""
I need you to analyze these images to answer the following question:
{text_query}
Provide a detailed answer based on what you see in the images.
"""
    response = genai_client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt] + image_objects
    )
    return response.text

# Example usage
answer = answer_visual_query("What breed of dog is in these images?")
print(answer)

Applications Across Industries

Multimodal AI is transforming various sectors by enabling new and powerful applications:

  • Creative Arts & Design: Generating unique artwork, music, video storyboards, fashion designs.
  • Marketing & Advertising: Creating ad copy with corresponding images/videos, personalized marketing content.
  • Education & Training: Generating interactive learning materials combining text, visuals, and audio.
  • Entertainment & Gaming: Creating dynamic game assets, virtual worlds, and interactive narratives.
  • Accessibility: Tools for image description, real-time translation including sign language.
  • Healthcare: Assisting in analyzing medical images (like X-rays) alongside patient notes.
  • Robotics: Enabling robots to understand and interact with the world through vision and language.

Challenges in Multimodal AI

Despite its potential, developing and deploying multimodal AI comes with significant challenges:

  • Data Requirements: Requires massive, high-quality datasets aligning different modalities, which can be difficult and expensive to create.
  • Computational Cost: Training large multimodal models is even more computationally intensive than training LLMs.
  • Evaluation Metrics: Defining and measuring the quality and coherence of generated multimodal content is challenging; a rough CLIP-based proxy is sketched after this list.
  • Alignment and Coherence: Ensuring the generated content across different modalities is consistent and logically connected.
  • Ethical Concerns: Similar to LLMs, but amplified; concerns about bias, deepfakes, misinformation spread through manipulated images/videos.
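
On the evaluation point above: there is no single agreed-upon metric, but one rough proxy for text-image alignment is the similarity score from a CLIP-style model. The sketch below reuses the openai/clip-vit-base-patch32 checkpoint from the earlier examples; candidate_a.png and candidate_b.png are placeholder file names, and the scores should be read as relative signals rather than absolute quality judgments.

# A rough CLIP-based alignment score for a generated image vs. its prompt (a proxy, not a full metric)
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image_path, prompt):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the image and text embeddings
    img = F.normalize(outputs.image_embeds, dim=-1)
    txt = F.normalize(outputs.text_embeds, dim=-1)
    return (img @ txt.T).item()

# Compare two candidate generations for the same prompt (placeholder file names)
prompt = "A city after an apocalypse where trees are growing on buildings"
for path in ["candidate_a.png", "candidate_b.png"]:
    print(path, clip_alignment_score(path, prompt))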

Advanced Multimodal Input Techniques

As multimodal AI capabilities continue to evolve, more sophisticated techniques for processing and responding to multimodal inputs are emerging. Here are some of the most promising developments:

Multimodal Chain-of-Thought Reasoning

Multimodal models can now perform complex reasoning across different input types, analyzing relationships between text, images, and other modalities to arrive at conclusions that would be impossible with single-modality models. For example:

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained("google/pix2struct-textcaps-large")
model = AutoModelForVision2Seq.from_pretrained("google/pix2struct-textcaps-large")

# Example showing multimodal reasoning
def reason_about_image(image_url, question):
    # Load image
    image = Image.open(requests.get(image_url, stream=True).raw)
    # Process with model; a step-by-step cue in the text prompt encourages more explicit reasoning
    prompt = f"Let me think through this step-by-step: {question}"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    # Generate step-by-step reasoning
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=True,
        num_return_sequences=1,
        length_penalty=1.0
    )
    # Decode and return answer
    return processor.decode(outputs[0], skip_special_tokens=True)

# Example call
reasoning = reason_about_image(
    "https://example.com/complex_chart.jpg",
    "What's the relationship between the red and blue lines in this graph, and what conclusions can we draw?"
)
print(reasoning)

This approach allows models to show their thought process, making their conclusions more transparent and interpretable.

Zero-Shot Multimodal Learning

Zero-shot classification is the task of predicting a class that the model never saw during training. Because it relies on the general knowledge of a large pre-trained model, it can be thought of as a form of transfer learning, which broadly means using a model trained for one task in a different application than the one it was originally built for. This is particularly useful when labeled data is scarce.

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import torch.nn.functional as F

def zero_shot_multimodal_learning_example():
    print("--- Zero-Shot Multimodal Learning Example (CLIP) ---")
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    example_image_path = "example-dog.jpg"
    example_image = Image.open(example_image_path).convert("RGB")

    candidate_labels = [
        "a photo of a cat",
        "a photo of a dog",
        "a photo of a car",
        "a photo of an airplane",
        "a photo of a flower"
    ]
    for label in candidate_labels:
        print(f"- {label}")

    inputs = processor(text=candidate_labels, images=example_image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    image_features = outputs.image_embeds
    text_features = outputs.text_embeds
    image_features = F.normalize(image_features, p=2, dim=-1)
    text_features = F.normalize(text_features, p=2, dim=-1)

    similarity_scores = (image_features @ text_features.T).squeeze(0)
    print("\nSimilarity scores (raw):")
    print(similarity_scores)

    probabilities = F.softmax(similarity_scores, dim=-1)
    print("\nProbabilities:")
    print(probabilities)

    predicted_index = torch.argmax(probabilities).item()
    predicted_label = candidate_labels[predicted_index]
    predicted_probability = probabilities[predicted_index].item()
    print(f"The image is predicted to be: '{predicted_label}'")
    print(f"With probability: {predicted_probability:.4f}")

if __name__ == "__main__":
    zero_shot_multimodal_learning_example()

This flexibility allows models to adapt to novel tasks and domains without needing to be retrained.

Foundational Multimodal Models

The most advanced multimodal models now serve as generalist AI systems that can handle virtually any combination of input modalities and perform a wide range of tasks:

from google import genai
from PIL import Image
import requests
from io import BytesIO

# Configure the API (you would need your own API key)
client = genai.Client(api_key="GEMINI_API_KEY")

def process_multimodal_query(text_query, image_urls=None, audio_url=None):
    content_parts = [text_query]
    if image_urls:
        for url in image_urls:
            response = requests.get(url)
            image = Image.open(BytesIO(response.content))
            content_parts.append(image)
    if audio_url:
        # In a real implementation, you would process the audio file
        # and add it to content_parts in the format required by the model
        pass
    response = client.models.generate_content(model='gemini-2.0-flash', contents=content_parts)
    return response.text

response = process_multimodal_query(
    "Compare these two drawings. What period are they from, and what are the key stylistic differences?",
    image_urls=[
        "https://example.com/1.jpg",
        "https://example.com/2.jpg"
    ]
)
print(response)

These foundation models can seamlessly handle multiple images, text, and in some cases audio or video, providing comprehensive analysis across all provided content.

The Future of Multimodal AI

As we look toward the horizon of multimodal AI development, several key trends are emerging:

  1. Even More Input Modalities: Future models will be able to process additional modalities like touch, smell (via chemical composition), and 3D spatial data.

  2. Cross-Modal Transfer Learning: Models will become increasingly adept at transferring knowledge learned in one modality to improve performance in others.

  3. Multimodal AI Agents: We’ll see the emergence of autonomous AI agents that can perceive the world through multiple senses and take actions across different modalities.

  4. Environmental Understanding: Multimodal models will develop a more holistic understanding of the physical world and its rules, allowing for better reasoning about real-world scenarios.

  5. Personalized Multimodal Interfaces: AI systems will adapt to individual users’ preferred communication modes, shifting seamlessly between text, voice, and visual interfaces.

Conclusion: Embracing the Multimodal Future

Multimodal AI represents a significant leap forward in our ability to create AI systems that perceive and interact with the world more like humans do. As these technologies mature, they will enable new forms of creative expression, more intuitive human-computer interaction, and powerful tools across industries. Organizations that begin exploring and integrating multimodal AI capabilities now will be well-positioned to leverage these advances as they continue to evolve.

At Quabyt, we’re excited about the potential of multimodal AI and are actively exploring ways to help our clients harness these capabilities to solve complex problems and create innovative experiences. Whether you’re looking to enhance customer engagement, streamline workflows, or develop entirely new products, the multimodal AI revolution offers unprecedented opportunities for innovation.
