· 5 min read

Multimodal Models Are the New Default: GPT-4V, Gemini, and Beyond

multimodal · llm · survey

I'm finishing my MS this month. It's December 2024, and I wanted to take stock of something that happened quietly over the past year: multimodal AI went from a research direction to the default.

GPT-4V, Gemini, Claude, LLaVA, Qwen-VL. The leading models don't just process text anymore. They see images, interpret diagrams, analyze charts, and in some cases handle audio and video natively. For someone who spent years working in computer vision, this convergence feels like the most significant shift in the field since transformers replaced CNNs.

Multimodal model architecture: separate modality encoders feed into a unified transformer that reasons across text, vision, and audio simultaneously.

The State of Multimodal in 2024

GPT-4V demonstrated that a language model could reason about images with genuine understanding. Not just object detection or classification, but compositional reasoning: "explain why this meme is funny" or "what's wrong with this circuit diagram." That's a qualitative leap from anything CLIP offered when it first connected text and images.

Gemini launched with native multimodal capabilities, processing text, images, audio, and video within a single architecture. Google's bet is that separate models for separate modalities are a dead end, and I think they're right.

Open-source caught up fast. LLaVA showed that visual instruction tuning on top of open language models could produce competitive vision-language capabilities. Qwen-VL, CogVLM, and others followed. The gap between proprietary and open-source multimodal models narrowed dramatically through 2024.

Claude added vision. With the Claude 3 family in March 2024, Anthropic gave Claude image understanding, making it another strong contender in the multimodal space, with a focus on reliability and nuance in visual reasoning.

What This Means for Applications

Use cases that were impossible two years ago are now straightforward.

Document understanding at scale. Upload a complex PDF with charts, tables, and figures, and get a coherent summary that integrates all modalities. Financial reports, medical records, engineering schematics. This used to require specialized OCR pipelines, layout detection, figure extraction, and text processing. Now a single model handles it.
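To make the pipeline collapse concrete, here is a minimal sketch of what "a single model handles it" looks like in code. The message schema follows OpenAI's vision API as it stood in late 2024; the model name and the placeholder image bytes are illustrative, and other providers use similar but not identical shapes.

```python
import base64
import json

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "gpt-4o") -> dict:
    """Build an OpenAI-style chat payload pairing an image with a text
    prompt. One request replaces the old OCR + layout detection +
    figure extraction + text processing pipeline."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# One chart page from a report, one question -- no specialized stages.
payload = build_vision_request(
    b"<png bytes of a revenue chart>",  # placeholder, not a real image
    "Summarize the trend in this chart and flag any anomalies.",
)
print(json.dumps(payload, indent=2)[:80])
```

The point is not the specific API but the shape of the interface: the image and the question travel together in one message, and the cross-modal reasoning happens inside the model rather than in glue code.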

Visual question answering that actually works. Point a camera at a math problem, a broken appliance, or a diseased plant, and get a useful explanation. Not a label, not a bounding box, but actual reasoning about what's in the image and what to do about it.

Creative workflows transformed. Designers can iterate with AI that understands both their text descriptions and visual references simultaneously. The feedback loop between ideation and execution tightened considerably.

The CV Perspective

I've spent a significant portion of my career in computer vision. CNNs for super-resolution, object detection for industrial inspection, edge deployment of vision models. The field I trained in is being absorbed into something larger.

That's not a loss. It's a convergence. The techniques I learned (attention mechanisms, feature extraction, spatial reasoning) are now components of models that also understand language. The skills transfer directly. But the framing has shifted: "computer vision" as a standalone discipline is merging with NLP and audio processing into a unified field of multimodal AI.

Vision Transformers started this convergence by showing that the transformer architecture, originally built for text, processes images just as well. Multimodal foundation models completed it by training on all modalities simultaneously.
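The ViT trick is simple enough to sketch in a few lines: cut the image into non-overlapping patches and flatten each one into a vector, so the image becomes a token sequence like any sentence. This is a minimal numpy illustration; a real ViT adds a learned linear projection, a class token, and positional embeddings on top.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the 'words' a Vision Transformer feeds to the same attention stack
    that processes text tokens. H and W must be multiples of `patch`."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)  # (num_patches, patch_dim)

# A 224x224 RGB image becomes 196 tokens of dimension 768 --
# the same (sequence, embedding) shape a text transformer consumes.
tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

Once images and text share this sequence-of-vectors representation, training one model on both modalities stops being an architectural problem and becomes a data problem.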

The Bigger Picture

We're moving toward unified foundation models that understand the world the way humans do: through multiple sensory channels processed together. Text, images, audio, video, and eventually other modalities like 3D and tactile data, all in one model.

The implications go beyond capability: this shift changes how we architect AI systems. Instead of orchestrating five specialized models with a complex pipeline, you use one model that natively handles everything. That's simpler to build, easier to maintain, and often produces better results, because the model can reason across modalities instead of stitching together separate analyses.

As I close out my MS and look at where the field is headed, this feels like the most consequential trend of 2024. Not because any single multimodal model is perfect, but because the direction is now irreversible. The future of AI is multimodal by default. The era of text-only models is already behind us.