
Vision Transformers Are Coming for CNNs

computer-vision · transformers · deep-learning

Last month, Google published "An Image is Worth 16x16 Words" and introduced ViT (the Vision Transformer), a model that applies the transformer architecture from NLP directly to images. No convolutions. No pooling. Just self-attention on image patches.

And it works. Really well. I read the paper twice because I wasn't sure I believed it the first time.

[Figure: Comparing CNN and Vision Transformer architectures. CNNs use hierarchical convolution and pooling, while ViTs split images into patches and apply self-attention.]

How ViT Works

The idea is surprisingly simple:

  1. Split the image into patches. A 224×224 image becomes a sequence of 196 patches (a 14×14 grid of 16×16-pixel patches).
  2. Flatten and embed each patch. Each 16×16×3 patch becomes a 768-dimensional vector via a linear projection.
  3. Add positional embeddings. So the model knows where each patch was in the original image.
  4. Run through a standard transformer encoder. Multi-head self-attention, layer normalization, MLP blocks. Same architecture as BERT.
  5. Classify from the CLS token. Prepend a learnable classification token to the patch sequence; its final-layer output feeds the classification head.

That's it. No inductive bias about local patterns, no hierarchical feature extraction. Just attention over patches.
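The five steps above can be sketched end to end in a few lines of numpy. This is a shape-level illustration with random weights and a single attention head, not the paper's implementation: real ViT-Base uses 12 transformer blocks with 12 heads each, plus layer norm and MLP blocks that I omit here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# ViT-Base-style dimensions: 224x224 RGB image, 16x16 patches, 768-dim embeddings
img = rng.standard_normal((224, 224, 3))
P, D = 16, 768

# 1. Split into patches: (224/16)^2 = 196 patches of 16x16x3
patches = img.reshape(14, P, 14, P, 3).transpose(0, 2, 1, 3, 4).reshape(196, P * P * 3)

# 2. Flatten and embed each patch via a linear projection (random weights here)
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02
tokens = patches @ W_embed                        # (196, 768)

# 3. Prepend the CLS token and add positional embeddings (learned in practice)
cls = rng.standard_normal((1, D)) * 0.02
pos = rng.standard_normal((197, D)) * 0.02
x = np.concatenate([cls, tokens], axis=0) + pos   # (197, 768)

# 4. One self-attention layer (single head; real ViT stacks many blocks)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(D)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)          # softmax over the 197 tokens
out = attn @ v                                    # (197, 768)

# 5. Classify from the CLS token's output (1000 ImageNet classes)
W_head = rng.standard_normal((D, 1000)) * 0.02
logits = out[0] @ W_head                          # (1000,)

print(patches.shape, x.shape, logits.shape)
```

The only image-specific pieces are the patchify reshape and the positional embeddings; everything downstream is the same machinery BERT uses on word tokens.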

Why This Matters

CNNs have been the default for computer vision since AlexNet in 2012. Eight years of convolutions, pooling, skip connections, and ever-deeper architectures. The entire CV toolbox is built around CNNs.

ViT suggests that maybe convolutions aren't necessary. Given enough data (ViT was pre-trained on JFT-300M, a 300-million image dataset), the transformer can learn to extract visual features purely through attention.

The implication is that one architecture can handle both text and images. Unified models that process any modality through the same mechanism. That's a big deal for multimodal AI. OpenAI's CLIP is already proving this by connecting vision and language in a single embedding space.

The Catch

ViT needs a LOT of data. When trained only on ImageNet (1.3M images), ViT underperforms comparable CNNs. The transformer's lack of inductive bias means it needs more examples to learn what CNNs get for free through the convolution operation.

So this isn't "CNNs are dead." It's "with enough data, transformers match or beat CNNs, and they're more flexible."

What I'm Watching

Data efficiency. Can we get ViT-level performance without 300M images? DeiT (Data-efficient Image Transformers) from Facebook is already making progress, using distillation and data augmentation instead of massive datasets.

Hybrid models. Using convolutions in early layers to extract local features, then switching to transformer blocks for global reasoning. Best of both worlds.

Self-supervised pre-training. BERT-style pre-training for vision: mask patches and predict them. If this works well, we'd have a truly general-purpose pre-training paradigm.
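The masking setup is easy to picture in numpy. This is a speculative sketch of the data-corruption step only, with an assumed 40% mask ratio and a zero mask token (both would be tuned/learned in a real system); the encoder and reconstruction loss are left as comments.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 196, 768
patch_tokens = rng.standard_normal((num_patches, dim))  # embedded patches

# Mask a random subset of patches (40% is a guess at a reasonable ratio)
mask = rng.random(num_patches) < 0.4
mask_token = np.zeros(dim)            # would be a learned vector in practice

corrupted = patch_tokens.copy()
corrupted[mask] = mask_token          # encoder sees visible patches + mask slots

# Training objective (not implemented here): run `corrupted` through the
# encoder and regress the masked positions back to the original patches
# (or raw pixels), BERT-style.
targets = patch_tokens[mask]
```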

Edge deployment. ViTs are attention-heavy, and self-attention cost scales as O(n²) in sequence length. Running this efficiently on mobile is an open problem. For now, MobileNet variants are still more practical for on-device work.
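The quadratic cost is easy to make concrete: the attention matrix has one entry per pair of tokens, so doubling image resolution quadruples the patch count and inflates attention memory and compute by roughly 16x. A quick back-of-the-envelope helper (patch size 16 and the +1 CLS token as in ViT):

```python
def attention_matrix_size(img_size: int, patch: int = 16) -> int:
    """Entries in one attention matrix for a square image at this resolution."""
    n = (img_size // patch) ** 2 + 1   # patches per image, +1 for the CLS token
    return n * n

# 224px -> 197 tokens -> ~39k entries; 448px -> 785 tokens -> ~616k entries
for size in (224, 384, 448):
    print(size, attention_matrix_size(size))
```

And that is per head, per layer, per image: ViT-Base has 12 heads across 12 blocks, which is why high-resolution inputs get expensive fast.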

My Take

As someone who works primarily in computer vision, ViT feels like a paradigm shift: not because it's immediately better than CNNs for practical applications (it isn't, yet), but because it shows that the architectural walls between NLP and CV are coming down.

The trend is clear: transformers are eating everything. Language, vision, audio, protein structures. The question isn't whether transformers will dominate CV. It's how quickly the practical concerns get solved.