· 4 min read

Computer Vision in 2022: The Year Transformers Won

computer-vision · transformers · survey

I wrote about Vision Transformers back in late 2020 when the original ViT paper dropped. At the time, my take was cautious: transformers are interesting for vision, but CNNs still have the edge for practical use, especially on-device. Two years later, I was wrong. Not completely wrong, but wrong enough that it's worth writing about.

2022 is the year transformers became the default backbone for computer vision. This is personal for me, because I spent two years at Myelin building CNN-based models, and now as I'm starting grad school, the entire landscape has shifted.

The Papers That Changed Things

Swin Transformer (2021, but the effects hit in 2022). Liu et al. at Microsoft introduced shifted windows, solving ViT's biggest practical problem. Original ViT applied global self-attention to all patches, which is O(n^2) and doesn't scale to high-resolution images. Swin computes attention within local windows, then shifts those windows between layers to enable cross-window connections. Linear complexity. Hierarchical features. Suddenly, transformers could be used as general-purpose backbones for detection and segmentation, not just classification.
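To make the window trick concrete, here's a minimal numpy sketch of the two core operations: partitioning a feature map into non-overlapping windows (attention is then computed per window), and cyclically shifting the map so the next layer's windows straddle the previous layer's boundaries. This is an illustration of the idea, not Swin's actual implementation; the function names and the toy 8x8 map are mine.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns (num_windows, window_size*window_size, C). Self-attention is
    computed inside each window independently, so cost grows linearly with
    the number of windows instead of quadratically with H*W.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

def cyclic_shift(x, shift):
    """Roll the map so the next layer's windows cross the previous
    window boundaries (the 'shifted windows' in Swin)."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

# Toy 8x8 feature map with 1 channel, partitioned into 4x4 windows.
feat = np.arange(64, dtype=np.float32).reshape(8, 8, 1)
wins = window_partition(feat, 4)            # 4 windows of 16 tokens each
shifted = cyclic_shift(feat, 2)             # shift by window_size // 2
wins_shifted = window_partition(shifted, 4) # these windows mix the old ones
print(wins.shape)  # (4, 16, 1)
```

Alternating regular and shifted windows every other layer is what lets information propagate across the whole image while keeping each attention computation local.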

Swin V2 pushed this further with bigger models (3 billion parameters) and higher resolution (1536x1536). The results on COCO and ADE20K were state-of-the-art by a significant margin.

DeiT (Data-efficient Image Transformers). Touvron et al. at Meta showed you don't need 300 million images to train a vision transformer. Using knowledge distillation from a CNN teacher, strong data augmentation, and careful regularization, DeiT matched ViT performance training only on ImageNet. This was the paper that made ViTs practical for people without Google-scale data.
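The distillation setup is simple to state: alongside the usual class token supervised by ground-truth labels, DeiT adds a distillation token supervised by the CNN teacher's hard (argmax) predictions, and averages the two losses. A rough numpy sketch of that objective, with made-up toy logits (the helper names are mine, not DeiT's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """DeiT-style hard distillation: the class token learns from ground
    truth, the distillation token learns from the teacher's argmax, and
    the two cross-entropies are averaged."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return 0.5 * cross_entropy(cls_logits, labels) + \
           0.5 * cross_entropy(dist_logits, teacher_labels)

# Toy batch: 2 samples, 3 classes.
cls_out = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
dist_out = np.array([[1.8, 0.4, 0.2], [0.1, 1.7, 0.2]])
labels = np.array([0, 1])
teacher = np.array([[3.0, 0.1, 0.1], [0.2, 0.1, 2.5]])  # teacher picks 0, then 2
loss = hard_distillation_loss(cls_out, dist_out, labels, teacher)
```

The interesting design choice is using hard labels from the teacher rather than soft probabilities; the paper found this worked better for transformer students.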

BEiT and MAE. Self-supervised pre-training arrived for vision. BERT-style masked image modeling, where you mask patches and train the model to reconstruct them, turned out to work brilliantly. MAE (Masked Autoencoders) from Meta showed that masking 75% of patches and reconstructing them produces excellent representations. This mirrors what happened in NLP: self-supervised pre-training unlocked the real potential of the architecture.
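The masking step itself is almost embarrassingly simple. A minimal numpy sketch of MAE-style random masking, where only the visible 25% of patch tokens are passed to the encoder (function name and shapes are my illustration, not the MAE codebase):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random (1 - mask_ratio) fraction
    of patch tokens; the boolean mask records which positions the decoder
    must reconstruct."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # positions of visible patches
    visible = patches[keep_idx]
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False              # True = masked, to be reconstructed
    return visible, mask, keep_idx

# A 196-token sequence (14x14 patches of a 224x224 image), 768-dim tokens.
tokens = np.random.default_rng(1).normal(size=(196, 768))
visible, mask, keep_idx = random_masking(tokens)
print(visible.shape)  # (49, 768): only 25% of patches reach the encoder
```

That 75% reduction in encoder input is also why MAE pre-training is cheap: the heavy encoder only ever sees a quarter of the tokens, and a lightweight decoder handles reconstruction.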

Why CNNs Lost Ground

The thing is, CNNs didn't suddenly get worse. ResNets, EfficientNets, and ConvNeXt are still excellent. ConvNeXt in particular showed that if you "modernize" a ResNet with transformer-era design choices (larger kernels, LayerNorm, fewer activation functions), CNNs can match Swin performance.

But the ecosystem shifted. New papers started defaulting to transformer backbones. Benchmark leaderboards filled with Swin and ViT variants. The tooling (HuggingFace, timm, MMDetection) all prioritized transformer support. When the community moves, individual architecture merits matter less than the momentum.

What This Means Practically

If you're starting a new CV project in 2022, your default backbone should probably be a Swin Transformer or a ViT variant, not a ResNet. The exceptions are edge deployment (where CNN efficiency still matters) and small datasets (where CNN inductive biases help).

For detection: Swin + Cascade R-CNN or DINO (the detection transformer, not the self-supervised method; confusing naming, I know). The YOLOv9 vs RT-DETR debate would later crystallize this transformer-vs-CNN tension in detection specifically.

For segmentation: Swin + UPerNet or Mask2Former.

For classification: DeiT or ViT with MAE pre-training.

My Take, Studying This in Real Time

I'm reading these papers as part of my coursework at Northeastern, and it's a strange experience. The CNN knowledge I built at Myelin is still valuable: the fundamentals of feature extraction, receptive fields, and multi-scale processing all carry over. The transformer shift has since expanded beyond classification into image enhancement and restoration as well. But the implementation details, the architectures I knew by heart, those are becoming historical context rather than current practice.

That's actually one of the reasons I came to grad school. The field moves so fast that two years of industry experience can feel dated if you don't invest in keeping your foundations current. Better to learn this in a classroom than discover it in a failed production deployment.