Image Super-Resolution in 2020: From SRCNN to ESRGAN
I've been spending a lot of time with super-resolution models lately, and I have to say, the progress in this field over the last few years is genuinely wild. I remember trying bicubic upscaling during undergrad and thinking "there has to be a better way." Turns out there is. Several, actually.
Here's a quick tour of how we got from blurry bilinear upscaling to models that hallucinate realistic textures.
The Problem
Take a low-resolution image and produce a high-resolution version. Simple to state, incredibly hard to do well, because the problem is ill-posed: many different high-res images downscale to the same low-res input. Traditional approaches (bicubic interpolation, Lanczos) give you smooth but blurry results. The goal with deep learning is to add plausible detail that wasn't in the original.
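To make the baseline concrete, here's what classical upscaling looks like in PyTorch (a minimal sketch; the tensor sizes are arbitrary illustrative choices):

```python
import torch
import torch.nn.functional as F

# A 1x3x32x32 "low-res" image (batch, channels, height, width).
lr = torch.rand(1, 3, 32, 32)

# Classical 4x upscaling: bicubic interpolation. Smooth, but it cannot
# invent detail -- every output pixel is a fixed weighted average of
# nearby input pixels, so high frequencies are simply gone.
sr_bicubic = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)

print(sr_bicubic.shape)  # torch.Size([1, 3, 128, 128])
```

Everything below is, in one way or another, a learned replacement for that fixed weighted average.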
SRCNN (2014)
The paper that started it all. Dong et al. showed that even a 3-layer CNN could beat traditional upscaling methods. The architecture was dead simple: extract patches, map them non-linearly, reconstruct. The results weren't amazing by today's standards, but proving that CNNs could do this at all was the breakthrough.
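The three-stage architecture fits in a few lines. Here's a sketch using the paper's 9-1-5 filter configuration (layer names are mine, added for readability):

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN sketch (9-1-5 configuration).

    Operates on an image already upscaled to the target size by bicubic
    interpolation; the three convs mirror the paper's three stages.
    """
    def __init__(self, channels=3):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)      # patch extraction
        self.map = nn.Conv2d(64, 32, kernel_size=1)                           # non-linear mapping
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)  # reconstruction
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.extract(x))
        x = self.relu(self.map(x))
        return self.reconstruct(x)

x = torch.rand(1, 3, 64, 64)   # bicubic-upscaled input
print(SRCNN()(x).shape)        # torch.Size([1, 3, 64, 64])
```

Note that SRCNN doesn't upscale at all: the bicubic step happens first, and the network only sharpens the result.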
VDSR and Deeper Networks (2016)
Kim et al. went deeper with 20 layers and residual learning. The key insight was learning the residual (the difference between the bicubic-upscaled input and the high-res target) instead of the full output. Faster convergence, better results. This became the standard pattern.
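The residual pattern is one line in the forward pass. A shallow sketch in the spirit of VDSR (depth reduced here; the paper uses 20 layers):

```python
import torch
import torch.nn as nn

class VDSRStyle(nn.Module):
    """Residual-learning sketch. The network predicts only the residual;
    the interpolated input is added back at the end, so the conv stack
    only has to learn the missing high-frequency detail."""
    def __init__(self, channels=3, features=64, depth=6):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # Global skip connection: output = input + predicted residual.
        return x + self.body(x)
```

Because the residual is mostly near zero, gradients stay healthy even at depth, which is what made 20 layers trainable in 2016.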
EDSR (2017)
Enhanced Deep SR removed batch normalization from the residual blocks (turns out BN actually hurts SR performance) and scaled up the model. Won the NTIRE 2017 challenge. Clean architecture, strong results.
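The core building block is just conv-ReLU-conv with a skip, BN deliberately absent. A sketch (the 0.1 residual scaling is the value EDSR uses for its large model; a full EDSR stacks many of these plus a pixel-shuffle upsampler):

```python
import torch
import torch.nn as nn

class EDSRBlock(nn.Module):
    """EDSR-style residual block: no batch normalization.

    res_scale damps each block's contribution, which stabilizes
    training when the network is very wide and deep.
    """
    def __init__(self, features=64, res_scale=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(features, features, 3, padding=1)
        self.conv2 = nn.Conv2d(features, features, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.conv2(self.relu(self.conv1(x)))
```

Dropping BN both saves memory and avoids the range artifacts BN introduces when train and test statistics differ.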
SRGAN (2017)
This is where things got really interesting. Ledig et al. used a GAN-based approach with a perceptual loss function instead of just pixel-wise MSE. The difference is huge. MSE loss gives you smooth, safe outputs. Perceptual + adversarial loss gives you sharp, textured outputs that look real.
The trade-off? Sometimes the GAN hallucinates details that weren't there. For photos, this is usually fine. For medical imaging, it's a problem.
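The generator objective combines the two terms. Here's a minimal sketch; the function name is mine, `feat_extractor` stands in for the frozen VGG network the paper uses, and 1e-3 is the adversarial weight from the paper:

```python
import torch
import torch.nn.functional as F

def srgan_generator_loss(sr, hr, disc_fake_logits, feat_extractor,
                         adv_weight=1e-3):
    """SRGAN-style generator loss sketch.

    - content term: MSE between deep features of the SR output and the
      ground truth (perceptual loss), not between raw pixels
    - adversarial term: reward fooling the discriminator into
      labeling SR outputs as real
    """
    perceptual = F.mse_loss(feat_extractor(sr), feat_extractor(hr))
    adversarial = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return perceptual + adv_weight * adversarial
```

Swapping pixel-wise MSE for feature-space MSE is the whole story of why the outputs stop looking smeared: matching VGG features doesn't penalize the model for placing a plausible texture slightly "wrong."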
ESRGAN (2018)
The current sweet spot, and the one I work with most. Wang et al. improved on SRGAN in several ways (later, Real-ESRGAN would push this even further with realistic degradation modeling):
- RRDB blocks (Residual-in-Residual Dense Blocks) instead of standard ResBlocks
- No batch normalization
- Relativistic discriminator
- Better perceptual loss using pre-activation VGG features
The outputs are noticeably sharper and more natural than SRGAN.
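The relativistic discriminator is worth seeing in code, because it's a small change with a real effect. A sketch of the relativistic average discriminator loss ESRGAN uses:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss (sketch).

    Instead of asking "is this image real?", the discriminator estimates
    whether a real image is more realistic than the *average* fake,
    and whether a fake is less realistic than the average real.
    """
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    loss_real = F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel))
    return (loss_real + loss_fake) / 2
```

The generator gets the mirrored version of this loss, so it receives gradient from both real and generated batches instead of only from fakes.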
What I've Learned in Practice
A few things that papers don't always tell you:
Training data matters more than architecture. A mediocre model trained on good, diverse data beats a fancy model trained on limited data. Spend time on your dataset.
Downscaling method matters. If you train on bicubic-downscaled images but your real inputs have JPEG compression artifacts, motion blur, or noise, the model falls apart. Train on realistic degradations.
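A toy version of what "realistic degradations" means in practice, using Pillow and NumPy. This is a deliberately simplified sketch: Real-ESRGAN uses a much richer randomized pipeline, and the specific parameter values here (blur radius, JPEG quality, noise sigma) are illustrative choices of mine, not values from any paper:

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def degrade(hr, scale=4, jpeg_quality=40, noise_sigma=5.0):
    """Toy degradation pipeline: blur -> downscale -> noise -> JPEG."""
    img = hr.filter(ImageFilter.GaussianBlur(radius=1.0))       # mild optical blur
    img = img.resize((hr.width // scale, hr.height // scale),
                     Image.BICUBIC)                             # downscale
    arr = np.asarray(img, dtype=np.float32)
    arr = np.clip(arr + np.random.normal(0, noise_sigma, arr.shape), 0, 255)
    img = Image.fromarray(arr.astype(np.uint8))                 # sensor noise
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)          # compression artifacts
    return Image.open(buf).convert("RGB")
```

Training pairs built this way look far more like camera-phone or web images than clean bicubic downsamples do, and the model generalizes accordingly.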
4x upscaling is the sweet spot. 2x is easy, 8x is mostly hallucination. 4x gives you real improvement with reasonable fidelity.
Perceptual quality is not the same as PSNR. The models with the highest PSNR scores often look the worst to human eyes. Always evaluate visually, not just by metrics.
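For reference, PSNR itself is nothing more than log-scaled pixel MSE, which is exactly why it rewards smooth, averaged outputs:

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB for uint8-range images.

    Higher is "better" pixel-wise, but a slightly blurred image can
    out-score a sharp, perceptually faithful one, because blur minimizes
    per-pixel error while destroying texture.
    """
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)
```

Metrics like LPIPS, or just side-by-side human comparison, track perceived quality far better than this number does.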
What's Next
Real-time SR on mobile devices is becoming feasible with quantized and architecture-searched models. Running ESRGAN-quality output in a browser or on a phone in real-time? That's the frontier I'm most excited about right now. I've been experimenting with running SR models directly in the browser with TensorFlow.js, and the results are promising for lightweight architectures like ESPCN. And looking further ahead, transformer-based approaches are starting to push the quality ceiling even higher.