Image Super-Resolution Reconstruction Based on ResNet (SRResNet)
This note documents my practice on image super-resolution based on ResNet: task background, model architecture, training pipeline, and results. Comparisons with SRCNN and SRGAN are included. The full project is available on my GitHub.
Abstract
Single-image super-resolution (SISR) reconstructs a high-resolution (HR) image from a low-resolution (LR) input. I implemented SRResNet, which is built on residual networks (ResNet): the backbone stacks multiple residual blocks (each containing two convolutions with a skip connection), and the tail uses pixel shuffle for efficient upsampling. Training uses the COCO2014 train/val splits; the loss is MSE and the optimizer is Adam. On three example categories (portraits, remote sensing, and astronomical backgrounds), SRResNet significantly improves fine details. Compared to GAN-based methods, SRGAN can produce sharper edges but may introduce texture artifacts; SRResNet yields a more balanced and stable result.
1. Task and Data
- Objective: reconstruct HR from LR (an image-to-image regression task).
- Dataset: COCO2014 (train2014 + val2014), natural images across many categories.
- Samples: training inputs are downsampled LR; labels are the original HR.
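As a concrete illustration of how an (LR, HR) training pair might be produced, here is a minimal sketch; the crop size, 4× factor, and bicubic interpolation are illustrative assumptions, not values taken from the project:

```python
# Sketch: build an (LR, HR) pair by downsampling an HR patch (assumed scale/crop values).
from PIL import Image
import torchvision.transforms.functional as TF

def make_lr_hr_pair(path, scale=4, crop=96):
    hr = Image.open(path).convert("RGB")
    hr = TF.center_crop(hr, crop)                    # fixed-size HR patch (the label)
    lr = hr.resize((crop // scale, crop // scale),   # downsample to create the LR input
                   Image.BICUBIC)
    return TF.to_tensor(lr), TF.to_tensor(hr)        # tensors in [0, 1]
```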

2. Model Design — SRResNet
2.1 Residual Block
Conventional layer-wise mapping (y=f(x)) can accumulate approximation error
and suffer from vanishing gradients as depth grows. Residual learning reformulates
the mapping as (y=f(x)+x); the skip connection enables deeper networks with more
stable training.

- In this implementation, each residual block contains two convolutional sub-blocks (`conv_block1` / `conv_block2`) with matched input/output channels. In the forward pass, the input `x` is added to the output of `conv_block2` to form the residual connection.
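A minimal PyTorch sketch of such a residual block; the channel width, 3×3 kernels, and the BatchNorm/PReLU layout follow the common SRResNet recipe and are assumptions about this implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv sub-blocks with a skip connection: y = x + conv_block2(conv_block1(x))."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
        )
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.conv_block2(self.conv_block1(x))  # residual (skip) connection
```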
2.2 Pixel Shuffle
Common upsampling schemes (bilinear, transposed convolution) may incur information loss or checkerboard artifacts. Pixel shuffle first expands channels and then rearranges them to a higher spatial resolution:
- e.g., a convolution maps ((C, H, W)) to ((r^2 C, H, W));
- the channels are then rearranged to ((C, rH, rW)), achieving an upscaling factor of (r).
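A short sketch of one such sub-pixel upsampling stage using `nn.PixelShuffle`; the 64-channel width and (r=2) here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Expand channels by r^2, then rearrange (r^2*C, H, W) -> (C, r*H, r*W)."""
    def __init__(self, channels=64, r=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

x = torch.randn(1, 64, 24, 24)
print(SubPixelUpsample()(x).shape)  # torch.Size([1, 64, 48, 48])
```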
2.3 Generator
- The generator backbone is SRResNet: it takes LR as input and outputs SR; the forward pass performs feature extraction and upsampling within SRResNet.
The overall model framework is shown below:

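As a rough sketch of how these pieces might be assembled into the generator, the snippet below reuses the `ResidualBlock` and `SubPixelUpsample` sketches above; the block count of 16 and the overall 4× scale are assumptions for illustration:

```python
import torch.nn as nn

class SRResNet(nn.Module):
    """LR -> shallow features -> residual blocks -> sub-pixel upsampling -> SR."""
    def __init__(self, channels=64, n_blocks=16, scale=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, channels, 9, padding=4), nn.PReLU())
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        # two x2 sub-pixel stages give an overall x4 upscaling
        ups = [SubPixelUpsample(channels, r=2) for _ in range(scale // 2)]
        self.tail = nn.Sequential(*ups, nn.Conv2d(channels, 3, 9, padding=4))

    def forward(self, lr):
        feats = self.head(lr)
        return self.tail(feats + self.body(feats))  # long skip over the residual body
```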
3. Training Setup
- Loss: mean squared error (MSE), measuring the pixel-wise discrepancy between (\hat{I}_{SR}) and (I_{HR}): (L_{MSE} = \frac{1}{N}\sum_{i}(I_{HR,i} - \hat{I}_{SR,i})^2).
- Optimizer: Adam.
- Procedure: load data → forward → compute MSE → backprop & update → periodically save checkpoints and logs. A minimal sketch of this loop follows the list.
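The sketch below assumes the `SRResNet` module sketched earlier, a hypothetical `train_loader` that yields (LR, HR) batches, and an illustrative learning rate and checkpoint path; none of these values are taken from the project:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SRResNet().to(device)                        # generator sketched above
criterion = nn.MSELoss()                             # pixel-wise MSE between SR and HR
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(100):
    for lr_img, hr_img in train_loader:              # hypothetical (LR, HR) DataLoader
        lr_img, hr_img = lr_img.to(device), hr_img.to(device)
        sr_img = model(lr_img)                       # forward
        loss = criterion(sr_img, hr_img)             # compute MSE
        optimizer.zero_grad()
        loss.backward()                              # backprop
        optimizer.step()                             # update
    torch.save(model.state_dict(), f"srresnet_epoch{epoch}.pth")  # periodic checkpoint
```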
4. Comparative Methods
4.1 SRCNN (early CNN‑based approach)
A minimal 3‑layer convolutional structure: feature extraction (9×9) → nonlinear
mapping (1×1) → reconstruction (5×5), conceptually aligned with sparse‑coding
pipelines. The CNN structure is shown below:
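A minimal sketch of that three-layer structure; the 64/32 channel widths follow the common SRCNN configuration, and treating the input as a single pre-upscaled luminance channel is an assumption here:

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """9x9 feature extraction -> 1x1 nonlinear mapping -> 5x5 reconstruction.
    Operates on a bicubically pre-upscaled input."""
    def __init__(self, in_channels=1):   # assumed: luminance-channel input
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                     nn.ReLU(inplace=True),
            nn.Conv2d(32, in_channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.net(x)
```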

4.2 SRGAN (adversarial approach)
Perceptual quality is optimized within a GAN framework: the generator mirrors
SRResNet, and the discriminator distinguishes real HR images from generated SR
images. The perceptual loss combines a content term (pixel-wise or VGG feature loss)
with a weighted adversarial term. SRGAN can enhance detail but may introduce
texture artifacts. The architecture is shown below:
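A rough sketch of how such a combined generator loss might be formed; the VGG feature content term, the adversarial weight of 1e-3, and the helper `vgg_features` are illustrative assumptions, not this project's code:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def generator_loss(sr, hr, disc_logits_on_sr, vgg_features, adv_weight=1e-3):
    """Perceptual loss = content term (VGG feature MSE) + weighted adversarial term."""
    content = mse(vgg_features(sr), vgg_features(hr))     # content / feature loss
    adversarial = bce(disc_logits_on_sr,                  # push D to label SR as real
                      torch.ones_like(disc_logits_on_sr))
    return content + adv_weight * adversarial
```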

5. Results (three representative scenarios)
Left: LR (input) | Middle: SRGAN (reference) | Right: SRResNet (this work)
Observation: SRGAN tends to sharpen edges but may introduce texture artifacts; SRResNet is more balanced and stable (consistent with the qualitative summary).
6. Environment
- Language: Python; deep‑learning framework: PyTorch
- GPU: NVIDIA GeForce RTX 3060, CUDA 11.8 (local)
- Cloud: PaddlePaddle/online environment (for experiments/comparisons)
7. References
- Ledig et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” CVPR 2017
- Dong et al., “Image Super-Resolution Using Deep Convolutional Networks” (SRCNN), IEEE TPAMI 2016
- Additional survey literature and online resources on super-resolution