Image Super-Resolution Reconstruction Based on ResNet (SRResNet)
This note documents my practice on image super-resolution based on ResNet: task background, model architecture, training pipeline, and results. Comparisons with SRCNN and SRGAN are included. The full project is available on my GitHub.
Abstract
Single-image super-resolution (SISR) reconstructs a high-resolution (HR) image from a low-resolution (LR) input. I implemented SRResNet, which is built on residual networks (ResNet): the backbone stacks multiple residual blocks (each containing two convolutions with a skip connection), and the tail uses pixel shuffle for efficient upsampling. Training uses the COCO2014 train/val splits; the loss is MSE and the optimizer is Adam. On three example categories (portraits, remote sensing, and astronomical backgrounds), SRResNet significantly improves fine details. Compared to GAN-based methods, SRGAN can produce sharper edges but may introduce texture artifacts; SRResNet yields a more balanced and stable result.
1. Task and Data
- Objective: reconstruct HR from LR (an image-to-image regression task).
- Dataset: COCO2014 (train2014 + val2014), natural images across many categories.
- Samples: training inputs are downsampled LR; labels are the original HR.
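As a concrete illustration of how an (LR, HR) training pair might be produced, here is a minimal sketch; the crop size, 4× factor, and bicubic interpolation are illustrative assumptions, not values taken from the project:

```python
# Sketch: build an (LR, HR) pair by downsampling an HR patch (assumed scale/crop values).
from PIL import Image
import torchvision.transforms.functional as TF

def make_lr_hr_pair(path, scale=4, crop=96):
    hr = Image.open(path).convert("RGB")
    hr = TF.center_crop(hr, crop)                    # fixed-size HR patch (the label)
    lr = hr.resize((crop // scale, crop // scale),   # downsample to create the LR input
                   Image.BICUBIC)
    return TF.to_tensor(lr), TF.to_tensor(hr)        # tensors in [0, 1]
```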

2. Model Design — SRResNet
2.1 Residual Block
Conventional layer-wise mapping (y=f(x)) can accumulate approximation error
and suffer from vanishing gradients as depth grows. Residual learning reformulates
the mapping as (y=f(x)+x); the skip connection enables deeper networks with more
stable training.

- In this implementation, each residual block contains two convolutional sub-blocks (`conv_block1` / `conv_block2`) with matched input/output channels. In the forward pass, the input `x` is added to the output of `conv_block2` to form the residual connection.
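A minimal PyTorch sketch of such a residual block; the channel width, 3×3 kernels, and the BatchNorm/PReLU layout follow the common SRResNet recipe and are assumptions about this implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv sub-blocks with a skip connection: y = x + conv_block2(conv_block1(x))."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
        )
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.conv_block2(self.conv_block1(x))  # residual (skip) connection
```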
2.2 Pixel Shuffle
Common upsampling schemes (bilinear, transposed convolution) may incur information loss or checkerboard artifacts. Pixel shuffle first expands channels and then rearranges them to a higher spatial resolution:
- e.g., a convolution maps ((C, H, W)) to ((r^2 C, H, W));
- the channels are then rearranged to ((C, rH, rW)), achieving an upscaling factor of (r).
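A short sketch of one such sub-pixel upsampling stage using `nn.PixelShuffle`; the 64-channel width and (r=2) here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Expand channels by r^2, then rearrange (r^2*C, H, W) -> (C, r*H, r*W)."""
    def __init__(self, channels=64, r=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

x = torch.randn(1, 64, 24, 24)
print(SubPixelUpsample()(x).shape)  # torch.Size([1, 64, 48, 48])
```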
2.3 Generator
- The generator backbone is SRResNet: it takes LR as input and outputs SR; the forward pass performs feature extraction and upsampling within SRResNet.
The overall model framework is shown below:

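As a rough sketch of how these pieces might be assembled into the generator, the snippet below reuses the `ResidualBlock` and `SubPixelUpsample` sketches above; the block count of 16 and the overall 4× scale are assumptions for illustration:

```python
import torch.nn as nn

class SRResNet(nn.Module):
    """LR -> shallow features -> residual blocks -> sub-pixel upsampling -> SR."""
    def __init__(self, channels=64, n_blocks=16, scale=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, channels, 9, padding=4), nn.PReLU())
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        # two x2 sub-pixel stages give an overall x4 upscaling
        ups = [SubPixelUpsample(channels, r=2) for _ in range(scale // 2)]
        self.tail = nn.Sequential(*ups, nn.Conv2d(channels, 3, 9, padding=4))

    def forward(self, lr):
        feats = self.head(lr)
        return self.tail(feats + self.body(feats))  # long skip over the residual body
```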
3. Training Setup
- Loss: mean squared error (MSE), measuring the pixel-wise discrepancy between (\hat{I}_{SR}) and (I_{HR}): (L_{MSE} = \frac{1}{N}\sum_{i}(I_{HR,i} - \hat{I}_{SR,i})^2).
- Optimizer: Adam.
- Procedure: load data → forward → compute MSE → backprop & update → periodically save checkpoints and logs. A minimal sketch of this loop follows the list.
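The sketch below assumes the `SRResNet` module sketched earlier, a hypothetical `train_loader` that yields (LR, HR) batches, and an illustrative learning rate and checkpoint path; none of these values are taken from the project:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SRResNet().to(device)                        # generator sketched above
criterion = nn.MSELoss()                             # pixel-wise MSE between SR and HR
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(100):
    for lr_img, hr_img in train_loader:              # hypothetical (LR, HR) DataLoader
        lr_img, hr_img = lr_img.to(device), hr_img.to(device)
        sr_img = model(lr_img)                       # forward
        loss = criterion(sr_img, hr_img)             # compute MSE
        optimizer.zero_grad()
        loss.backward()                              # backprop
        optimizer.step()                             # update
    torch.save(model.state_dict(), f"srresnet_epoch{epoch}.pth")  # periodic checkpoint
```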
4. Comparative Methods
4.1 SRCNN (early CNN‑based approach)
A minimal 3‑layer convolutional structure: feature extraction (9×9) → nonlinear
mapping (1×1) → reconstruction (5×5), conceptually aligned with sparse‑coding
pipelines. The CNN structure is shown below:
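A minimal sketch of that three-layer structure; the 64/32 channel widths follow the common SRCNN configuration, and treating the input as a single pre-upscaled luminance channel is an assumption here:

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """9x9 feature extraction -> 1x1 nonlinear mapping -> 5x5 reconstruction.
    Operates on a bicubically pre-upscaled input."""
    def __init__(self, in_channels=1):   # assumed: luminance-channel input
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                     nn.ReLU(inplace=True),
            nn.Conv2d(32, in_channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.net(x)
```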

4.2 SRGAN (adversarial approach)
Perceptual quality is optimized within a GAN framework: the generator mirrors
SRResNet, and the discriminator distinguishes real HR images from generated SR
images. The perceptual loss combines a content term (pixel-wise or VGG feature loss)
with a weighted adversarial term. SRGAN can enhance detail but may introduce
texture artifacts. The architecture is shown below:
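A rough sketch of how such a combined generator loss might be formed; the VGG feature content term, the adversarial weight of 1e-3, and the helper `vgg_features` are illustrative assumptions, not this project's code:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def generator_loss(sr, hr, disc_logits_on_sr, vgg_features, adv_weight=1e-3):
    """Perceptual loss = content term (VGG feature MSE) + weighted adversarial term."""
    content = mse(vgg_features(sr), vgg_features(hr))     # content / feature loss
    adversarial = bce(disc_logits_on_sr,                  # push D to label SR as real
                      torch.ones_like(disc_logits_on_sr))
    return content + adv_weight * adversarial
```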

5. Results (three representative scenarios)
Left: LR (input) | Middle: SRGAN (reference) | Right: SRResNet (this work)
Observation: SRGAN tends to sharpen edges but may introduce texture artifacts; SRResNet is more balanced and stable (consistent with the qualitative summary).
6. Environment
- Language: Python; deep‑learning framework: PyTorch
- GPU: NVIDIA GeForce RTX 3060, CUDA 11.8 (local)
- Cloud: PaddlePaddle/online environment (for experiments/comparisons)
7. References
- Ledig et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” CVPR 2017
- Dong et al., “Image Super-Resolution Using Deep Convolutional Networks” (SRCNN), IEEE TPAMI 2016
- Additional survey literature and online resources on super-resolution