Convolutional neural networks (CNNs) learn visual patterns by sliding a small filter (kernel) across an input image or feature map. Two settings control how that sliding happens: stride (how far the filter moves each step) and padding (how many pixels you add around the border). These choices directly determine the output feature map size and the kind of information the network keeps or discards. If you are learning CNN design in an AI course in Kolkata, understanding the maths behind stride and padding helps you debug architectures and predict memory and compute costs before you run a model.
Why stride and padding matter
A convolution layer transforms an input of height H and width W into an output of height H_out and width W_out. If you change the output size unintentionally, you can break later layers (for example, when shapes no longer match in residual connections) or lose fine spatial details too early. Stride is mainly responsible for spatial downsampling, while padding mainly affects boundary preservation and whether the output shrinks.
The core output-dimension formula
For a 2D convolution with:
- kernel size K_h × K_w
- stride S_h × S_w
- padding P_h × P_w
- (assuming dilation D = 1 for simplicity)
the output sizes are:
H_out = ⌊(H + 2P_h − K_h) / S_h⌋ + 1
W_out = ⌊(W + 2P_w − K_w) / S_w⌋ + 1
The floor operation is important. It means that if the filter does not “fit” neatly, the layer drops the leftover part at the end rather than producing a partial convolution. This is why certain combinations of H, K, S, P cause off-by-one surprises.
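The formula can be written as a one-line helper, which makes it easy to check shapes before building a model. This is a minimal sketch in plain Python; the function name `conv_output_size` is my own, not from any framework:

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a convolution.

    Integer division (//) implements the floor in the formula, which is
    exactly where any leftover border region gets dropped.
    """
    effective_kernel = dilation * (kernel - 1) + 1  # with D = 1 this is just K
    return (size + 2 * padding - effective_kernel) // stride + 1

# 32x32 input, 3x3 kernel, stride 1, padding 1: size is preserved
print(conv_output_size(32, kernel=3, stride=1, padding=1))  # 32
```

Calling it once per axis (with the height parameters, then the width parameters) covers the non-square case.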
Quick example
Input H = W = 32, kernel K = 3, stride S = 1, padding P = 1:
H_out = ⌊(32 + 2(1) − 3) / 1⌋ + 1 = 32
So the output stays 32 × 32. This is often called “same” behaviour.
Stride as controlled downsampling
Stride S > 1 makes the filter jump more pixels per move. That reduces output dimensions and acts like downsampling:
- With S = 1, the model evaluates the filter at every neighbouring location.
- With S = 2, it evaluates every second location, roughly halving spatial resolution.
Example: Input 32 × 32, K = 3, P = 1, S = 2:
H_out = ⌊(32 + 2 − 3) / 2⌋ + 1 = ⌊31 / 2⌋ + 1 = 15 + 1 = 16
So 32 → 16. This is why strided convolutions are frequently used instead of max pooling.
Practical implication: Stride reduces compute and increases receptive field growth per layer, but it can also remove fine details (edges, small objects). In many architectures, early layers keep S = 1 to capture detail, then later layers downsample.
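You can see the downsampling rate directly by sweeping the stride on a fixed input. A quick sketch in plain Python, using an illustrative helper that just encodes the output-size formula:

```python
def out_size(size, k, s, p):
    # H_out = floor((H + 2P - K) / S) + 1
    return (size + 2 * p - k) // s + 1

# 32x32 input, K = 3, P = 1: each doubling of stride roughly halves resolution
for stride in (1, 2, 4):
    print(stride, out_size(32, k=3, s=stride, p=1))
# stride 1 keeps 32, stride 2 gives 16, stride 4 gives 8
```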
Padding and boundary preservation
Padding adds a border around the input, typically zeros (zero-padding), though other modes exist (reflect, replicate). Padding is used to:
- Prevent the feature map from shrinking too quickly.
- Allow filters to “see” the boundary pixels as often as central pixels.
- Control the alignment needed for skip connections or concatenations.
“Valid” vs “Same”
- Valid convolution usually means P = 0. Output shrinks because the kernel cannot be centred at the border.
- Same convolution (common in many frameworks) chooses padding so the output size matches the input size when S = 1. For odd kernels like K = 3, P = 1 keeps dimensions constant.
Boundary effect: With no padding, border information is used fewer times, so edge features can be underrepresented. With padding, borders contribute more consistently, but zero-padding can introduce artificial edges if not handled carefully.
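The boundary effect can be counted directly: for each input position, tally how many kernel windows include it. The sketch below is a simple 1D, stride-1 illustration (the function name `coverage` is my own):

```python
def coverage(length, k, p):
    """How many kernel windows include each input position (1D, stride 1)."""
    counts = [0] * length
    # Window start positions range over the padded input.
    for start in range(-p, length + p - k + 1):
        for offset in range(k):
            pos = start + offset
            if 0 <= pos < length:
                counts[pos] += 1
    return counts

print(coverage(8, k=3, p=0))  # [1, 2, 3, 3, 3, 3, 2, 1] -- borders seen less often
print(coverage(8, k=3, p=1))  # [2, 3, 3, 3, 3, 3, 3, 2] -- padding evens this out
```

With P = 0 the corner pixels participate in a single window, while interior pixels participate in three; padding narrows that gap, which is the “borders contribute more consistently” point above.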
If you are implementing models after an AI course in Kolkata, this is a common source of confusion: padding does not just “keep size”; it changes what information the network can use at the edges.
Common design choices and pitfalls
1) Choosing stride and padding to match shapes
If later layers expect specific dimensions (for example, flattening into a dense layer), compute sizes in advance using the formula. Small differences compound quickly across layers.
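Chaining the output-size formula across layers catches these mismatches before training. A hedged sketch with a hypothetical four-layer stack (the layer configuration is illustrative, not from any named architecture):

```python
def out_size(size, k, s, p):
    # Output length along one axis: floor((size + 2p - k) / s) + 1
    return (size + 2 * p - k) // s + 1

# Hypothetical stack: (kernel, stride, padding) per layer
layers = [(3, 1, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
size = 224
for k, s, p in layers:
    size = out_size(size, k, s, p)
    print(size)  # 224, 112, 56, 28

# `size` is now what a Flatten -> Dense layer must be sized against.
```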
2) Odd vs even kernels
Odd kernels (3, 5, 7) make symmetric padding straightforward. Even kernels (2, 4) often lead to asymmetric padding or shape shifts because there is no single “centre” pixel.
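The “no centre pixel” problem shows up in the arithmetic: “same” output at stride 1 requires 2P = K − 1, which has no integer solution when K is even. A small sketch, reusing the output-size formula:

```python
def out_size(size, k, s, p):
    return (size + 2 * p - k) // s + 1

# Odd kernel: symmetric padding p = (k - 1) // 2 preserves size at stride 1.
print(out_size(32, k=3, s=1, p=1))  # 32
# Even kernel: no symmetric integer padding gives "same" output.
print(out_size(32, k=4, s=1, p=1))  # 31 (shrinks)
print(out_size(32, k=4, s=1, p=2))  # 33 (grows)
```

This is why frameworks sometimes resort to asymmetric padding (for example, one extra pixel on the right and bottom) for even kernels.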
3) Stride can cause aliasing
Downsampling by stride can discard high-frequency detail. Some designs reduce this with anti-aliasing strategies (like blur pooling), but even without extra layers, you should be aware that aggressive early downsampling can hurt performance on small objects.
4) Padding choice affects feature quality at borders
Zero-padding is simple and popular, but reflect padding can preserve natural continuity in images. The “best” choice depends on the task and data.
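The difference between the two modes is easiest to see on a 1D signal. A minimal sketch in plain Python (the helper `pad_1d` is my own; real frameworks provide this built in):

```python
def pad_1d(row, p, mode="zero"):
    """Pad a 1D signal; 'reflect' mirrors the interior without repeating the edge."""
    if mode == "zero":
        return [0] * p + row + [0] * p
    if mode == "reflect":
        left = row[1:p + 1][::-1]    # mirror values just inside the left edge
        right = row[-p - 1:-1][::-1]  # mirror values just inside the right edge
        return left + row + right
    raise ValueError(f"unknown mode: {mode}")

row = [5, 6, 7, 8]
print(pad_1d(row, 1, "zero"))     # [0, 5, 6, 7, 8, 0] creates an artificial edge
print(pad_1d(row, 1, "reflect"))  # [6, 5, 6, 7, 8, 7] continues the signal
```

The zero-padded version introduces a sharp jump at each border (8 → 0), which a filter can mistake for a real edge; the reflected version stays continuous.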
Conclusion
Stride and padding are not minor tuning knobs. They are mathematical controls over how resolution changes and how borders are treated. Stride primarily decides the rate of spatial downsampling, while padding determines whether the convolution preserves dimensions and how much edge context is retained. Once you can compute feature map sizes confidently and reason about boundary effects, you can design CNN architectures that are stable, efficient, and less error-prone. This is a core skill reinforced in any practical AI course in Kolkata because it helps you move from “trial and error” to predictable model building.
