Normalization (machine learning)
In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range (typically $[0, 1]$). This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.
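The column-wise rescaling described above can be sketched as follows; `min_max_scale` is a hypothetical helper written for illustration, not a standard library function:

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Rescale each column (feature) of X to the given range."""
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against division by zero for constant features.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return lo + (X - col_min) / span * (hi - lo)

# One feature on a large scale, one on a small scale: both end up in [0, 1].
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(min_max_scale(X))  # each column becomes [0.0, 0.5, 1.0]
```

After scaling, both features contribute on comparable scales regardless of their original units.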
Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activation of hidden neurons inside neural networks.
Normalization is often used to increase the speed of training convergence, reduce sensitivity to variations and feature scales in input data, reduce overfitting, and produce better model generalization to unseen data. Normalization techniques are often theoretically justified as reducing internal covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success.
== Batch normalization ==
Batch normalization (BatchNorm) operates on the activations of a layer for each mini-batch.
Consider a simple feedforward network, defined by chaining together modules:

$$x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots$$

BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after $x^{(l)}$; then the network would operate accordingly. Concretely, suppose we have a batch of inputs $x^{(0)}(1), x^{(0)}(2), \dots, x^{(0)}(B)$, fed all at once into the network. We would obtain in the middle of the network some vectors:

$$x^{(l)}(1), x^{(l)}(2), \dots, x^{(l)}(B)$$

The BatchNorm module computes, for each coordinate $i$, the mean and variance of the $i$-th coordinate of each vector in the batch:

$$\mu_i^{(l)} = \frac{1}{B} \sum_{b=1}^{B} x_i^{(l)}(b), \qquad \left(\sigma_i^{(l)}\right)^2 = \frac{1}{B} \sum_{b=1}^{B} \left( x_i^{(l)}(b) - \mu_i^{(l)} \right)^2$$

It then normalizes each coordinate to have zero mean and unit variance:

$$\hat{x}_i^{(l)}(b) = \frac{x_i^{(l)}(b) - \mu_i^{(l)}}{\sqrt{\left(\sigma_i^{(l)}\right)^2 + \epsilon}}$$

where $\epsilon$ is a small positive constant added for numerical stability. Finally, it applies a learned linear transformation:

$$y_i^{(l)}(b) = \gamma_i \hat{x}_i^{(l)}(b) + \beta_i$$

where $\gamma_i$ and $\beta_i$ are trainable parameters.
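The normalize-then-transform steps above can be sketched in NumPy; this is a minimal training-mode sketch (the running statistics used at inference time are omitted):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a batch of vectors x of shape (B, D).

    Normalizes each coordinate over the batch, then applies the learned
    affine transform y = gamma * x_hat + beta.
    """
    mean = x.mean(axis=0)                    # per-coordinate batch mean, shape (D,)
    var = x.var(axis=0)                      # per-coordinate batch variance, shape (D,)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per coordinate
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batchnorm(x, gamma=np.ones(8), beta=np.zeros(8))
# With gamma = 1 and beta = 0, each output coordinate has (approximately)
# zero mean and unit variance across the batch.
print(y.mean(axis=0).round(6), y.var(axis=0).round(3))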
=== Interpretation ===
It is claimed in the original publication that BatchNorm works by reducing internal covariate shift, though the claim has both supporters and detractors.
=== Special cases ===
The original paper recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, $\phi(\mathrm{BN}(Wx + b))$, not $\mathrm{BN}(\phi(Wx + b))$. Also, the bias $b$ is redundant, since it is cancelled by the subsequent mean subtraction, so the layer takes the form $\mathrm{BN}(Wx)$. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero.
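The redundancy of the bias can be checked numerically: shifting every row of the batch by the same vector is undone exactly by the batch-mean subtraction. A minimal sketch (the `batchnorm` helper here performs only the normalization step, with $\gamma = 1$, $\beta = 0$):

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # Normalization step only (gamma = 1, beta = 0), for the demonstration.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 4))
W = rng.normal(size=(4, 4))
b = rng.normal(size=4)

# The bias b shifts every row of the batch equally, so the batch-mean
# subtraction removes it: BN(x @ W + b) == BN(x @ W).
print(np.allclose(batchnorm(x @ W + b), batchnorm(x @ W)))  # True
```

This is why frameworks conventionally disable the bias of a linear or convolutional layer that is immediately followed by BatchNorm.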
For convolutional neural networks (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same kernel as if they are different data points within a batch. This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm.
Concretely, suppose we have a 2-dimensional convolutional layer defined by:

$$x_{h,w,c}^{(l)} = \sum_{h', w', c'} K_{h'-h,\, w'-w,\, c,\, c'}^{(l)} \, x_{h',w',c'}^{(l-1)} + b_c^{(l)}$$

where:

$x_{h,w,c}^{(l)}$ is the activation of the neuron at position $(h, w)$ in the $c$-th channel of the $l$-th layer.
$K^{(l)}$ is a kernel tensor, with indices $h' - h$, $w' - w$, $c$, $c'$.
$b_c^{(l)}$ is the bias term for the $c$-th channel of the $l$-th layer.
In order to preserve translation invariance, BatchNorm treats all outputs from the same kernel in the same batch as more data in a batch. That is, it is applied once per kernel $c$ (equivalently, once per channel $c$), not per activation $x_{h,w,c}^{(l)}$:

$$\mu_c^{(l)} = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{(b),h,w,c}^{(l)}$$

$$\left(\sigma_c^{(l)}\right)^2 = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{(b),h,w,c}^{(l)} - \mu_c^{(l)} \right)^2$$

where $B$ is the batch size, and $H$ and $W$ are the height and width of the feature map.
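The per-channel pooling of statistics can be sketched as follows, assuming a channels-last layout of shape (B, H, W, C); this is a minimal training-mode sketch without running statistics:

```python
import numpy as np

def batchnorm2d(x, gamma, beta, eps=1e-5):
    """Per-channel (spatial) BatchNorm for x of shape (B, H, W, C).

    Statistics are pooled over the batch AND spatial axes (b, h, w),
    once per channel c, preserving translation invariance.
    """
    mean = x.mean(axis=(0, 1, 2))            # shape (C,): one mean per kernel/channel
    var = x.var(axis=(0, 1, 2))              # shape (C,): one variance per channel
    x_hat = (x - mean) / np.sqrt(var + eps)  # broadcast over (B, H, W)
    return gamma * x_hat + beta

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=4.0, size=(8, 5, 5, 3))
y = batchnorm2d(x, gamma=np.ones(3), beta=np.zeros(3))
# Each channel is normalized jointly over all B*H*W positions.
print(y.mean(axis=(0, 1, 2)).round(6), y.var(axis=(0, 1, 2)).round(3))
```

Only C means and C variances are computed, rather than one pair per spatial position, so all outputs of a kernel share the same normalization.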
This content is sourced from Wikipedia, the free encyclopedia.
Category: Machine Learning - Data Science