Activation Functions

An activation function is a mathematical function applied at each node of a neural network to determine that node’s output. Activation functions introduce non-linearity into the network, enabling it to learn complex patterns beyond simple linear relationships.1

Without activation functions, a neural network would collapse to a single linear function regardless of depth, because a composition of linear transformations is itself linear. Each node would simply compute a weighted sum of its inputs:

$$ f(x) = w_1 x_1 + w_2 x_2 + b $$
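
As a quick sanity check, the sketch below (plain NumPy, with illustrative shapes and variable names) shows that two stacked linear layers compute exactly the same function as a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two stacked "linear layers" with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```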


Summary Table

| Function | Range | Differentiable | Primary Use Case |
|----------|-------|----------------|------------------|
| Binary Step | {0, 1} | No | Early perceptrons |
| Linear | (-∞, ∞) | Yes | Regression output |
| Sigmoid | (0, 1) | Yes | Binary classification |
| Tanh | (-1, 1) | Yes | Hidden layers |
| ReLU | [0, ∞) | Yes (except at 0) | CNNs, deep networks |
| Leaky ReLU | (-∞, ∞) | Yes | Deep networks |
| Softmax | (0, 1) | Yes | Multi-class output |
| GELU | (-∞, ∞) | Yes | Transformers |

Binary Step Function 2

The binary step function outputs 0 or 1 depending on whether the input is below or above a threshold. It is simple but not differentiable, making it unsuitable for backpropagation-based learning.

$$ f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$

Use Case: Primarily used in early perceptrons for binary classification.
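
A minimal NumPy sketch (the threshold argument is illustrative):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Outputs 1 where the input meets the threshold and 0 elsewhere.
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]
```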

Linear Function 1

A linear function maintains proportionality between input and output. It’s useful for regression but not for complex non-linear learning.

$$ f(x) = a x $$

Use Case: Output layers in regression networks.
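
A one-line NumPy sketch, with the scale a as an illustrative parameter:

```python
import numpy as np

def linear(x, a=1.0):
    # The gradient is the constant a everywhere, which is why stacking
    # such layers adds no representational power.
    return a * x

print(linear(np.array([-2.0, 0.5, 3.0]), a=2.0))  # [-4.  1.  6.]
```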

Sigmoid Function 3

The sigmoid squashes input values into the range (0, 1). It’s smooth and differentiable, making it suitable for probabilistic outputs, though it can suffer from vanishing gradients.

$$ f(x) = \frac{1}{1 + e^{-x}} $$

Use Case: Binary classification output layers.
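
A minimal NumPy sketch (not tuned for numerical stability at extreme inputs):

```python
import numpy as np

def sigmoid(x):
    # Maps any real input into (0, 1); gradients shrink toward 0 for
    # large |x|, which is the source of the vanishing-gradient issue.
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018 0.5 0.982]
```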

Hyperbolic Tangent (tanh) 4

The tanh function outputs values between -1 and 1, centering data and often outperforming sigmoid in hidden layers due to its zero-centered property.

$$ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$

Use Case: Hidden layers in classical feedforward networks.
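
NumPy ships this function directly; a thin wrapper for illustration:

```python
import numpy as np

def tanh(x):
    # np.tanh computes (e^x - e^-x) / (e^x + e^-x) in a numerically stable way.
    return np.tanh(x)

print(tanh(np.array([-2.0, 0.0, 2.0])))  # ~[-0.964  0.  0.964]
```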

ReLU (Rectified Linear Unit) 5

ReLU replaces all negative values with zero, allowing faster convergence, but it can suffer from the “dying ReLU” problem, in which neurons permanently output zero.

$$ f(x) = \max(0, x) $$

Use Case: Hidden layers in CNNs and deep neural networks.
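
A minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    # Passes positive values through unchanged and clamps negatives to 0.
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, 0.0, 2.5])))  # [0.  0.  2.5]
```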

Leaky ReLU 6

Leaky ReLU addresses ReLU’s dying neuron problem by allowing a small, non-zero gradient for negative inputs.

$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} $$

Use Case: Deep networks where gradient flow preservation is important.
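
A minimal NumPy sketch, assuming the commonly used default slope of 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a small fixed slope applied to negative inputs.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, 0.0, 2.5])))  # [-0.03  0.    2.5 ]
```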

Parametric ReLU (PReLU) 7

PReLU generalizes Leaky ReLU by learning the slope of the negative part during training.

$$ f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases} $$

Use Case: Deep CNNs to improve model adaptability.
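
A NumPy sketch; in a real framework the slope a would be a trainable parameter, so here it is simply passed in:

```python
import numpy as np

def prelu(x, a):
    # Same shape as Leaky ReLU, but a is learned during training
    # rather than fixed in advance.
    return np.where(x > 0, x, a * x)

print(prelu(np.array([-3.0, 2.5]), a=0.25))  # [-0.75  2.5 ]
```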

Exponential Linear Unit (ELU) 8

ELU smooths the transition for negative values, improving learning speed and reducing bias shifts.

$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases} $$

Use Case: Deep networks with noisy or normalized inputs.
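
A minimal NumPy sketch (α = 1 is a common default):

```python
import numpy as np

def elu(x, alpha=1.0):
    # exp is applied only to the non-positive part so large positive
    # inputs cannot overflow.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(elu(np.array([-2.0, 0.0, 3.0])))  # ~[-0.865  0.  3. ]
```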

Softmax

Softmax converts a vector of real numbers into probabilities that sum to 1. It is typically used in the final layer for multi-class classification.9

$$ f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $$

Use Case: Output layers for multi-class problems.
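
A minimal NumPy sketch for a single vector, using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(x):
    # Softmax is shift-invariant, so subtracting the max does not change
    # the result but prevents overflow in exp.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.090 0.245 0.665], sums to 1
```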

Swish 10

Swish is a smooth, non-monotonic function that outperforms ReLU in many tasks. It’s defined as x multiplied by its sigmoid. $$ f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} $$ Use Case: State-of-the-art deep networks (e.g., EfficientNet).
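
A minimal NumPy sketch:

```python
import numpy as np

def swish(x):
    # x times its sigmoid; unlike ReLU it is smooth and slightly
    # negative for small negative inputs.
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-1.0, 0.0, 2.0])))  # ~[-0.269  0.  1.762]
```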

GELU (Gaussian Error Linear Unit) 11

GELU combines ideas from dropout and ReLU, weighting inputs by their probability of being positive. It provides smoother activation than ReLU.

$$ f(x) = x \Phi(x) $$

where $\Phi(x)$ is the Gaussian cumulative distribution function (CDF).
Use Case: Transformer architectures (e.g., BERT).
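
A NumPy sketch using the widely cited tanh approximation of $x \Phi(x)$ (the exact form would use the Gaussian error function):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3))).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(np.array([-1.0, 0.0, 1.0])))  # ~[-0.159  0.  0.841]
```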

SELU (Scaled Exponential Linear Unit) 12

SELU automatically normalizes activations to zero mean and unit variance, enabling self-normalizing networks without explicit batch normalization.

$$ f(x) = \begin{cases} \lambda x, & x > 0 \\ \lambda \alpha (e^{x} - 1), & x \le 0 \end{cases} $$

where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are fixed scaling constants derived from the self-normalization conditions.

Use Case: Self-normalizing networks, particularly feedforward architectures.
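
A minimal NumPy sketch using the rounded constants from the text:

```python
import numpy as np

# Rounded constants as quoted above; the full-precision values come from the SELU derivation.
ALPHA, LAMBDA = 1.6733, 1.0507

def selu(x):
    # Scaled ELU: the lambda/alpha pair pushes activations toward
    # zero mean and unit variance across layers.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(selu(np.array([-1.0, 0.0, 1.0])))  # ~[-1.111  0.  1.051]
```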


Choosing an Activation Function

| Scenario | Recommended Function |
|----------|----------------------|
| Hidden layers (general) | ReLU or Leaky ReLU |
| Transformer models | GELU |
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
| Regression output | Linear (no activation) |
| Self-normalizing networks | SELU |
| State-of-the-art vision models | Swish |

  • Neural Networks - How activation functions fit into network architecture
  • Loss Functions - Measuring error for training
  • Optimization - How gradients flow through activations

References