Activation Functions

An activation function is a mathematical function applied at each node of a neural network to determine that node’s output. Activation functions introduce non-linearity into the network, enabling it to learn complex patterns beyond simple linear relationships.1

Without activation functions, a neural network would collapse to a single linear function regardless of depth, because a composition of linear transformations is itself linear. Each node would simply compute a weighted sum of its inputs:

$$ f(x) = w_1 x_1 + w_2 x_2 + b $$
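
As a quick sanity check, the sketch below (plain NumPy, with illustrative shapes and variable names) shows that two stacked linear layers compute exactly the same function as a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two stacked "linear layers" with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```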


Summary Table

| Function | Range | Differentiable | Primary Use Case |
|----------|-------|----------------|------------------|
| Binary Step | {0, 1} | No | Early perceptrons |
| Linear | (-∞, ∞) | Yes | Regression output |
| Sigmoid | (0, 1) | Yes | Binary classification |
| Tanh | (-1, 1) | Yes | Hidden layers |
| ReLU | [0, ∞) | Yes (except at 0) | CNNs, deep networks |
| Leaky ReLU | (-∞, ∞) | Yes | Deep networks |
| Softmax | (0, 1) | Yes | Multi-class output |
| GELU | (-∞, ∞) | Yes | Transformers |

Binary Step Function 2

The binary step function outputs 0 or 1 depending on whether the input is below or above a threshold. It is simple but not differentiable, making it unsuitable for backpropagation-based learning.

$$ f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$

Use Case: Primarily used in early perceptrons for binary classification.
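
A minimal NumPy sketch (the threshold argument is illustrative):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Outputs 1 where the input meets the threshold and 0 elsewhere.
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]
```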

Linear Function 1

A linear function maintains proportionality between input and output. It’s useful for regression but not for complex non-linear learning.

$$ f(x) = a x $$

Use Case: Output layers in regression networks.
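
A one-line NumPy sketch, with the scale a as an illustrative parameter:

```python
import numpy as np

def linear(x, a=1.0):
    # The gradient is the constant a everywhere, which is why stacking
    # such layers adds no representational power.
    return a * x

print(linear(np.array([-2.0, 0.5, 3.0]), a=2.0))  # [-4.  1.  6.]
```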

Sigmoid Function 3

The sigmoid squashes input values into the range (0, 1). It’s smooth and differentiable, making it suitable for probabilistic outputs, though it can suffer from vanishing gradients.

$$ f(x) = \frac{1}{1 + e^{-x}} $$

Use Case: Binary classification output layers.
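
A minimal NumPy sketch (not tuned for numerical stability at extreme inputs):

```python
import numpy as np

def sigmoid(x):
    # Maps any real input into (0, 1); gradients shrink toward 0 for
    # large |x|, which is the source of the vanishing-gradient issue.
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018 0.5 0.982]
```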

Hyperbolic Tangent (tanh) 4

The tanh function outputs values between -1 and 1, centering data and often outperforming sigmoid in hidden layers due to its zero-centered property.

$$ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$

Use Case: Hidden layers in classical feedforward networks.
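
NumPy ships this function directly; a thin wrapper for illustration:

```python
import numpy as np

def tanh(x):
    # np.tanh computes (e^x - e^-x) / (e^x + e^-x) in a numerically stable way.
    return np.tanh(x)

print(tanh(np.array([-2.0, 0.0, 2.0])))  # ~[-0.964  0.  0.964]
```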

ReLU (Rectified Linear Unit) 5

ReLU replaces all negative values with zero, allowing faster convergence, but it can suffer from the “dying ReLU” problem, in which neurons permanently output zero.

$$ f(x) = \max(0, x) $$

Use Case: Hidden layers in CNNs and deep neural networks.
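
A minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    # Passes positive values through unchanged and clamps negatives to 0.
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, 0.0, 2.5])))  # [0.  0.  2.5]
```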

Leaky ReLU 6

Leaky ReLU addresses ReLU’s dying neuron problem by allowing a small, non-zero gradient for negative inputs.

$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} $$

Use Case: Deep networks where gradient flow preservation is important.
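
A minimal NumPy sketch, assuming the commonly used default slope of 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a small fixed slope applied to negative inputs.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, 0.0, 2.5])))  # [-0.03  0.    2.5 ]
```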

Parametric ReLU (PReLU) 7

PReLU generalizes Leaky ReLU by learning the slope of the negative part during training.

$$ f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases} $$

Use Case: Deep CNNs to improve model adaptability.
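
A NumPy sketch; in a real framework the slope a would be a trainable parameter, so here it is simply passed in:

```python
import numpy as np

def prelu(x, a):
    # Same shape as Leaky ReLU, but a is learned during training
    # rather than fixed in advance.
    return np.where(x > 0, x, a * x)

print(prelu(np.array([-3.0, 2.5]), a=0.25))  # [-0.75  2.5 ]
```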

Exponential Linear Unit (ELU) 8

ELU smooths the transition for negative values, improving learning speed and reducing bias shifts.

$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases} $$

Use Case: Deep networks with noisy or normalized inputs.
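
A minimal NumPy sketch (α = 1 is a common default):

```python
import numpy as np

def elu(x, alpha=1.0):
    # exp is applied only to the non-positive part so large positive
    # inputs cannot overflow.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(elu(np.array([-2.0, 0.0, 3.0])))  # ~[-0.865  0.  3. ]
```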

Softmax

Softmax converts a vector of real numbers into probabilities that sum to 1. It is typically used in the final layer for multi-class classification.9

$$ f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $$

Use Case: Output layers for multi-class problems.
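
A minimal NumPy sketch for a single vector, using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(x):
    # Softmax is shift-invariant, so subtracting the max does not change
    # the result but prevents overflow in exp.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.090 0.245 0.665], sums to 1
```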

Swish 10

Swish is a smooth, non-monotonic function that outperforms ReLU in many tasks. It’s defined as x multiplied by its sigmoid. $$ f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} $$ Use Case: State-of-the-art deep networks (e.g., EfficientNet).
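
A minimal NumPy sketch:

```python
import numpy as np

def swish(x):
    # x times its sigmoid; unlike ReLU it is smooth and slightly
    # negative for small negative inputs.
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-1.0, 0.0, 2.0])))  # ~[-0.269  0.  1.762]
```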

GELU (Gaussian Error Linear Unit) 11

GELU combines ideas from dropout and ReLU, weighting inputs by their probability of being positive. It provides smoother activation than ReLU.

$$ f(x) = x \Phi(x) $$

where $\Phi(x)$ is the Gaussian cumulative distribution function (CDF).
Use Case: Transformer architectures (e.g., BERT).
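
A NumPy sketch using the widely cited tanh approximation of $x \Phi(x)$ (the exact form would use the Gaussian error function):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3))).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(np.array([-1.0, 0.0, 1.0])))  # ~[-0.159  0.  0.841]
```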

SELU (Scaled Exponential Linear Unit) 12

SELU automatically normalizes activations to zero mean and unit variance, enabling self-normalizing networks without explicit batch normalization.

$$ f(x) = \begin{cases} \lambda x, & x > 0 \\ \lambda \alpha (e^{x} - 1), & x \le 0 \end{cases} $$

where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are fixed scaling constants derived from the self-normalization conditions.

Use Case: Self-normalizing networks, particularly feedforward architectures.
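
A minimal NumPy sketch using the rounded constants from the text:

```python
import numpy as np

# Rounded constants as quoted above; the full-precision values come from the SELU derivation.
ALPHA, LAMBDA = 1.6733, 1.0507

def selu(x):
    # Scaled ELU: the lambda/alpha pair pushes activations toward
    # zero mean and unit variance across layers.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(selu(np.array([-1.0, 0.0, 1.0])))  # ~[-1.111  0.  1.051]
```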


Choosing an Activation Function

| Scenario | Recommended Function |
|----------|----------------------|
| Hidden layers (general) | ReLU or Leaky ReLU |
| Transformer models | GELU |
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
| Regression output | Linear (no activation) |
| Self-normalizing networks | SELU |
| State-of-the-art vision models | Swish |

  • Neural Networks - How activation functions fit into network architecture
  • Loss Functions - Measuring error for training
  • Optimization - How gradients flow through activations

References