Activation Functions
An activation function is a mathematical function applied within the nodes of a neural network to determine the output of that node. Activation functions introduce non-linearity into the network, enabling it to learn complex patterns beyond simple linear relationships.1
Without activation functions, a neural network would behave as a single linear function regardless of depth:
$$ f(x) = w_1x_1 + w_2x_2 + b $$
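To make the collapse concrete, here is a minimal NumPy sketch (layer sizes and random weights chosen arbitrarily for illustration): composing two linear layers with no activation in between gives exactly the same mapping as a single affine transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "linear layers" with no activation in between (toy sizes, for illustration only).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Layer-by-layer forward pass without activations.
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single affine transformation.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: extra depth adds no expressive power here
```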
Summary Table
| Function | Range | Differentiable | Primary Use Case |
|---|---|---|---|
| Binary Step | {0, 1} | No | Early perceptrons |
| Linear | (-∞, ∞) | Yes | Regression output |
| Sigmoid | (0, 1) | Yes | Binary classification |
| Tanh | (-1, 1) | Yes | Hidden layers |
| ReLU | [0, ∞) | Yes (except at 0) | CNNs, deep networks |
| Leaky ReLU | (-∞, ∞) | Yes (except at 0) | Deep networks |
| Softmax | (0, 1) | Yes | Multi-class output |
| GELU | ≈ [-0.17, ∞) | Yes | Transformers |
Binary Step Function 2
The binary step function outputs a 0 or 1 depending on whether the input is below or above a threshold. It is simple but not differentiable, making it unsuitable for backpropagation-based learning.
$$ f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
Use Case: Primarily used in early perceptrons for binary classification.
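A one-line NumPy sketch of the step function (threshold fixed at 0, as in the formula above); the comment notes why it blocks gradient-based training.

```python
import numpy as np

def binary_step(x):
    """Binary step: 1 for x >= 0, 0 otherwise."""
    return np.where(x >= 0, 1.0, 0.0)

print(binary_step(np.array([-2.0, -0.1, 0.0, 3.0])))  # [0. 0. 1. 1.]
# The derivative is 0 everywhere (undefined at 0), so no gradient can flow through it.
```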
Linear Function 1
A linear function maintains proportionality between input and output. It’s useful for regression but not for complex non-linear learning.
$$ f(x) = a x $$
Use Case: Output layers in regression networks.
Sigmoid Function 3
The sigmoid squashes input values into the range (0, 1). It’s smooth and differentiable, making it suitable for probabilistic outputs, though it can suffer from vanishing gradients.
$$ f(x) = \frac{1}{1 + e^{-x}} $$
Use Case: Binary classification output layers.
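A short NumPy sketch of the sigmoid and its derivative $f(x)(1 - f(x))$; the sample inputs are arbitrary and chosen to show how the gradient vanishes for large |x|.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, matching f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
print(s)              # outputs squashed into (0, 1)
print(s * (1.0 - s))  # derivative f(x)(1 - f(x)); nearly 0 at |x| = 10, i.e. vanishing gradients
```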
Hyperbolic Tangent (tanh) 4
The tanh function outputs between -1 and 1, centering data and often outperforming sigmoid in hidden layers due to its zero-centered property.
$$ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$
Use Case: Hidden layers in classical feedforward networks.
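A quick check with NumPy’s built-in `np.tanh` (sample inputs are arbitrary), showing the zero-centered outputs and the derivative $1 - \tanh^2(x)$:

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
y = np.tanh(x)
print(y)           # zero-centered outputs in (-1, 1)
print(1.0 - y**2)  # derivative 1 - tanh(x)^2, largest near x = 0
```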
ReLU (Rectified Linear Unit) 5
ReLU replaces all negative values with zero, allowing faster convergence, but it can suffer from “dying ReLU” when neurons permanently output zero.
$$ f(x) = \max(0, x) $$
Use Case: Hidden layers in CNNs and deep neural networks.
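A minimal NumPy sketch of ReLU; the closing comment summarizes the “dying ReLU” issue mentioned above.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x) applied element-wise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negative inputs are clipped to 0
# The gradient is 1 for x > 0 and 0 for x < 0; a neuron stuck with negative
# pre-activations receives no gradient and stops learning ("dying ReLU").
```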
Leaky ReLU 6
Leaky ReLU addresses ReLU’s dying neuron problem by allowing a small, non-zero gradient for negative inputs.
$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} $$
Use Case: Deep networks where gradient flow preservation is important.
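A NumPy sketch of Leaky ReLU; α = 0.01 is a commonly used default, assumed here for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU with a small fixed negative slope (alpha = 0.01 assumed as default)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))  # negative inputs keep a small, non-zero output and gradient
```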
Parametric ReLU (PReLU) 7
PReLU generalizes Leaky ReLU by learning the slope of the negative part during training.
$$ f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases} $$
Use Case: Deep CNNs to improve model adaptability.
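A sketch of a PReLU forward pass plus the gradient of the output with respect to the learned slope `a` (the initial value 0.25 is an assumption for illustration); in practice a framework’s autograd would compute this and the optimizer would update `a`.

```python
import numpy as np

def prelu(x, a):
    """PReLU forward pass; the negative slope `a` is a learned parameter."""
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    """Gradient of the output w.r.t. `a`: x where x <= 0, else 0."""
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -0.5, 1.0, 3.0])
a = 0.25  # assumed initial slope; updated by the optimizer during training
print(prelu(x, a))
print(prelu_grad_a(x))  # non-zero only for negative inputs
```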
Exponential Linear Unit (ELU) 8
ELU smooths the transition for negative values, improving learning speed and reducing bias shifts.
$$ f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases} $$
Use Case: Deep networks with noisy or normalized inputs.
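A NumPy sketch of ELU with α = 1.0 (a typical choice, assumed here), showing the smooth saturation toward −α for negative inputs.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (e^x - 1) for x <= 0 (alpha = 1.0 assumed)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))  # negative outputs saturate smoothly toward -alpha instead of a hard zero
```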
Softmax
Softmax converts a vector of real numbers into probabilities that sum to 1. It is typically used in the final layer for multi-class classification.9
$$ f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $$
Use Case: Output layers for multi-class problems.
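A NumPy sketch of softmax; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(x):
    """Softmax over a vector of logits, with the max subtracted for numerical stability."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # approximately [0.659 0.242 0.099]
print(p.sum())  # 1.0
```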
Swish 10
Swish is a smooth, non-monotonic function that outperforms ReLU in many tasks. It’s defined as x multiplied by its sigmoid.
$$ f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} $$
Use Case: State-of-the-art deep networks (e.g., EfficientNet).
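A NumPy sketch of Swish in the fixed form shown above (sometimes called SiLU); the sample inputs are arbitrary.

```python
import numpy as np

def swish(x):
    """Swish: x * sigmoid(x), matching f(x) = x / (1 + e^{-x})."""
    return x / (1.0 + np.exp(-x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))  # smooth and non-monotonic: slightly negative for small negative inputs
```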
GELU (Gaussian Error Linear Unit) 11
GELU combines ideas from dropout and ReLU, weighting each input x by $\Phi(x)$, the probability that a standard normal variable falls below x. It provides a smoother activation than ReLU.
$$
f(x) = x \Phi(x)
$$
where $\Phi(x)$ is the Gaussian cumulative distribution function (CDF).
Use Case: Transformer architectures (e.g., BERT).
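A NumPy sketch of GELU in two forms: the exact $x \Phi(x)$ via the error function, and the tanh approximation from Hendrycks & Gimpel (2016) that many Transformer implementations use. The sample inputs are arbitrary.

```python
import math
import numpy as np

def gelu_exact(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu_exact(x))
print(gelu_tanh(x))  # close to the exact values
```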
SELU (Scaled Exponential Linear Unit) 12
SELU scales activations so that they converge toward zero mean and unit variance, enabling self-normalizing networks without explicit batch normalization.
$$ f(x) = \begin{cases} \lambda x, & x > 0 \\ \lambda \alpha (e^{x} - 1), & x \le 0 \end{cases} $$
where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are fixed scaling constants derived mathematically.
Use Case: Self-normalizing networks, particularly feedforward architectures.
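A NumPy sketch of SELU using the constants quoted above (extra digits included for accuracy); feeding it roughly standard-normal data illustrates the self-normalizing fixed point.

```python
import numpy as np

# Constants from Klambauer et al. (2017), rounded in the text above.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.random.randn(100_000)  # roughly zero-mean, unit-variance input
y = selu(x)
print(y.mean(), y.std())      # stays close to 0 and 1: the self-normalizing property
```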
Choosing an Activation Function
| Scenario | Recommended Function |
|---|---|
| Hidden layers (general) | ReLU or Leaky ReLU |
| Transformer models | GELU |
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
| Regression output | Linear (no activation) |
| Self-normalizing networks | SELU |
| State-of-the-art vision models | Swish |
Related Topics
- Neural Networks - How activation functions fit into network architecture
- Loss Functions - Measuring error for training
- Optimization - How gradients flow through activations
References
1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
2. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review.
3. Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques. Morgan Kaufmann.
4. LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (2012). Efficient BackProp. Springer.
5. Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML.
6. Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. ICML.
7. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV.
8. Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289.
9. Bridle, J. S. (1990). Training Stochastic Model Recognition Algorithms as Networks Can Lead to Maximum Mutual Information Estimation of Parameters. NIPS.
10. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941.
11. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
12. Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-Normalizing Neural Networks. NeurIPS.