Model Evaluation

Model evaluation techniques assess how well a trained model performs on unseen data. The right evaluation method depends on the problem type (classification, regression, ranking) and the nature of the dataset.[1]


Quick Reference

| Task Type                  | Primary Metrics                          |
|----------------------------|------------------------------------------|
| Binary Classification      | Accuracy, Precision, Recall, F1, AUC-ROC |
| Multi-class Classification | Macro/Micro F1, Confusion Matrix         |
| Regression                 | MAE, MSE, RMSE, R-squared                |
| Ranking                    | MRR, NDCG, MAP                           |

Data Splitting Strategies

Train-Test Split

The dataset is divided into training and testing subsets, commonly using ratios such as 70:30, 80:20, or 60:40. The training set is used to fit the model, while the test set evaluates its generalization performance.[2]
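
As an illustration, the split can be performed with scikit-learn's train_test_split; the synthetic dataset and logistic-regression model below are placeholder choices, not prescribed by this article.

```python
# Minimal sketch of an 80:20 train-test split (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Hold out 20% of the data; stratify to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```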

Cross-Validation

In k-fold cross-validation, the data is partitioned into k equal folds. The model trains on k-1 folds and validates on the remaining fold, repeating k times. Results are averaged to reduce variance and provide a more robust performance estimate.[2]
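
A minimal sketch of 5-fold cross-validation, again using scikit-learn; the estimator and accuracy scoring are illustrative choices.

```python
# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat, average.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print("Fold scores:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```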


Classification Metrics

Confusion Matrix

A confusion matrix displays the counts of correct and incorrect predictions, comparing predicted labels against actual labels.[1]

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

  • True Positive (TP): Correctly predicted positive class
  • True Negative (TN): Correctly predicted negative class
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • False Negative (FN): Incorrectly predicted negative (Type II error)
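
The four counts can be read off a fitted confusion matrix; the toy label arrays below are made up for illustration.

```python
# Compute the 2x2 confusion matrix for hand-made labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For labels [0, 1], scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```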

Accuracy

Proportion of correct predictions over total predictions:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

Limitation: Misleading when classes are imbalanced.
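
A small sketch of the imbalance pitfall: a degenerate classifier that always predicts the negative class still reaches 95% accuracy on a 95:5 split.

```python
from sklearn.metrics import accuracy_score

y_true = [0] * 95 + [1] * 5   # 95% negatives, 5% positives
y_pred = [0] * 100            # "always negative" predictor

print(accuracy_score(y_true, y_pred))  # 0.95, despite missing every positive
```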

Precision

Proportion of correct positive predictions among all predicted positives:

$$ \text{Precision} = \frac{TP}{TP + FP} $$

Use when: False positives are costly (e.g., spam detection).

Recall (Sensitivity)

Proportion of actual positives correctly identified:

$$ \text{Recall} = \frac{TP}{TP + FN} $$

Use when: False negatives are costly (e.g., disease detection).

F1-Score

Harmonic mean of precision and recall, providing a balanced measure:

$$ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
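
Precision, recall, and F1 can all be computed from the same counts; this sketch reuses the toy predictions from the confusion-matrix example above.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean = 0.75
```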

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. The AUC (Area Under the Curve) quantifies the model’s ability to distinguish between classes, with 0.5 corresponding to random guessing and 1.0 to perfect separation.[3]
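
A brief sketch of ROC analysis with scikit-learn; the synthetic data and logistic-regression classifier are illustrative, and the key point is that ROC/AUC operate on scores or probabilities rather than hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) pair per threshold
print("AUC:", roc_auc_score(y_test, probs))
```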


Regression Metrics

Mean Absolute Error (MAE)

Average absolute difference between predictions and actual values:

$$ \text{MAE} = \frac{1}{n} \sum_i |y_i - \hat{y}_i| $$

Mean Squared Error (MSE)

Average squared difference, penalizing larger errors more heavily:

$$ \text{MSE} = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2 $$

Root Mean Squared Error (RMSE)

Square root of MSE, in the same units as the target variable:

$$ \text{RMSE} = \sqrt{\text{MSE}} $$

R-squared (Coefficient of Determination)

Proportion of variance in the target explained by the model:

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$
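
The four regression metrics above can be computed on the same predictions; the toy values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MAE: ", mean_absolute_error(y_true, y_pred))  # 0.5
print("MSE: ", mse)                                  # 0.375
print("RMSE:", np.sqrt(mse))                         # ~0.612
print("R^2: ", r2_score(y_true, y_pred))             # ~0.949
```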


See Also

  • Loss Functions - Functions optimized during training
  • Regularization - Techniques to prevent overfitting
  • Activation Functions - Non-linearities in neural networks

References


  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. https://hastie.su.domains/ElemStatLearn/

  2. Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI. https://ai.stanford.edu/~ronnyk/accEst.pdf

  3. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874. https://doi.org/10.1016/j.patrec.2005.10.010