
Kernel Deep Dive: Understanding Kernels in Gaussian Processes

Kernels (also called covariance functions) are a fundamental component of Gaussian Processes (GPs). They encode our assumptions about the underlying function we are modeling and play a central role in determining the GP's predictions and uncertainty.


What Is a Kernel?

Mathematically, a kernel is a function \(k(x, x')\) that defines the similarity or correlation between two points \(x\) and \(x'\) in the input space. In a GP, the kernel determines the covariance matrix \(\mathbf{K}\) for all pairs of input points, which in turn defines the joint distribution over function values.

Given a set of input points \(\mathbf{X} = [x_1, x_2, \ldots, x_n]\), the GP prior is:

\[ f(\mathbf{X}) \sim \mathcal{N}(\mu(\mathbf{X}), \mathbf{K}) \]

where \(\mathbf{K}_{ij} = k(x_i, x_j)\).

Functionally, the kernel controls:

  • The smoothness and complexity of the functions the GP can model.
  • How information from observed data points influences predictions at new points.
  • The ability to capture periodicity, trends, or other structural properties.
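
To make the role of the covariance matrix concrete, the following sketch builds \(\mathbf{K}\) entry by entry from an RBF kernel (defined in the next section) and draws sample functions from a zero-mean GP prior. The grid and hyperparameter values are illustrative choices, not defaults of any library.

```python
import numpy as np

# Minimal sketch: build K_ij = k(x_i, x_j) for an RBF kernel and sample from
# the zero-mean GP prior f(X) ~ N(0, K). Hyperparameters are illustrative.
def rbf(x, x_prime, sigma2=1.0, lengthscale=1.0):
    return sigma2 * np.exp(-(x - x_prime) ** 2 / (2 * lengthscale ** 2))

X = np.linspace(0.0, 5.0, 50)                          # 1-D input grid
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])  # covariance matrix

# Add a small jitter to the diagonal for numerical stability, then draw
# three functions from the prior; larger lengthscales give smoother draws.
samples = np.random.multivariate_normal(
    mean=np.zeros(len(X)), cov=K + 1e-8 * np.eye(len(X)), size=3
)
```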

Common Kernels

1. Radial Basis Function (RBF) / Squared Exponential / Gaussian Kernel

The RBF kernel is the most widely used kernel; it corresponds to functions that are infinitely differentiable (very smooth).

\[ k_{\text{RBF}}(x, x') = \sigma^2 \exp\left( -\frac{||x - x'||^2}{2\ell^2} \right) \]
  • \(\sigma^2\) is the signal variance (controls overall scale).
  • \(\ell\) is the lengthscale (controls how quickly correlation decays with distance).

Properties:

  • Produces very smooth functions.
  • Good default for many problems.
  • Implemented in both scikit-optimize and BoTorch.

References:
sklearn RBF kernel
BoTorch RBF kernel
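
As a concrete illustration, here is a small sketch using scikit-learn's kernel classes referenced above; the signal variance \(\sigma^2\) is supplied via ConstantKernel, and the hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# sigma^2 = 2.0 (signal variance), lengthscale = 0.5 -- illustrative values
kernel = ConstantKernel(constant_value=2.0) * RBF(length_scale=0.5)

X = np.array([[0.0], [0.5], [2.0]])
K = kernel(X)  # 3x3 covariance matrix; entries decay with squared distance
print(np.round(K, 3))
```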


2. Matern Kernel

The Matern kernel is a generalization of the RBF kernel with an additional parameter \(\nu\) that controls smoothness.

\[ k_{\text{Matern}}(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu} ||x - x'||}{\ell} \right)^\nu K_\nu \left( \frac{\sqrt{2\nu} ||x - x'||}{\ell} \right) \]
  • \(\nu\) (nu): Smoothness parameter. Common values are 0.5, 1.5, 2.5, and \(\infty\).
    • \(\nu = 0.5\): Exponential kernel (rough, non-differentiable sample functions)
    • \(\nu = 1.5\): Sample functions are once differentiable
    • \(\nu = 2.5\): Sample functions are twice differentiable
    • \(\nu \to \infty\): Recovers the RBF kernel
  • \(K_\nu\) is the modified Bessel function of the second kind.

Properties:

  • Allows control over function roughness.
  • Lower \(\nu\) allows modeling rougher, less smooth functions.
  • Implemented in both scikit-optimize and BoTorch.

References:
sklearn Matern kernel
BoTorch Matern kernel
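
The sketch below (using scikit-learn's Matern class with illustrative parameters) shows how the covariance between two points a unit distance apart changes with \(\nu\) and approaches the RBF value as \(\nu\) grows.

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern, RBF

X = np.array([[0.0], [1.0]])  # two points a unit distance apart

for nu in (0.5, 1.5, 2.5):
    k = Matern(length_scale=1.0, nu=nu)
    print(f"nu={nu}: k(0, 1) = {k(X)[0, 1]:.3f}")

# For comparison: the RBF kernel is the nu -> infinity limit
print(f"RBF:    k(0, 1) = {RBF(length_scale=1.0)(X)[0, 1]:.3f}")
```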


3. Rational Quadratic Kernel

The Rational Quadratic kernel can be seen as a scale mixture of RBF kernels with different lengthscales.

\[ k_{\text{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{||x - x'||^2}{2\alpha \ell^2} \right)^{-\alpha} \]
  • \(\alpha\) controls the relative weighting of large-scale and small-scale variations.
  • As \(\alpha \to \infty\), the kernel approaches the RBF kernel.

Properties:

  • Can model functions with varying smoothness.
  • Useful when the function exhibits both short- and long-range correlations.
  • Currently implemented in the scikit-optimize backend.

References:
sklearn RationalQuadratic kernel
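
A quick sketch (scikit-learn classes, illustrative values) demonstrating the limiting behaviour: as \(\alpha\) grows, the Rational Quadratic covariance approaches the RBF covariance.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

X = np.array([[0.0], [1.0]])
rbf_value = RBF(length_scale=1.0)(X)[0, 1]

for alpha in (0.5, 5.0, 500.0):
    rq_value = RationalQuadratic(length_scale=1.0, alpha=alpha)(X)[0, 1]
    print(f"alpha={alpha:>6}: RQ = {rq_value:.4f}  (RBF = {rbf_value:.4f})")
```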


Anisotropic Kernels and Automatic Relevance Determination (ARD)

Isotropic vs. Anisotropic

  • Isotropic kernel: Uses a single lengthscale \(\ell\) for all input dimensions.

    \(k(x, x') = k(||x - x'||)\)

  • Anisotropic kernel: Uses a separate lengthscale \(\ell_d\) for each input dimension \(d\). For the RBF kernel, this gives:

    \(k(x, x') = \sigma^2 \exp\left( -\sum_{d=1}^D \frac{(x_d - x'_d)^2}{2\ell_d^2} \right)\)

Automatic Relevance Determination (ARD)

ARD refers to the process where the model learns a separate lengthscale for each input variable. If a variable is not relevant to the output, its learned lengthscale typically becomes very large, effectively reducing its influence on the predictions.

Benefits:

  • Helps identify which variables are important for predicting the output.
  • Improves interpretability and can lead to more efficient optimization.

Both scikit-optimize and BoTorch support anisotropic kernels and ARD by default.
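
The following sketch illustrates ARD with scikit-learn's GaussianProcessRegressor (used here as a stand-in for either backend; the toy data and hyperparameters are made up): the target depends only on the first input, so the learned lengthscale for the second input is expected to become large.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))        # two inputs, but only the first matters
y = np.sin(6 * X[:, 0]) + 0.05 * rng.normal(size=60)

# One lengthscale per dimension (anisotropic RBF / ARD); these are only the
# initial values -- the fit maximizes the marginal likelihood to learn them.
kernel = RBF(length_scale=[1.0, 1.0])
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# The lengthscale for the irrelevant second dimension typically grows large.
print(gpr.kernel_.length_scale)
```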


Choosing a Kernel

  • RBF: Good default for smooth, well-behaved functions.
  • Matern: Use when you expect the function to be less smooth or want to control smoothness. Lower \(\nu\) for rougher functions, higher \(\nu\) for smoother.
  • Rational Quadratic: Use when you suspect the function has varying smoothness or both short- and long-range correlations.

Tips:

  • If unsure, start with Matern (\(\nu=2.5\) or \(1.5\)) or RBF.
  • Try different kernels and compare cross-validation metrics (RMSE, MAE, etc.); see the sketch after this list.
  • Use ARD to let the model determine variable relevance.
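
As a starting point for the comparison tip above, here is a rough sketch using scikit-learn's GaussianProcessRegressor with cross-validation; the toy data set, the kernel list, and the scoring choice are all illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic
from sklearn.model_selection import cross_val_score

# Toy 1-D data set (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

kernels = {
    "RBF": RBF(),
    "Matern(nu=1.5)": Matern(nu=1.5),
    "Matern(nu=2.5)": Matern(nu=2.5),
    "RationalQuadratic": RationalQuadratic(),
}

# Compare kernels by cross-validated RMSE (lower is better)
for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    scores = cross_val_score(gpr, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name:<18s} RMSE = {-scores.mean():.3f}")
```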

Further Reading