Kernel Deep Dive: Understanding Kernels in Gaussian Processes¶

Kernels (also called covariance functions) are a fundamental component of Gaussian Processes (GPs). They encode our assumptions about the underlying function we are modeling and play a central role in determining the GP's predictions and uncertainty.

What Is a Kernel?¶

Mathematically, a kernel is a function \(k(x, x')\) that specifies the covariance between the function values at two points \(x\) and \(x'\) in the input space, and so acts as a measure of their similarity. To be a valid covariance function, it must be symmetric and positive semi-definite. In a GP, evaluating the kernel on all pairs of input points gives the covariance matrix \(\mathbf{K}\), which in turn defines the joint distribution over function values.

Given a set of input points \(\mathbf{X} = [x_1, x_2, \ldots, x_n]\), the GP prior is:

\[ f(\mathbf{X}) \sim \mathcal{N}(\mu(\mathbf{X}), \mathbf{K}) \]

where \(\mu\) is the mean function and \(\mathbf{K}_{ij} = k(x_i, x_j)\).
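As a minimal sketch of the prior above, the snippet below builds \(\mathbf{K}\) from a kernel and draws a few function samples. The zero mean function, the squared-exponential kernel, the 1-D input grid, and the helper name `rbf_kernel` are all illustrative assumptions, not part of any particular library.

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs (an illustrative choice)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 50)      # n = 50 input points
K = rbf_kernel(X, X)               # K[i, j] = k(x_i, x_j)
mean = np.zeros_like(X)            # assumed zero mean function, mu(X) = 0

# A small jitter on the diagonal keeps the covariance numerically
# positive semi-definite when sampling.
jitter = 1e-6 * np.eye(len(X))
samples = rng.multivariate_normal(mean, K + jitter, size=3)

print(K.shape, samples.shape)      # (50, 50) (3, 50)
```

Each row of `samples` is one function drawn from the prior, evaluated at the 50 inputs.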

Functionally, the kernel controls:

- The smoothness and complexity of the functions the GP can model.
- How information from observed data points influences predictions at new points.
- The ability to capture periodicity, trends, or other structural properties.
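To illustrate how the kernel carries information from observed points to new ones, here is a sketch of the standard GP posterior equations in NumPy. The helper names, the RBF kernel, the zero prior mean, and the tiny noise term are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-6, **kernel_args):
    """Posterior mean and pointwise variance of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train, **kernel_args) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, **kernel_args)    # train/test covariances
    K_ss = rbf_kernel(x_test, x_test, **kernel_args)    # test/test covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, np.diag(cov)

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([1.0, 1.5, 10.0])    # at, near, and far from the data

mean, var = gp_posterior(x_train, y_train, x_test, lengthscale=1.0)
print(mean, var)
```

At a training input the posterior variance is close to zero, between observations it is small, and far from the data (here at 10.0) the prediction falls back to the prior mean and variance because the kernel assigns almost no covariance with the observations.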

Common Kernels¶

1. Radial Basis Function (RBF) / Squared Exponential / Gaussian Kernel¶

The RBF kernel is the most widely used kernel. It assumes the function is infinitely differentiable, so the functions it models are very smooth.

\[ k_{\text{RBF}}(x, x') = \sigma^2 \exp\left( -\frac{||x - x'||^2}{2\ell^2} \right) \]

- \(\sigma^2\) is the signal variance (controls overall scale).
- \(\ell\) is the lengthscale (controls how quickly correlation decays with distance).
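The formula can be written directly in NumPy. This sketch uses a hypothetical helper `rbf` and two points one unit apart to show how the lengthscale changes the covariance at a fixed distance.

```python
import numpy as np

def rbf(x, x_prime, signal_var=1.0, lengthscale=1.0):
    """k(x, x') = signal_var * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return signal_var * np.exp(-sq_dist / (2.0 * lengthscale**2))

x = np.array([0.0, 0.0])
x_prime = np.array([1.0, 0.0])     # unit distance from x

for ell in (0.5, 1.0, 2.0):
    print(f"lengthscale={ell}: k = {rbf(x, x_prime, lengthscale=ell):.4f}")
# lengthscale=0.5: k = 0.1353
# lengthscale=1.0: k = 0.6065
# lengthscale=2.0: k = 0.8825
```

A short lengthscale makes the two points nearly uncorrelated, while a long one keeps them strongly correlated. At zero distance the kernel always returns the signal variance.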

Properties:

- Produces very smooth functions.
- Good default for many problems.
- Implemented in both scikit-optimize and BoTorch.

References:
sklearn RBF kernel
BoTorch RBF kernel

2. Matern Kernel¶

The Matern kernel is a generalization of the RBF kernel with an additional parameter \(\nu\) that controls smoothness.

\[ k_{\text{Matern}}(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu} ||x - x'||}{\ell} \right)^\nu K_\nu \left( \frac{\sqrt{2\nu} ||x - x'||}{\ell} \right) \]

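- \(\nu\) (nu): Smoothness parameter. Common values are 0.5, 1.5, 2.5, and \(\infty\).
  - \(\nu = 0.5\): Exponential kernel (continuous but not differentiable, giving the roughest functions)
  - \(\nu = 1.5\): Once differentiable
  - \(\nu = 2.5\): Twice differentiable
  - \(\nu \to \infty\): Recovers the RBF kernel
- \(K_\nu\) is the modified Bessel function of the second kind, and \(\Gamma\) is the gamma function.
- \(\sigma^2\) and \(\ell\) are the signal variance and lengthscale, as in the RBF kernel.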
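For the half-integer values of \(\nu\) listed above, the general formula reduces to well-known closed forms that avoid the Bessel function. The sketch below implements those three cases with a hypothetical helper `matern` that takes the distance \(r = ||x - x'||\) directly, and compares them with the RBF kernel at the same distance.

```python
import numpy as np

def matern(r, nu, signal_var=1.0, lengthscale=1.0):
    """Closed-form Matern kernel for nu in {0.5, 1.5, 2.5}; r is ||x - x'||."""
    s = np.asarray(r, dtype=float) / lengthscale
    if nu == 0.5:
        return signal_var * np.exp(-s)
    if nu == 1.5:
        a = np.sqrt(3.0) * s
        return signal_var * (1.0 + a) * np.exp(-a)
    if nu == 2.5:
        a = np.sqrt(5.0) * s
        return signal_var * (1.0 + a + a**2 / 3.0) * np.exp(-a)
    raise ValueError("this sketch only covers nu = 0.5, 1.5, 2.5")

def rbf(r, signal_var=1.0, lengthscale=1.0):
    return signal_var * np.exp(-0.5 * (np.asarray(r, dtype=float) / lengthscale) ** 2)

r = 1.0
for nu in (0.5, 1.5, 2.5):
    print(f"nu={nu}: k(r=1) = {float(matern(r, nu)):.4f}")
print(f"RBF:    k(r=1) = {float(rbf(r)):.4f}")
```

At this distance the covariance increases with \(\nu\) and approaches the RBF value from below, consistent with the RBF kernel being the \(\nu \to \infty\) limit.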

Properties:

- Allows control over function roughness.
- Lower \(\nu\) allows modeling rougher, less smooth functions.
- Implemented in both scikit-optimize and BoTorch.

References:
sklearn Matern kernel
BoTorch Matern kernel

3. Rational Quadratic Kernel¶

The Rational Quadratic kernel can be seen as a scale mixture of RBF kernels with different lengthscales.

\[ k_{\text{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{||x - x'||^2}{2\alpha \ell^2} \right)^{-\alpha} \]

- \(\alpha\) controls the relative weighting of large-scale and small-scale variations.
- As \(\alpha \to \infty\), the kernel approaches the RBF kernel.
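The limit can be checked numerically. This sketch uses hypothetical helpers that take the distance \(r = ||x - x'||\) as input and compares the Rational Quadratic kernel with the RBF kernel for increasing \(\alpha\).

```python
import numpy as np

def rational_quadratic(r, alpha, signal_var=1.0, lengthscale=1.0):
    """k(r) = signal_var * (1 + r^2 / (2 * alpha * lengthscale^2)) ** (-alpha)."""
    return signal_var * (1.0 + r**2 / (2.0 * alpha * lengthscale**2)) ** (-alpha)

def rbf(r, signal_var=1.0, lengthscale=1.0):
    return signal_var * np.exp(-(r**2) / (2.0 * lengthscale**2))

r = 1.0
for alpha in (0.5, 5.0, 50.0, 1e6):
    gap = abs(rational_quadratic(r, alpha) - rbf(r))
    print(f"alpha={alpha:g}: |k_RQ - k_RBF| = {gap:.2e}")

# With small alpha the tail decays polynomially, so distant points
# stay more correlated than under the RBF kernel.
print(rational_quadratic(3.0, alpha=1.0), rbf(3.0))
```

The gap shrinks as \(\alpha\) grows, while a small \(\alpha\) leaves a heavier tail that keeps long-range correlations alive.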

Properties:

- Can model functions that vary on several lengthscales at once.
- Useful when the function exhibits both short- and long-range correlations.
- Currently implemented in the scikit-optimize backend.

References:
sklearn RationalQuadratic kernel

Anisotropic Kernels and Automatic Relevance Determination (ARD)¶

Isotropic vs. Anisotropic¶

- Isotropic kernel: Uses a single lengthscale \(\ell\) for all input dimensions, so the kernel depends on the inputs only through their distance:

  \(k(x, x') = \kappa(||x - x'||)\)

- Anisotropic kernel: Uses a separate lengthscale \(\ell_d\) for each input dimension \(d\). For the RBF kernel this gives:

  \(k(x, x') = \sigma^2 \exp\left( -\sum_{d=1}^D \frac{(x_d - x'_d)^2}{2\ell_d^2} \right)\)
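A minimal NumPy sketch of the anisotropic RBF form follows. The helper name `ard_rbf` and the example points are assumptions for illustration. Each coordinate difference is divided by its own lengthscale before the squared distance is taken.

```python
import numpy as np

def ard_rbf(x, x_prime, lengthscales, signal_var=1.0):
    """Anisotropic RBF: one lengthscale per input dimension."""
    diff = (np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)) / np.asarray(lengthscales, dtype=float)
    return signal_var * np.exp(-0.5 * np.sum(diff**2))

x = np.array([0.0, 0.0])
x_prime = np.array([1.0, 2.0])

# Equal lengthscales reproduce the isotropic kernel.
print(ard_rbf(x, x_prime, lengthscales=[2.0, 2.0]))

# A long lengthscale in dimension 1 makes the kernel nearly blind to
# the difference in that coordinate.
print(ard_rbf(x, x_prime, lengthscales=[1.0, 1e6]))
```

With equal lengthscales the result matches the isotropic RBF value for the same distance, and with a very long second lengthscale only the first coordinate contributes.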

Automatic Relevance Determination (ARD)¶

ARD refers to learning a separate lengthscale for each input variable as a kernel hyperparameter. If a variable is not relevant to the output, its learned lengthscale tends to become very large, so the kernel barely changes along that dimension and the variable's influence on the model is effectively removed.
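The toy experiment below suggests why this happens. Lengthscales are usually fitted by maximizing the log marginal likelihood, so the sketch compares that quantity for a short and a long lengthscale on an input that does not affect the output. The synthetic data, the fixed hyperparameter values, and the helper names are all assumptions made for illustration. A real backend would optimize the lengthscales rather than compare two hand-picked settings.

```python
import numpy as np

def ard_rbf_matrix(X, lengthscales, signal_var=1.0):
    """Covariance matrix of the anisotropic RBF kernel on the rows of X."""
    Z = X / np.asarray(lengthscales, dtype=float)
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq)

def log_marginal_likelihood(X, y, lengthscales, noise_var):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = len(y)
    K = ard_rbf_matrix(X, lengthscales) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(30, 2))
# The output depends on the first input only; the second is irrelevant.
y = np.sin(4.0 * X[:, 0]) + 0.05 * rng.standard_normal(30)

lml_short = log_marginal_likelihood(X, y, [0.3, 0.1], noise_var=0.05**2)
lml_long = log_marginal_likelihood(X, y, [0.3, 1000.0], noise_var=0.05**2)
print(f"short lengthscale on irrelevant input: {lml_short:.1f}")
print(f"long lengthscale on irrelevant input:  {lml_long:.1f}")
```

The setting with the long lengthscale on the irrelevant input scores higher, which is the direction a lengthscale optimizer would move in.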

Benefits:

- Helps identify which variables are important for predicting the output.
- Improves interpretability and can lead to more efficient optimization.

Both scikit-optimize and BoTorch support anisotropic kernels and ARD by default.

Choosing a Kernel¶

- RBF: Good default for smooth, well-behaved functions.
- Matern: Use when you expect the function to be less smooth or want to control smoothness. Choose a lower \(\nu\) for rougher functions and a higher \(\nu\) for smoother ones.
- Rational Quadratic: Use when you suspect the function varies on several lengthscales, with both short- and long-range correlations.

Tips:

- If unsure, start with Matern (\(\nu=2.5\) or \(1.5\)) or RBF.
- Try different kernels and compare cross-validation metrics (RMSE, MAE, etc.).
- Use ARD to let the model determine variable relevance.
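One way to compare kernels by cross-validation is sketched below using scikit-learn's Gaussian process classes directly. The synthetic data, the candidate list, the added `WhiteKernel` noise term, and the 5-fold split are assumptions for this example, and the backends discussed on this page may expose kernel selection differently.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic, WhiteKernel
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)   # synthetic toy data

candidates = {
    "RBF": RBF(length_scale=1.0),
    "Matern(nu=2.5)": Matern(length_scale=1.0, nu=2.5),
    "RationalQuadratic": RationalQuadratic(length_scale=1.0, alpha=1.0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmse = {}
for name, kernel in candidates.items():
    # WhiteKernel lets the model fit a noise level alongside the kernel parameters.
    model = GaussianProcessRegressor(
        kernel=kernel + WhiteKernel(noise_level=0.01), normalize_y=True
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.3f}")
```

On smooth toy data like this the candidates often score similarly, so differences are more informative on real data where the smoothness assumption matters.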

Further Reading¶