Kernel Deep Dive: Understanding Kernels in Gaussian Processes
Kernels (also called covariance functions) are a fundamental component of Gaussian Processes (GPs). They encode our assumptions about the underlying function we are modeling and play a central role in determining the GP's predictions and uncertainty.
What Is a Kernel?
Mathematically, a kernel is a function \(k(x, x')\) that measures the similarity between two points \(x\) and \(x'\) in the input space. In a GP, \(k(x, x')\) is the covariance between the function values \(f(x)\) and \(f(x')\), so the kernel determines the covariance matrix \(\mathbf{K}\) for all pairs of input points, which in turn defines the joint distribution over function values. To yield a valid covariance matrix, the kernel must be symmetric and positive semidefinite.
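Given a set of input points \(\mathbf{X} = [x_1, x_2, \ldots, x_n]\), the GP prior over the function values \(\mathbf{f} = [f(x_1), \ldots, f(x_n)]\) is, taking a zero mean function for simplicity:
\[\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})\]
where \(\mathbf{K}_{ij} = k(x_i, x_j)\).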
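To make the construction of \(\mathbf{K}\) concrete, here is a minimal sketch in plain Python that fills in \(\mathbf{K}_{ij} = k(x_i, x_j)\) for three scalar inputs. The helper names `rbf` and `covariance_matrix` and the input values are illustrative assumptions, not part of either backend, and the RBF kernel (described later on this page) is used only as an example choice.

```python
import math


def rbf(x, x_prime, variance=1.0, lengthscale=1.0):
    """Illustrative RBF kernel for scalar inputs (hypothetical helper)."""
    return variance * math.exp(-((x - x_prime) ** 2) / (2.0 * lengthscale**2))


def covariance_matrix(xs, kernel):
    """Build K with K[i][j] = kernel(xs[i], xs[j])."""
    return [[kernel(a, b) for b in xs] for a in xs]


xs = [0.0, 0.5, 2.0]  # example inputs, chosen arbitrarily
K = covariance_matrix(xs, rbf)

for row in K:
    print(["%.3f" % value for value in row])
```

The diagonal holds the signal variance, nearby inputs (0.0 and 0.5) get a covariance close to it, and distant inputs (0.0 and 2.0) are only weakly correlated.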
Functionally, the kernel controls:
- The smoothness and complexity of the functions the GP can model.
- How information from observed data points influences predictions at new points.
- The ability to capture periodicity, trends, or other structural properties.
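The second point can be illustrated with the simplest possible case: a zero-mean GP conditioned on a single noise-free observation. The sketch below applies the standard Gaussian conditioning formulas; the RBF kernel, the function names, and the numbers are assumptions chosen for illustration only.

```python
import math


def rbf(x, x_prime, variance=1.0, lengthscale=1.0):
    """Illustrative RBF kernel for scalar inputs (hypothetical helper)."""
    return variance * math.exp(-((x - x_prime) ** 2) / (2.0 * lengthscale**2))


def posterior_from_one_point(x_new, x_obs, y_obs, kernel):
    """Posterior mean and variance at x_new for a zero-mean GP with one
    noise-free observation (x_obs, y_obs)."""
    k_on = kernel(x_obs, x_new)  # covariance between observed and new point
    k_oo = kernel(x_obs, x_obs)  # prior variance at the observed point
    mean = k_on / k_oo * y_obs
    variance = kernel(x_new, x_new) - k_on**2 / k_oo
    return mean, variance


x_obs, y_obs = 0.0, 2.0
for x_new in (0.0, 0.5, 3.0):
    mean, var = posterior_from_one_point(x_new, x_obs, y_obs, rbf)
    print(f"x = {x_new}: mean = {mean:.3f}, variance = {var:.3f}")
```

At the observed location the prediction reproduces the observation with zero variance. Far from it, the kernel value is close to zero, so the prediction falls back to the prior mean and prior variance.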
Common Kernels
1. Radial Basis Function (RBF) / Squared Exponential / Gaussian Kernel
The RBF kernel is the most widely used kernel and assumes the function is infinitely differentiable:
\[k(x, x') = \sigma^2 \exp\left( -\frac{||x - x'||^2}{2\ell^2} \right)\]
- \(\sigma^2\) is the signal variance (controls the overall scale of the function values).
- \(\ell\) is the lengthscale (controls how quickly correlation decays with distance).
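As a rough illustration of these two parameters, the sketch below evaluates the standard RBF form \(\sigma^2 \exp(-||x - x'||^2 / (2\ell^2))\) for a fixed pair of points while varying the lengthscale. The function name `rbf_kernel` and the example values are assumptions for illustration; the library implementations are vectorized and parameterized differently.

```python
import math


def rbf_kernel(x, x_prime, variance=1.0, lengthscale=1.0):
    """variance * exp(-||x - x'||^2 / (2 * lengthscale^2)) for equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return variance * math.exp(-sq_dist / (2.0 * lengthscale**2))


x, x_prime = [0.0, 0.0], [1.0, 1.0]  # squared distance = 2

for lengthscale in (0.5, 1.0, 5.0):
    value = rbf_kernel(x, x_prime, variance=2.0, lengthscale=lengthscale)
    print(f"lengthscale = {lengthscale}: k = {value:.4f}")
```

A longer lengthscale keeps the two points strongly correlated, while the signal variance sets the largest value the kernel can take, reached when \(x = x'\).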
Properties:
- Produces very smooth functions.
- Good default for many problems.
- Implemented in both scikit-optimize and BoTorch.
References:
sklearn RBF kernel
BoTorch RBF kernel
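2. Matern Kernel
The Matern kernel is a generalization of the RBF kernel with an additional parameter \(\nu\) that controls smoothness:
\[k(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, ||x - x'||}{\ell} \right)^{\nu} K_\nu\left( \frac{\sqrt{2\nu}\, ||x - x'||}{\ell} \right)\]
- \(\sigma^2\) and \(\ell\) are the signal variance and lengthscale, as in the RBF kernel.
- \(\nu\) (nu): Smoothness parameter. Common values are 0.5, 1.5, 2.5, and \(\infty\).
    - \(\nu = 0.5\): Exponential kernel (less smooth, rougher functions)
    - \(\nu = 1.5\): Once differentiable
    - \(\nu = 2.5\): Twice differentiable
    - \(\nu \to \infty\): Recovers the RBF kernel
- \(K_\nu\) is the modified Bessel function of the second kind, and \(\Gamma\) is the gamma function.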
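For the half-integer values \(\nu = 0.5\), \(1.5\), and \(2.5\), the Matern kernel reduces to well-known closed forms that need no Bessel function. The sketch below implements those three cases as a function of the distance \(r = ||x - x'||\) and compares them with the RBF kernel. The function names and the test distance are illustrative assumptions, not a library API.

```python
import math


def matern_kernel(r, nu, variance=1.0, lengthscale=1.0):
    """Matern kernel as a function of distance r, for the half-integer
    values of nu that have simple closed forms."""
    s = r / lengthscale
    if nu == 0.5:
        return variance * math.exp(-s)
    if nu == 1.5:
        a = math.sqrt(3.0) * s
        return variance * (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * s
        return variance * (1.0 + a + a**2 / 3.0) * math.exp(-a)
    raise ValueError("this sketch only covers nu in {0.5, 1.5, 2.5}")


def rbf_kernel(r, variance=1.0, lengthscale=1.0):
    """RBF kernel as a function of distance r (the nu -> infinity limit)."""
    return variance * math.exp(-(r**2) / (2.0 * lengthscale**2))


r = 0.5
for nu in (0.5, 1.5, 2.5):
    print(f"nu = {nu}: k = {matern_kernel(r, nu):.4f}")
print(f"RBF: k = {rbf_kernel(r):.4f}")
```

For this distance and a unit lengthscale, the printed values increase with \(\nu\) and approach the RBF value, which matches the limiting behavior listed above.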
Properties:
- Allows control over function roughness.
- Lower \(\nu\) allows modeling rougher, less smooth functions.
- Implemented in both scikit-optimize and BoTorch.
References:
sklearn Matern kernel
BoTorch Matern kernel
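3. Rational Quadratic Kernel
The Rational Quadratic kernel can be seen as a scale mixture of RBF kernels with different lengthscales:
\[k(x, x') = \sigma^2 \left( 1 + \frac{||x - x'||^2}{2\alpha\ell^2} \right)^{-\alpha}\]
- \(\sigma^2\) and \(\ell\) are the signal variance and lengthscale, as in the RBF kernel.
- \(\alpha\) controls the relative weighting of large-scale and small-scale variations.
- As \(\alpha \to \infty\), the kernel approaches the RBF kernel.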
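The effect of \(\alpha\) and the RBF limit can be checked numerically. The sketch below assumes the common parameterization \(\sigma^2 (1 + r^2 / (2\alpha\ell^2))^{-\alpha}\) with \(r = ||x - x'||\); the function names and the chosen distance are illustrative assumptions.

```python
import math


def rational_quadratic(r, alpha, variance=1.0, lengthscale=1.0):
    """variance * (1 + r^2 / (2 * alpha * lengthscale^2)) ** (-alpha)."""
    return variance * (1.0 + r**2 / (2.0 * alpha * lengthscale**2)) ** (-alpha)


def rbf_kernel(r, variance=1.0, lengthscale=1.0):
    """RBF kernel as a function of distance r, for comparison."""
    return variance * math.exp(-(r**2) / (2.0 * lengthscale**2))


r = 2.0
for alpha in (0.1, 1.0, 10.0, 1000.0):
    print(f"alpha = {alpha}: k = {rational_quadratic(r, alpha):.4f}")
print(f"RBF: k = {rbf_kernel(r):.4f}")
```

Small values of \(\alpha\) give a heavier tail, so correlation persists over longer distances, while large values bring the kernel close to the RBF value at the same distance.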
Properties:
- Can model functions that vary across several lengthscales at once.
- Useful when the function exhibits both short- and long-range correlations.
- Currently implemented in the scikit-optimize backend.
References:
sklearn RationalQuadratic kernel
Anisotropic Kernels and Automatic Relevance Determination (ARD)
Isotropic vs. Anisotropic
- Isotropic kernel: Uses a single lengthscale \(\ell\) for all input dimensions.
    \(k(x, x') = k(||x - x'||)\)
- Anisotropic kernel: Uses a separate lengthscale \(\ell_d\) for each input dimension \(d\).
    \(k(x, x') = \exp\left( -\sum_{d=1}^D \frac{(x_d - x'_d)^2}{2\ell_d^2} \right)\)
Automatic Relevance Determination (ARD)
ARD refers to the process where the model learns a separate lengthscale for each input variable. If a variable is not relevant to the output, its fitted lengthscale tends to become very large, so that changes along that variable barely affect the kernel value and its influence on the model is effectively removed.
Benefits:
- Helps identify which variables are important for predicting the output.
- Improves interpretability and can lead to more efficient optimization.
Both scikit-optimize and BoTorch support anisotropic kernels and ARD by default.
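The way a large lengthscale switches off an input can be seen directly from the anisotropic RBF formula. The sketch below is a plain-Python illustration with hand-picked lengthscales. In practice the lengthscales are learned during fitting, and the function name `ard_rbf` and the optional `variance` argument are assumptions not taken from either backend.

```python
import math


def ard_rbf(x, x_prime, lengthscales, variance=1.0):
    """Anisotropic RBF kernel: one lengthscale per input dimension."""
    scaled_sq_dist = sum(
        ((a - b) / ell) ** 2 for a, b, ell in zip(x, x_prime, lengthscales)
    )
    return variance * math.exp(-0.5 * scaled_sq_dist)


# Hand-picked values: dimension 0 is relevant, dimension 1 is nearly irrelevant.
lengthscales = [0.5, 1000.0]

base = [0.0, 0.0]
moved_dim0 = [1.0, 0.0]  # same step size, taken along dimension 0
moved_dim1 = [0.0, 1.0]  # same step size, taken along dimension 1

print(f"step along dim 0: k = {ard_rbf(base, moved_dim0, lengthscales):.6f}")
print(f"step along dim 1: k = {ard_rbf(base, moved_dim1, lengthscales):.6f}")
```

A unit step along the short-lengthscale dimension sharply reduces the correlation, while the same step along the long-lengthscale dimension leaves it almost unchanged. Inspecting fitted lengthscales in this way is what makes ARD useful for judging variable relevance.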
Choosing a Kernel
- RBF: Good default for smooth, well-behaved functions.
- Matern: Use when you expect the function to be less smooth or want to control smoothness. Lower \(\nu\) for rougher functions, higher \(\nu\) for smoother.
- Rational Quadratic: Use when you suspect the function varies across several lengthscales, with both short- and long-range correlations.
Tips:
- If unsure, start with Matern (\(\nu=2.5\) or \(1.5\)) or RBF.
- Try different kernels and compare cross-validation metrics (RMSE, MAE, etc.).
- Use ARD to let the model determine variable relevance.