Gaussian processes (GPs) are a remarkable class of models. Very few machine learning algorithms give you an accurate measure of uncertainty essentially for free while remaining so flexible. The problem is that GPs are conceptually difficult to grasp: most explanations lean on dense algebra and probability theory, which rarely helps build an intuition for how these models actually work.
There are also many excellent guides that skip the math and give you an intuition for how these models work, but my personal belief is that superficial knowledge will not be enough when it comes to using GPs yourself, in the right context. That's why I wanted to walk through a simple implementation, from scratch, so that you have a clearer picture of what's going on under the hood of all the libraries that implement these models for you.
I also link to my GitHub repository, where you will find an implementation of GPs using only NumPy. I've tried to abstract away the math as much as possible, but some of it is unavoidable…
The first step is always to look at the data. We will use monthly atmospheric CO2 concentrations over time, measured at the Mauna Loa Observatory, a dataset commonly used by GP practitioners (1). This is intentionally the same dataset that scikit-learn uses in its Gaussian process regression tutorial, which teaches you how to use their API rather than what happens under the hood of the model.
This is a very simple dataset, which will make the calculations that follow easier to explain. Its notable features are a linear upward trend and a seasonal cycle with a period of one year.
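To keep things self-contained, here is a small synthetic stand-in with the same structure: a linear upward trend plus a yearly seasonal cycle plus a little noise. The slope, offset, and amplitude below are made-up illustrative values, not the real Mauna Loa measurements.

```python
import numpy as np

# Illustrative stand-in for the Mauna Loa series (not real data):
# a linear upward trend, a one-year seasonal cycle, and mild noise.
rng = np.random.default_rng(0)
t = np.arange(0, 20, 1 / 12)            # 20 years of monthly samples
trend = 315.0 + 1.5 * t                 # linear upward trend (ppm, made up)
seasonal = 2.5 * np.sin(2 * np.pi * t)  # period of one year
co2 = trend + seasonal + rng.normal(scale=0.3, size=t.size)
```

Plotting `co2` against `t` reproduces the two features we care about: the overall upward drift and the regular yearly oscillation on top of it.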
What we're going to do is separate the seasonal and linear components of the data. To do this, we first fit a linear model to the data.
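A minimal sketch of this detrending step, using an ordinary least-squares line fit via `np.polyfit` on an illustrative CO2-like series (the series itself is synthetic, not the real data):

```python
import numpy as np

# Synthetic CO2-like series: linear trend + yearly cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(0, 20, 1 / 12)
co2 = 315.0 + 1.5 * t + 2.5 * np.sin(2 * np.pi * t) \
      + rng.normal(scale=0.3, size=t.size)

# Fit a degree-1 polynomial (a line) to the series...
slope, intercept = np.polyfit(t, co2, deg=1)
linear_fit = slope * t + intercept

# ...and subtract it, leaving the seasonal component plus noise.
residual = co2 - linear_fit
```

Subtracting `linear_fit` from the series leaves a residual that oscillates around zero, which is the seasonal component we will model next.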