### A Pluto.jl notebook ###
# v0.20.4
using Markdown
using InteractiveUtils
# ╔═╡ 0dd544c8-b7c6-11ef-106b-99e6f84894c3
md"""
# Continuous Data and the Gaussian Distribution
"""
# ╔═╡ 0dd817c0-b7c6-11ef-1f8b-ff0f59f7a8ce
md"""
### Preliminaries
Goal
* Review of information processing with Gaussian distributions in linear systems
Materials
* Mandatory
  * These lecture notes
* Optional
  * Bishop pp. 85-93
  * [MacKay - 2006 - The Humble Gaussian Distribution](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Mackay-2006-The-humble-Gaussian-distribution.pdf) (highly recommended!)
  * [Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp. 30-34, section 2.8, the Gaussian distribution
* References
  * [E.T. Jaynes - 2003 - Probability Theory, The Logic of Science](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf) (best book available on the Bayesian view on probability theory)
"""
# ╔═╡ 0dd82814-b7c6-11ef-3927-b3ec0b632c31
md"""
### Example Problem
Consider a set of observations ``D=\{x_1,…,x_N\}`` in the 2-dimensional plane (see Figure). All observations were generated by the same process. We now draw an extra observation ``x_\bullet = (a,b)`` from the same data generating process. What is the probability that ``x_\bullet`` lies within the shaded rectangle ``S``?
"""
# ╔═╡ 0dd82864-b7c6-11ef-097a-b5861a1f8411
using Pkg; Pkg.activate("../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
# ╔═╡ 0dd8288c-b7c6-11ef-347d-f55f7ef817d2
using Distributions, Plots, LaTeXStrings
N = 100
generative_dist = MvNormal([0,1.], [0.8 0.5; 0.5 1.0])
D = rand(generative_dist, N) # Generate observations from generative_dist
scatter(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
x_dot = rand(generative_dist) # Generate x∙
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray,label=L"S")
# ╔═╡ 0dd835ca-b7c6-11ef-0e33-1329e4ba13d8
md"""
### The Gaussian Distribution
Consider a random (vector) variable ``x \in \mathbb{R}^M`` that is "normally" (i.e., Gaussian) distributed. The *moment* parameterization of the Gaussian distribution is completely specified by its *mean* ``\mu`` and *covariance matrix* ``\Sigma`` and is given by
```math
p(x | \mu, \Sigma) = \mathcal{N}(x|\mu,\Sigma) \triangleq \frac{1}{\sqrt{(2\pi)^M |\Sigma|}} \,\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}\,.
```
where ``|\Sigma| \triangleq \mathrm{det}(\Sigma)`` is the determinant of ``\Sigma``.
For the scalar real variable ``x \in \mathbb{R}``, this works out to
```math
p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2 }} \,\exp\left\{-\frac{(x-\mu)^2}{2 \sigma^2} \right\}\,.
```
"""
# ╔═╡ 0dd84542-b7c6-11ef-3115-0f8b26aeaa5d
md"""
Alternatively, the <a id="natural-parameterization">*canonical* (a.k.a. *natural* or *information* ) parameterization</a> of the Gaussian distribution is given by
```math
\begin{equation*}
p(x | \eta, \Lambda) = \mathcal{N}_c(x|\eta,\Lambda) = \exp\left\{ a + \eta^T x - \frac{1}{2}x^T \Lambda x \right\}\,.
\end{equation*}
```
where
```math
a = -\frac{1}{2} \left( M \log(2 \pi) - \log |\Lambda| + \eta^T \Lambda^{-1} \eta\right)
```
is the normalizing constant that ensures that ``\int p(x)\mathrm{d}x = 1``,
```math
\Lambda = \Sigma^{-1}
```
is called the *precision matrix*, and
```math
\eta = \Sigma^{-1} \mu
```
is the *natural* mean, for clarity often called the *precision-weighted* mean.
"""
# ╔═╡ 0dd8528a-b7c6-11ef-3bc9-eb09c0c530d8
md"""
### Why the Gaussian?
Why is the Gaussian distribution so ubiquitously used in science and engineering? (see also [Jaynes, section 7.14](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf#page=250), and the whole chapter 7 in his book).
"""
# ╔═╡ 0dd85c94-b7c6-11ef-06dc-7b8797c13fda
md"""
(1) Operations on probability distributions tend to lead to Gaussian distributions:
* Any smooth function with a single rounded maximum, if raised to higher and higher powers, approaches a Gaussian function (useful in sequential Bayesian inference).
* The [Gaussian distribution has higher entropy](https://en.wikipedia.org/wiki/Differential_entropy#Maximization_in_the_normal_distribution) than any other with the same variance.
* Therefore any operation on a probability distribution that discards information but preserves variance gets us closer to a Gaussian.
* As an example, see [Jaynes, section 7.1.4](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf#page=250) for how this leads to the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), which results from performing convolution operations on distributions.
"""
# ╔═╡ 0dd8677a-b7c6-11ef-357f-2328b10f5274
md"""
(2) Once the Gaussian has been attained, this form tends to be preserved. e.g.,
* The convolution of two Gaussian functions is another Gaussian function (useful for sums of two variables and for linear transformations).
* The product of two Gaussian functions is another Gaussian function (useful in Bayes rule).
* The Fourier transform of a Gaussian function is another Gaussian function.
"""
# ╔═╡ 0dd86f40-b7c6-11ef-2ae8-a3954469bcee
md"""
### Transformations and Sums of Gaussian Variables
A **linear transformation** ``z=Ax+b`` of a Gaussian variable ``x \sim \mathcal{N}(\mu_x,\Sigma_x)`` is Gaussian distributed as
```math
p(z) = \mathcal{N} \left(z \,|\, A\mu_x+b, A\Sigma_x A^T \right) \tag{SRG-4a}
```
In fact, after a linear transformation ``z=Ax+b``, no matter how ``x`` is distributed, the mean and variance of ``z`` are always given by ``\mu_z = A\mu_x + b`` and ``\Sigma_z = A\Sigma_x A^T``, respectively (see [probability theory review lesson](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Probability-Theory-Review.ipynb#linear-transformation)). In case ``x`` is not Gaussian, higher order moments may be needed to specify the distribution for ``z``.
"""
# ╔═╡ 0dd87a3a-b7c6-11ef-2bc2-bf2b4969537c
md"""
The **sum of two independent Gaussian variables** is also Gaussian distributed. Specifically, if ``x \sim \mathcal{N} \left(\mu_x, \Sigma_x \right)`` and ``y \sim \mathcal{N} \left(\mu_y, \Sigma_y \right)``, then the PDF for ``z=x+y`` is given by
```math
\begin{align*}
p(z) &= \mathcal{N}(x\,|\,\mu_x,\Sigma_x) \ast \mathcal{N}(y\,|\,\mu_y,\Sigma_y) \\
&= \mathcal{N} \left(z\,|\,\mu_x+\mu_y, \Sigma_x +\Sigma_y \right) \tag{SRG-8}
\end{align*}
```
The sum of two Gaussian *distributions* is NOT a Gaussian distribution. Why not?
"""
# ╔═╡ 0dd88110-b7c6-11ef-0b82-2ffe13a68cad
md"""
### Example: Gaussian Signals in a Linear System
<p style="text-align:center;"><img src="./figures/fig-linear-system.png" width="400px"></p>
Given independent variables
```math
x \sim \mathcal{N}(\mu_x,\sigma_x^2)
```
and ``y \sim \mathcal{N}(\mu_y,\sigma_y^2)``, what is the PDF for ``z = A\cdot(x -y) + b`` ? (for answer, see [Exercises](http://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Exercises-The-Gaussian-Distribution.ipynb))
"""
# ╔═╡ 0dd88a84-b7c6-11ef-133c-3d85f0703c19
md"""
Think about the role of the Gaussian distribution for stochastic linear systems in relation to what sinusoids mean for deterministic linear system analysis.
"""
# ╔═╡ 0dd890ee-b7c6-11ef-04b7-e7671227d8cb
md"""
### Bayesian Inference for the Gaussian
Let's estimate a constant ``\theta`` from one 'noisy' measurement ``x`` about that constant.
We assume the following measurement equations (the tilde ``\sim`` means: 'is distributed as'):
```math
\begin{align*}
x &= \theta + \epsilon \\
\epsilon &\sim \mathcal{N}(0,\sigma^2)
\end{align*}
```
Also, let's assume a Gaussian prior for ``\theta``
```math
\begin{align*}
\theta &\sim \mathcal{N}(\mu_0,\sigma_0^2) \\
\end{align*}
```
"""
# ╔═╡ 0dd89b6e-b7c6-11ef-2525-73ee0242eb91
md"""
##### Model specification
Note that you can rewrite these specifications in probabilistic notation as follows:
```math
\begin{align*}
p(x|\theta) &= \mathcal{N}(x|\theta,\sigma^2) \\
p(\theta) &=\mathcal{N}(\theta|\mu_0,\sigma_0^2)
\end{align*}
```
"""
# ╔═╡ 0dd8b5d6-b7c6-11ef-1eb9-4f4289261e79
md"""
(**Notational convention**). Note that we write ``\epsilon \sim \mathcal{N}(0,\sigma^2)`` but not ``\epsilon \sim \mathcal{N}(\epsilon | 0,\sigma^2)``, and we write ``p(\theta) =\mathcal{N}(\theta|\mu_0,\sigma_0^2)`` but not ``p(\theta) =\mathcal{N}(\mu_0,\sigma_0^2)``.
"""
# ╔═╡ 0dd8c024-b7c6-11ef-3ca4-f9e8286cbb64
md"""
##### Inference
For simplicity, we assume that the variance ``\sigma^2`` is given and will proceed to derive a Bayesian posterior for the mean ``\theta``. The case for Bayesian inference of ``\sigma^2`` with a given mean is [discussed in the optional slides](#inference-for-precision).
"""
# ╔═╡ 0dd8d976-b7c6-11ef-051f-4f6cb3db3d1b
md"""
Let's do Bayes rule for the posterior PDF ``p(\theta|x)``.
```math
\begin{align*}
p(\theta|x) &= \frac{p(x|\theta) p(\theta)}{p(x)} \propto p(x|\theta) p(\theta) \\
&= \mathcal{N}(x|\theta,\sigma^2) \mathcal{N}(\theta|\mu_0,\sigma_0^2) \\
&\propto \exp \left\{ -\frac{(x-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2} \right\} \\
&\propto \exp \left\{ \theta^2 \cdot \left( -\frac{1}{2 \sigma_0^2} - \frac{1}{2\sigma^2} \right) + \theta \cdot \left( \frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right) \right\} \\
&\propto \exp\left\{ -\frac{\sigma_0^2 + \sigma^2}{2 \sigma_0^2 \sigma^2} \left( \theta - \frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2}\right)^2 \right\}
\end{align*}
```
which we recognize as a Gaussian distribution w.r.t. ``\theta``.
"""
# ╔═╡ 0dd8df66-b7c6-11ef-011a-8d90bba8e2cd
md"""
(Just as an aside,) this computational 'trick' for multiplying two Gaussians is called **completing the square**. The procedure makes use of the equality
```math
ax^2+bx+c_1 = a\left(x+\frac{b}{2a}\right)^2+c_2
```
"""
# ╔═╡ 0dd8ea56-b7c6-11ef-0116-691b99023eb5
md"""
In particular, it follows that the posterior for ``\theta`` is
```math
\begin{equation*}
p(\theta|x) = \mathcal{N} (\theta |\, \mu_1, \sigma_1^2)
\end{equation*}
```
where
```math
\begin{align*}
\frac{1}{\sigma_1^2} &= \frac{\sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \\
\mu_1 &= \frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2} = \sigma_1^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \frac{1}{\sigma^2} x \right)
\end{align*}
```
"""
# ╔═╡ 0dd8f1fe-b7c6-11ef-3386-e37f33577577
md"""
### (Multivariate) Gaussian Multiplication
So, multiplication of two Gaussian distributions yields another (unnormalized) Gaussian with
* posterior precision equals **sum of prior precisions**
* posterior precision-weighted mean equals **sum of prior precision-weighted means**
"""
# ╔═╡ 0dd8fbe2-b7c6-11ef-1f78-63dfd48146fd
md"""
As we just saw, a Gaussian prior, combined with a Gaussian likelihood, make Bayesian inference analytically solvable (!):
```math
\begin{equation*}
\underbrace{\text{Gaussian}}_{\text{posterior}}
\propto \underbrace{\text{Gaussian}}_{\text{likelihood}} \times \underbrace{\text{Gaussian}}_{\text{prior}}
\end{equation*}
```
"""
# ╔═╡ 0dd90644-b7c6-11ef-2fcf-2948d45f43bb
md"""
<a id="Gaussian-multiplication"></a>In general, the multiplication of two multi-variate Gaussians over ``x`` yields an (unnormalized) Gaussian over ``x``:
```math
\begin{equation*}
\boxed{\mathcal{N}(x|\mu_a,\Sigma_a) \cdot \mathcal{N}(x|\mu_b,\Sigma_b) = \underbrace{\mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)}_{\text{normalization constant}} \cdot \mathcal{N}(x|\mu_c,\Sigma_c)} \tag{SRG-6}
\end{equation*}
```
where
```math
\begin{align*}
\Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\
\Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b
\end{align*}
```
"""
# ╔═╡ 0dd91b7a-b7c6-11ef-1326-7bbfe5ac16bf
md"""
Check out that normalization constant ``\mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)``. Amazingly, this constant can also be expressed by a Gaussian!
"""
# ╔═╡ 0dd9264e-b7c6-11ef-0fa9-d3e4e5053654
md"""
```math
\Rightarrow
```
Note that Bayesian inference is trivial in the [*canonical* parameterization of the Gaussian](#natural-parameterization), where we would get
```math
\begin{align*}
\Lambda_c &= \Lambda_a + \Lambda_b \quad &&\text{(precisions add)}\\
\eta_c &= \eta_a + \eta_b \quad &&\text{(precision-weighted means add)}
\end{align*}
```
This property is an important reason why the canonical parameterization of the Gaussian distribution is useful in Bayesian data processing.
"""
# ╔═╡ 0dd93204-b7c6-11ef-143e-2b7b182f8be1
md"""
### Code Example: Product of Two Gaussian PDFs
Let's plot the exact product of two Gaussian PDFs as well as the normalized product according to the above derivation.
"""
# ╔═╡ 0dd93236-b7c6-11ef-2656-b914f13c4ecd
using Plots, Distributions, LaTeXStrings
d1 = Normal(0, 1) # μ=0, σ^2=1
d2 = Normal(3, 2) # μ=3, σ^2=4
# Calculate the parameters of the product d1*d2
s2_prod = (d1.σ^-2 + d2.σ^-2)^-1
m_prod = s2_prod * ((d1.σ^-2)*d1.μ + (d2.σ^-2)*d2.μ)
d_prod = Normal(m_prod, sqrt(s2_prod)) # Note that we neglect the normalization constant.
# Plot stuff
x = range(-4, stop=8, length=100)
plot(x, pdf.(d1,x), label=L"\mathcal{N}(0,1)", fill=(0, 0.1)) # Plot the first Gaussian
plot!(x, pdf.(d2,x), label=L"\mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the second Gaussian
plot!(x, pdf.(d1,x) .* pdf.(d2,x), label=L"\mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the exact product
plot!(x, pdf.(d_prod,x), label=L"Z^{-1} \mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the normalized Gaussian product
# ╔═╡ 0dd93f08-b7c6-11ef-3ad5-97d01baafa7c
md"""
### Bayesian Inference with Multiple Observations
Now consider that we measure a data set ``D = \{x_1, x_2, \ldots, x_N\}``, with measurements
```math
\begin{aligned}
x_n &= \theta + \epsilon_n \\
\epsilon_n &\sim \mathcal{N}(0,\sigma^2)
\end{aligned}
```
and the same prior for ``\theta``:
```math
\theta \sim \mathcal{N}(\mu_0,\sigma_0^2)
```
Let's derive the distribution ``p(x_{N+1}|D)`` for the next sample.
"""
# ╔═╡ 0dd94cb4-b7c6-11ef-0d42-5f5f3b071afa
md"""
##### Inference
First, we derive the posterior for ``\theta``:
```math
\begin{align*}
p(\theta|D) \propto \underbrace{\mathcal{N}(\theta|\mu_0,\sigma_0^2)}_{\text{prior}} \cdot \underbrace{\prod_{n=1}^N \mathcal{N}(x_n|\theta,\sigma^2)}_{\text{likelihood}}
\end{align*}
```
which is a product of ``N+1`` Gaussians and is therefore also Gaussian distributed.
"""
# ╔═╡ 0dd96092-b7c6-11ef-08b6-99348eca8529
md"""
Using the property that precisions and precision-weighted means add when Gaussians are multiplied, we can immediately write the posterior
```math
p(\theta|D) = \mathcal{N} (\theta |\, \mu_N, \sigma_N^2)
```
as
```math
\begin{align*}
\frac{1}{\sigma_N^2} &= \frac{1}{\sigma_0^2} + \sum_n \frac{1}{\sigma^2} \qquad &\text{(B-2.142)} \\
\mu_N &= \sigma_N^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \sum_n \frac{1}{\sigma^2} x_n \right) \qquad &\text{(B-2.141)}
\end{align*}
```
"""
# ╔═╡ 0dd992ee-b7c6-11ef-3add-cdf7452bc514
md"""
##### Application: prediction of a future sample
We now have a posterior for the model parameters. Let's write down what we know about the next sample ``x_{N+1}``.
```math
\begin{align*}
p(x_{N+1}|D) &= \int p(x_{N+1}|\theta) p(\theta|D)\mathrm{d}\theta \\
&= \int \mathcal{N}(x_{N+1}|\theta,\sigma^2) \mathcal{N}(\theta|\mu_N,\sigma^2_N) \mathrm{d}\theta \\
&= \int \mathcal{N}(\theta|x_{N+1},\sigma^2) \mathcal{N}(\theta|\mu_N,\sigma^2_N) \mathrm{d}\theta \\
&= \int \mathcal{N}(x_{N+1}|\mu_N, \sigma^2_N +\sigma^2 ) \mathcal{N}(\theta|\cdot,\cdot)\mathrm{d}\theta \tag{use SRG-6} \\
&= \mathcal{N}(x_{N+1}|\mu_N, \sigma^2_N +\sigma^2 ) \underbrace{\int \mathcal{N}(\theta|\cdot,\cdot)\mathrm{d}\theta}_{=1} \\
&=\mathcal{N}(x_{N+1}|\mu_N, \sigma^2_N +\sigma^2 )
\end{align*}
```
"""
# ╔═╡ 0dd9a40a-b7c6-11ef-2864-8318d8f3d827
md"""
The uncertainty about ``x_{N+1}`` involves both the uncertainty about the parameter (``\sigma_N^2``) and the observation noise (``\sigma^2``).
"""
# ╔═╡ 0dd9b71a-b7c6-11ef-2c4a-a3f9e7f2bc87
md"""
### Maximum Likelihood Estimation for the Gaussian
In order to determine the *maximum likelihood* estimate of ``\theta``, we let ``\sigma_0^2 \rightarrow \infty`` (which leads to a uniform prior for ``\theta``), yielding ``\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2}`` and consequently
```math
\begin{align*}
\mu_{\text{ML}} = \left.\mu_N\right\vert_{\sigma_0^2 \rightarrow \infty} = \sigma_N^2 \, \left( \frac{1}{\sigma^2}\sum_n x_n \right) = \frac{1}{N} \sum_{n=1}^N x_n
\end{align*}
```
"""
# ╔═╡ 0dd9ccfa-b7c6-11ef-2379-2967a0b4ad07
md"""
With an expression for the maximum likelihood estimate in hand, we can now rewrite the (Bayesian) posterior mean for ``\theta`` as
```math
\begin{align*}
\underbrace{\mu_N}_{\text{posterior}} &= \sigma_N^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \sum_n \frac{1}{\sigma^2} x_n \right) \\
&= \frac{\sigma_0^2 \sigma^2}{N\sigma_0^2 + \sigma^2} \, \left( \frac{1}{\sigma_0^2} \mu_0 + \sum_n \frac{1}{\sigma^2} x_n \right) \\
&= \frac{ \sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N \sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{\text{ML}} \\
&= \underbrace{\mu_0}_{\text{prior}} + \underbrace{\underbrace{\frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2}}_{\text{gain}}\cdot \underbrace{\left(\mu_{\text{ML}} - \mu_0 \right)}_{\text{prediction error}}}_{\text{correction}}\tag{B-2.141}
\end{align*}
```
"""
# ╔═╡ 0dd9db78-b7c6-11ef-1005-73e5d7a4fc4b
md"""
Hence, the posterior mean always lies somewhere between the prior mean ``\mu_0`` and the maximum likelihood estimate (the "data" mean) ``\mu_{\text{ML}}``.
"""
# ╔═╡ 0dd9ed22-b7c6-11ef-19e5-038711d75259
md"""
### Conditioning and Marginalization of a Gaussian
Let ``z = \begin{bmatrix} x \\ y \end{bmatrix}`` be jointly normally distributed as
```math
\begin{align*}
p(z) &= \mathcal{N}(z | \mu, \Sigma)
=\mathcal{N} \left( \begin{bmatrix} x \\ y \end{bmatrix} \left| \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} \right. \right)
\end{align*}
```
"""
# ╔═╡ 0dd9fb08-b7c6-11ef-0350-c529776149da
md"""
Since covariance matrices are by definition symmetric, it follows that ``\Sigma_x`` and ``\Sigma_y`` are symmetric and ``\Sigma_{xy} = \Sigma_{yx}^T``.
"""
# ╔═╡ 0dda09f4-b7c6-11ef-2429-377131c95b8e
md"""
Let's factorize ``p(z) = p(x,y)`` as ``p(x,y) = p(y|x) p(x)`` through conditioning and marginalization.
"""
# ╔═╡ 0dda16ce-b7c6-11ef-3b84-056673f08e89
md"""
```math
\begin{equation*}
\text{conditioning: }\boxed{ p(y|x) = \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right)}
\end{equation*}
```
"""
# ╔═╡ 0dda22f4-b7c6-11ef-05ec-ef5e23c533a1
md"""
```math
\begin{equation*}
\text{marginalization: } \boxed{ p(x) = \mathcal{N}\left( x|\mu_x, \Sigma_x \right)}
\end{equation*}
```
"""
# ╔═╡ 0dda301e-b7c6-11ef-0188-0d6a9782abfa
md"""
**proof**: in Bishop pp.87-89
"""
# ╔═╡ 0dda3d8e-b7c6-11ef-0e2e-9942afc06c32
md"""
Hence, conditioning and marginalization in Gaussians leads to Gaussians again. This is very useful for applications to Bayesian inference in jointly Gaussian systems.
"""
# ╔═╡ 0dda4b3a-b7c6-11ef-17c2-5f5ccd912eee
md"""
With a natural parameterization of the Gaussian ``p(z) = \mathcal{N}_c(z|\eta,\Lambda)`` with precision matrix ``\Lambda = \Sigma^{-1} = \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{yx} & \Lambda_y \end{bmatrix}``, the conditioning operation takes a simpler form; see Bishop pg.90, eqs. 2.96 and 2.97.
"""
# ╔═╡ 0dda6b2e-b7c6-11ef-14ee-25d9a3acaf11
md"""
As an exercise, interpret the formula for the conditional mean (``\mathbb{E}[y|x]=\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x)``) as a prediction-correction operation.
"""
# ╔═╡ 0dda770e-b7c6-11ef-2988-397f0085c3a3
md"""
### Code Example: Joint, Marginal, and Conditional Gaussian Distributions
Let's plot the joint, marginal, and conditional distributions.
"""
# ╔═╡ 0dda774a-b7c6-11ef-2750-4960eef0932b
using Plots, LaTeXStrings, Distributions
# Define the joint distribution p(x,y)
μ = [1.0; 2.0]
Σ = [0.3 0.7;
0.7 2.0]
joint = MvNormal(μ,Σ)
# Define the marginal distribution p(x)
marginal_x = Normal(μ[1], sqrt(Σ[1,1]))
# Plot p(x,y)
x_range = y_range = range(-2,stop=5,length=1000)
joint_pdf = [ pdf(joint, [x_range[i];y_range[j]]) for j=1:length(y_range), i=1:length(x_range)]
plot_1 = heatmap(x_range, y_range, joint_pdf, title = L"p(x, y)")
# Plot p(x)
plot_2 = plot(range(-2,stop=5,length=1000), pdf.(marginal_x, range(-2,stop=5,length=1000)), title = L"p(x)", label="", fill=(0, 0.1))
# Plot p(y|x = 0.1)
x = 0.1
conditional_y_m = μ[2]+Σ[2,1]*inv(Σ[1,1])*(x-μ[1])
conditional_y_s2 = Σ[2,2] - Σ[2,1]*inv(Σ[1,1])*Σ[1,2]
conditional_y = Normal(conditional_y_m, sqrt.(conditional_y_s2))
plot_3 = plot(range(-2,stop=5,length=1000), pdf.(conditional_y, range(-2,stop=5,length=1000)), title = L"p(y|x = %$x)", label="", fill=(0, 0.1))
plot(plot_1, plot_2, plot_3, layout=(1,3), size=(1200,300))
# ╔═╡ 0dda842e-b7c6-11ef-24b6-19e2fad91333
md"""
As is clear from the plots, the conditional distribution is a renormalized slice from the joint distribution.
"""
# ╔═╡ 0dda9086-b7c6-11ef-2455-732cd6d69407
md"""
### Example: Conditioning of Gaussian
Consider (again) the system
```math
\begin{align*}
p(x\,|\,\theta) &= \mathcal{N}(x\,|\,\theta,\sigma^2) \\
p(\theta) &= \mathcal{N}(\theta\,|\,\mu_0,\sigma_0^2)
\end{align*}
```
"""
# ╔═╡ 0dda9d36-b7c6-11ef-1ab4-7b341b8cfcdf
md"""
Let ``z = \begin{bmatrix} x \\ \theta \end{bmatrix}``. The distribution for ``z`` is then given by (Exercise)
```math
p(z) = p\left(\begin{bmatrix} x \\ \theta \end{bmatrix}\right) = \mathcal{N} \left( \begin{bmatrix} x\\
\theta \end{bmatrix}
\,\left|\, \begin{bmatrix} \mu_0\\
\mu_0\end{bmatrix},
\begin{bmatrix} \sigma_0^2+\sigma^2 & \sigma_0^2\\
\sigma_0^2 &\sigma_0^2
\end{bmatrix}
\right. \right)
```
"""
# ╔═╡ 0ddaa9f4-b7c6-11ef-01a0-a78e551e6414
md"""
Direct substitution of the rule for Gaussian conditioning leads to the <a id="precision-weighted-update">posterior</a> (derivation as an Exercise):
```math
\begin{align*}
p(\theta|x) &= \mathcal{N} \left( \theta\,|\,\mu_1, \sigma_1^2 \right)\,,
\end{align*}
```
with
```math
\begin{align*}
K &= \frac{\sigma_0^2}{\sigma_0^2+\sigma^2} \qquad \text{($K$ is called the Kalman gain)}\\
\mu_1 &= \mu_0 + K \cdot (x-\mu_0)\\
\sigma_1^2 &= \left( 1-K \right) \sigma_0^2
\end{align*}
```
"""
# ╔═╡ 0ddab62e-b7c6-11ef-1b65-df9e3d1087d6
md"""
```math
\Rightarrow
```
Moral: For jointly Gaussian systems, we can do inference simply in one step by using the formulas for conditioning and marginalization.
"""
# ╔═╡ 0ddae00e-b7c6-11ef-33f2-b565ce8fc3ba
md"""
### Recursive Bayesian Estimation for Adaptive Signal Processing
Consider the signal ``x_t=\theta+\epsilon_t``, where ``D_t= \left\{x_1,\ldots,x_t\right\}`` is observed *sequentially* (over time).
**Problem**: Derive a recursive algorithm for ``p(\theta|D_t)``, i.e., an update rule for (posterior) ``p(\theta|D_t)`` based on (prior) ``p(\theta|D_{t-1})`` and (new observation) ``x_t``.
"""
# ╔═╡ 0ddafb7a-b7c6-11ef-3c3f-c9fa7af39c92
md"""
##### Model specification
Let's define the estimate after ``t`` observations (i.e., our *solution* ) as ``p(\theta|D_t) = \mathcal{N}(\theta\,|\,\mu_t,\sigma_t^2)``.
We define the joint distribution for ``\theta`` and ``x_t``, given background ``D_{t-1}``, by
```math
\begin{align*} p(x_t,\theta \,|\, D_{t-1}) &= p(x_t|\theta) \, p(\theta|D_{t-1}) \\
&= \underbrace{\mathcal{N}(x_t\,|\, \theta,\sigma^2)}_{\text{likelihood}} \, \underbrace{\mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2)}_{\text{prior}}
\end{align*}
```
"""
# ╔═╡ 0ddb085c-b7c6-11ef-34fd-6b1b18a95ff1
md"""
##### Inference
Use Bayes rule,
```math
\begin{align*}
p(\theta|D_t) &= p(\theta|x_t,D_{t-1}) \\
&\propto p(x_t,\theta | D_{t-1}) \\
&= p(x_t|\theta) \, p(\theta|D_{t-1}) \\
&= \mathcal{N}(x_t|\theta,\sigma^2) \, \mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2) \\
&= \mathcal{N}(\theta|x_t,\sigma^2) \, \mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2) \;\;\text{(note this trick)}\\
&= \mathcal{N}(\theta|\mu_t,\sigma_t^2) \;\;\text{(use Gaussian multiplication formula SRG-6)}
\end{align*}
```
with
```math
\begin{align*}
K_t &= \frac{\sigma_{t-1}^2}{\sigma_{t-1}^2+\sigma^2} \qquad \text{(Kalman gain)}\\
\mu_t &= \mu_{t-1} + K_t \cdot (x_t-\mu_{t-1})\\
\sigma_t^2 &= \left( 1-K_t \right) \sigma_{t-1}^2
\end{align*}
```
"""
# ╔═╡ 0ddb163a-b7c6-11ef-2b06-a1d6677b7191
md"""
This linear *sequential* estimator of mean and variance in Gaussian observations is called a **Kalman Filter**.
The new observation ``x_t`` 'updates' the old estimate ``\mu_{t-1}`` by a quantity that is proportional to the *innovation* (or *residual*) ``\left( x_t - \mu_{t-1} \right)``.
"""
# ╔═╡ 0ddb2302-b7c6-11ef-1f50-27711dbe4d33
md"""
The so-called Kalman gain ``K_t`` serves as a "learning rate" (step size) in the parameter update equation ``\mu_t = \mu_{t-1} + K_t \cdot (x_t-\mu_{t-1})``. Note that *you* don't need to choose the learning rate. Bayesian inference computes its own (optimal) learning rates.
"""
# ╔═╡ 0ddb2fa0-b7c6-11ef-3ac5-8979f2a0a00c
md"""
Note that the uncertainty about ``\theta`` decreases over time (since ``0<(1-K_t)<1``). If we assume that the statistics of the system do not change (stationarity), each new sample provides new information about the process, so the uncertainty decreases.
"""
# ╔═╡ 0ddb3c34-b7c6-11ef-2a77-895cbc5796f3
md"""
Recursive Bayesian estimation as discussed here is the basis for **adaptive signal processing** algorithms such as Least Mean Squares (LMS) and Recursive Least Squares (RLS). Both RLS and LMS are special cases of Recursive Bayesian estimation.
"""
# ╔═╡ 0ddb4b54-b7c6-11ef-121d-5d00e547debd
md"""
### Code Example: Kalman Filter
Let's implement the Kalman filter described above. We'll use it to recursively estimate the value of ``\theta`` based on noisy observations.
"""
# ╔═╡ 0ddb4bb4-b7c6-11ef-373a-ab345190363a
using Plots, Distributions
n = 100 # specify number of observations
θ = 2.0 # true value of the parameter we would like to estimate
noise_σ2 = 0.3 # variance of observation noise
observations = sqrt(noise_σ2) * randn(n) .+ θ # scale by the standard deviation (randn has unit variance), so the noise variance is noise_σ2
function perform_kalman_step(prior :: Normal, x :: Float64, noise_σ2 :: Float64)
    prior_σ2 = var(prior)                            # variance of the prior distribution
    K = prior_σ2 / (noise_σ2 + prior_σ2)             # compute the Kalman gain
    posterior_μ = mean(prior) + K*(x - mean(prior))  # update the posterior mean
    posterior_σ2 = prior_σ2 * (1.0 - K)              # update the posterior variance
    return Normal(posterior_μ, sqrt(posterior_σ2))   # return the posterior distribution (Normal is parameterized by the standard deviation)
end
post_μ = fill!(Vector{Float64}(undef,n + 1), NaN) # means of p(θ|D) over time
post_σ2 = fill!(Vector{Float64}(undef,n + 1), NaN) # variances of p(θ|D) over time
prior = Normal(0, 1) # specify the prior distribution (you can play with the parameterization of this to get a feeling of how the Kalman filter converges)
post_μ[1] = mean(prior) # save prior mean and variance to show these in plot
post_σ2[1] = var(prior)
for (i, x) in enumerate(observations) # note that this loop demonstrates Bayesian learning on streaming data; we update the prior distribution using observation(s), after which this posterior becomes the new prior for future observations
    posterior = perform_kalman_step(prior, x, noise_σ2) # compute the posterior distribution given the observation
    post_μ[i + 1] = mean(posterior) # save the mean of the posterior distribution
    post_σ2[i + 1] = var(posterior) # save the variance of the posterior distribution
    prior = posterior # the posterior becomes the prior for future observations
end
obs_scale = collect(2:n+1)
scatter(obs_scale, observations, label=L"D") # scatter the observations
post_scale = collect(1:n+1)
plot!(post_scale, post_μ, ribbon=sqrt.(post_σ2), linewidth=3, label=L"p(θ | D_t)") # plot the estimated means of the intermediate posteriors with a ±1 standard deviation ribbon
plot!(post_scale, θ*ones(n + 1), linewidth=2, label=L"θ") # plot the true value of θ
# ╔═╡ 0ddb7294-b7c6-11ef-0585-3f1a218aeb42
md"""
The shaded area spans two standard deviations (``\mu_t \pm \sigma_t``) of the posterior ``p(\theta|D_t)``. The variance of the posterior is guaranteed to decrease monotonically for the standard Kalman filter.
"""
# ╔═╡ 0ddb9904-b7c6-11ef-3808-35b8ee37dd04
md"""
### <a id="product-of-gaussians">Product of Normally Distributed Variables</a>
(We've seen that) the sum of two Gaussian-distributed variables is also Gaussian distributed.
"""
# ╔═╡ 0ddba9ee-b7c6-11ef-3148-9db5fbb13d77
md"""
Does the *product* of two Gaussian-distributed variables also have a Gaussian distribution?
"""
# ╔═╡ 0ddbba2e-b7c6-11ef-04cf-1119024af1d1
md"""
**No**! In general this is a difficult computation. As an example, let's compute ``p(z)`` for ``Z=XY`` for the special case that ``X\sim \mathcal{N}(0,1)`` and ``Y\sim \mathcal{N}(0,1)``.
```math
\begin{align*}
p(z) &= \int_{X,Y} p(z|x,y)\,p(x,y)\,\mathrm{d}x\mathrm{d}y \\
&= \frac{1}{2 \pi}\int \delta(z-xy) \, e^{-(x^2+y^2)/2} \, \mathrm{d}x\mathrm{d}y \\
&= \frac{1}{\pi} \int_0^\infty \frac{1}{x} e^{-(x^2+z^2/x^2)/2} \, \mathrm{d}x \\
&= \frac{1}{\pi} \mathrm{K}_0( \lvert z\rvert )\,.
\end{align*}
```
where ``\mathrm{K}_n(z)`` is a [modified Bessel function of the second kind](http://mathworld.wolfram.com/ModifiedBesselFunctionoftheSecondKind.html).
"""
# ╔═╡ 0ddbc78a-b7c6-11ef-2ce4-f76fa4153e4b
md"""
### Code Example: Product of Gaussian Distributions
We plot ``p(Z=XY)`` and ``p(X)p(Y)`` for ``X\sim\mathcal{N}(0,1)`` and ``Y \sim \mathcal{N}(0,1)`` to give an idea of how these distributions differ.
"""
# ╔═╡ 0ddbc7c8-b7c6-11ef-004f-8bfaa5f29eba
using Plots, Distributions, SpecialFunctions, LaTeXStrings
X = Normal(0,1)
Y = Normal(0,1)
pdf_product_std_normals(z::Vector) = (besselk.(0, abs.(z))./π)
range1 = collect(range(-4,stop=4,length=100))
plot(range1, pdf.(X, range1), label=L"p(X)=p(Y)=\mathcal{N}(0,1)", fill=(0, 0.1))
plot!(range1, pdf.(X,range1).*pdf.(Y,range1), label=L"p(X)*p(Y)", fill=(0, 0.1))
plot!(range1, pdf_product_std_normals(range1), label=L"p(Z=X*Y)", fill=(0, 0.1))
# ╔═╡ 0ddbd3ce-b7c6-11ef-20e1-070d736f7b95
md"""
In short, Gaussian-distributed variables remain Gaussian in linear systems, but this is not the case in non-linear systems.
"""
# ╔═╡ 0ddbf246-b7c6-11ef-16a5-bbf396f80915
md"""
### Solution to Example Problem
We apply maximum likelihood estimation to fit a 2-dimensional Gaussian model (``m``) to data set ``D``. Next, we evaluate ``p(x_\bullet \in S | m)`` by (numerical) integration of the Gaussian pdf over ``S``: ``p(x_\bullet \in S | m) = \int_S p(x|m) \mathrm{d}x``.
"""
# ╔═╡ 0ddbf278-b7c6-11ef-20f5-7ffd3163b14f
using HCubature, LinearAlgebra, Plots, Distributions # HCubature provides the numerical integration routine
# Maximum likelihood estimation of 2D Gaussian
N = size(D, 2) # number of observations (columns of D)
μ = 1/N * sum(D,dims=2)[:,1] # ML estimate of the mean
D_min_μ = D - repeat(μ, 1, N) # center the data
Σ = Hermitian(1/N * D_min_μ*D_min_μ') # ML estimate of the covariance matrix
m = MvNormal(μ, convert(Matrix, Σ));
contour(range(-3, 4, length=100), range(-3, 4, length=100), (x, y) -> pdf(m, [x, y]))
# Numerical integration of p(x|m) over S:
(val,err) = hcubature((x)->pdf(m,x), [0., 1.], [2., 2.])
println("p(x⋅∈S|m) ≈ $(val)")
scatter!(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray, label=L"S")
# ╔═╡ 0ddc02d6-b7c6-11ef-284e-018c7895536e
md"""
### Summary
A **linear transformation** ``z=Ax+b`` of a Gaussian variable ``x \sim \mathcal{N}(\mu_x,\Sigma_x)`` is Gaussian distributed as
```math
p(z) = \mathcal{N} \left(z \,|\, A\mu_x+b, A\Sigma_x A^T \right)
```
Bayesian inference with a Gaussian prior and Gaussian likelihood leads to an analytically computable Gaussian posterior, because of the **multiplication rule for Gaussians**:
```math
\begin{equation*}
\mathcal{N}(x|\mu_a,\Sigma_a) \cdot \mathcal{N}(x|\mu_b,\Sigma_b) = \underbrace{\mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)}_{\text{normalization constant}} \cdot \mathcal{N}(x|\mu_c,\Sigma_c)
\end{equation*}
```
where
```math
\begin{align*}
\Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\
\Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b
\end{align*}
```
**Conditioning and marginalization** of a multivariate Gaussian distribution yields Gaussian distributions. In particular, the joint distribution
```math
\mathcal{N} \left( \begin{bmatrix} x \\ y \end{bmatrix} \left| \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} \right. \right)
```
can be decomposed as
```math
\begin{align*}
p(y|x) &= \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right) \\
p(x) &= \mathcal{N}\left( x|\mu_x, \Sigma_x \right)
\end{align*}
```
Here's a nice [summary of Gaussian calculations](https://github.com/bertdv/AIP-5SSB0/raw/master/lessons/notebooks/files/RoweisS-gaussian_formulas.pdf) by Sam Roweis.
"""
# ╔═╡ 0ddc1028-b7c6-11ef-1eec-6d72e52f4431
md"""
## <center> OPTIONAL SLIDES</center>
"""
# ╔═╡ 0ddc1c2e-b7c6-11ef-00b6-e98913a96420
md"""
### <a id="inference-for-precision">Inference for the Precision Parameter of the Gaussian</a>
Again, we consider an observed data set ``D = \{x_1, x_2, \ldots, x_N\}`` and try to explain these data by a Gaussian distribution.
"""
# ╔═╡ 0ddc287e-b7c6-11ef-1f72-910e6e7b06bb
md"""
We discussed earlier Bayesian inference for the mean with a given variance. Now we will derive a posterior for the variance if the mean is given. (Technically, we will do the derivation for a precision parameter ``\lambda = \sigma^{-2}``, since the discussion is a bit more straightforward for the precision parameter).
"""
# ╔═╡ 0ddc367a-b7c6-11ef-38f9-09fb462987dc
md"""
##### Model specification
The likelihood for the precision parameter is
```math
\begin{align*}
p(D|\lambda) &= \prod_{n=1}^N \mathcal{N}\left(x_n \,|\, \mu, \lambda^{-1} \right) \\
&\propto \lambda^{N/2} \exp\left\{ -\frac{\lambda}{2}\sum_{n=1}^N \left(x_n - \mu \right)^2\right\} \tag{B-2.145}
\end{align*}
```
"""
# ╔═╡ 0ddc4796-b7c6-11ef-2156-8b3d6899a8c0
md"""
The conjugate distribution for this function of ``\lambda`` is the [*Gamma* distribution](https://en.wikipedia.org/wiki/Gamma_distribution), given by
```math
p(\lambda\,|\,a,b) = \mathrm{Gam}\left( \lambda\,|\,a,b \right) \triangleq \frac{1}{\Gamma(a)} b^{a} \lambda^{a-1} \exp\left\{ -b \lambda\right\}\,, \tag{B-2.146}
```
where ``a>0`` and ``b>0`` are known as the *shape* and *rate* parameters, respectively.
<img src="./figures/B-fig-2.13.png" width="600px">
(Bishop fig.2.13). Plots of the Gamma distribution ``\mathrm{Gam}\left( \lambda\,|\,a,b \right)`` for different values of ``a`` and ``b``.