### A Pluto.jl notebook ###
# v0.20.4
using Markdown
using InteractiveUtils
# ╔═╡ 0dd544c8-b7c6-11ef-106b-99e6f84894c3
md"""
# Continuous Data and the Gaussian Distribution
"""
# ╔═╡ 0dd817c0-b7c6-11ef-1f8b-ff0f59f7a8ce
md"""
### Preliminaries
Goal
* Review of information processing with Gaussian distributions in linear systems
Materials
* Mandatory
  * These lecture notes
* Optional
  * Bishop pp. 85-93
  * [MacKay - 2006 - The Humble Gaussian Distribution](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Mackay-2006-The-humble-Gaussian-distribution.pdf) (highly recommended!)
  * [Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp. 30-34, section 2.8, the Gaussian distribution
* References
  * [E.T. Jaynes - 2003 - Probability Theory, The Logic of Science](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf) (best book available on the Bayesian view on probability theory)
"""
# ╔═╡ 0dd82814-b7c6-11ef-3927-b3ec0b632c31
md"""
### Example Problem
Consider a set of observations ``D=\{x_1,…,x_N\}`` in the 2-dimensional plane (see Figure). All observations were generated by the same process. We now draw an extra observation ``x_\bullet = (a,b)`` from the same data generating process. What is the probability that ``x_\bullet`` lies within the shaded rectangle ``S``?
"""
# ╔═╡ 0dd82864-b7c6-11ef-097a-b5861a1f8411
using Pkg; Pkg.activate("../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
# ╔═╡ 0dd8288c-b7c6-11ef-347d-f55f7ef817d2
using Distributions, Plots, LaTeXStrings
N = 100
generative_dist = MvNormal([0,1.], [0.8 0.5; 0.5 1.0])
D = rand(generative_dist, N) # Generate observations from generative_dist
scatter(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
x_dot = rand(generative_dist) # Generate x∙
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray,label=L"S")
# ╔═╡ 0dd835ca-b7c6-11ef-0e33-1329e4ba13d8
md"""
### The Gaussian Distribution
Consider a random (vector) variable ``x \in \mathbb{R}^M`` that is "normally" (i.e., Gaussian) distributed. The *moment* parameterization of the Gaussian distribution is completely specified by its *mean* ``\mu`` and *covariance matrix* ``\Sigma`` and is given by
```math
p(x | \mu, \Sigma) = \mathcal{N}(x|\mu,\Sigma) \triangleq \frac{1}{\sqrt{(2\pi)^M |\Sigma|}} \,\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}\,.
```
where ``|\Sigma| \triangleq \mathrm{det}(\Sigma)`` is the determinant of ``\Sigma``.
For the scalar real variable ``x \in \mathbb{R}``, this works out to
```math
p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2 }} \,\exp\left\{-\frac{(x-\mu)^2}{2 \sigma^2} \right\}\,.
```
"""
# ╔═╡ 0dd84542-b7c6-11ef-3115-0f8b26aeaa5d
md"""
Alternatively, the <a id="natural-parameterization">*canonical* (a.k.a. *natural* or *information* ) parameterization</a> of the Gaussian distribution is given by
```math
\begin{equation*}
p(x | \eta, \Lambda) = \mathcal{N}_c(x|\eta,\Lambda) = \exp\left\{ a + \eta^T x - \frac{1}{2}x^T \Lambda x \right\}\,.
\end{equation*}
```
where
```math
a = -\frac{1}{2} \left( M \log(2 \pi) - \log |\Lambda| + \eta^T \Lambda^{-1} \eta\right)
```
is the normalizing constant that ensures that ``\int p(x)\mathrm{d}x = 1``,
```math
\Lambda = \Sigma^{-1}
```
is called the *precision matrix*, and
```math
\eta = \Sigma^{-1} \mu
```
is the *natural* mean, for clarity often called the *precision-weighted* mean.
"""
# ╔═╡ 0dd8528a-b7c6-11ef-3bc9-eb09c0c530d8
md"""
### Why the Gaussian?
Why is the Gaussian distribution so ubiquitously used in science and engineering? (see also [Jaynes, section 7.14](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf#page=250), and the whole chapter 7 in his book).
"""
# ╔═╡ 0dd85c94-b7c6-11ef-06dc-7b8797c13fda
md"""
(1) Operations on probability distributions tend to lead to Gaussian distributions:
* Any smooth function with a single rounded maximum, if raised to higher and higher powers, approaches a Gaussian function (useful in sequential Bayesian inference).
* The [Gaussian distribution has higher entropy](https://en.wikipedia.org/wiki/Differential_entropy#Maximization_in_the_normal_distribution) than any other with the same variance.
* Therefore any operation on a probability distribution that discards information but preserves variance gets us closer to a Gaussian.
* As an example, see [Jaynes, section 7.1.4](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf#page=250) for how this leads to the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), which results from performing convolution operations on distributions.
"""
# ╔═╡ 0dd8677a-b7c6-11ef-357f-2328b10f5274
md"""
(2) Once the Gaussian has been attained, this form tends to be preserved. e.g.,
* The convolution of two Gaussian functions is another Gaussian function (useful for sums of two variables and for linear transformations).
* The product of two Gaussian functions is another Gaussian function (useful in Bayes rule).
* The Fourier transform of a Gaussian function is another Gaussian function.
"""
# ╔═╡ 0dd86f40-b7c6-11ef-2ae8-a3954469bcee
md"""
### Transformations and Sums of Gaussian Variables
A **linear transformation** ``z=Ax+b`` of a Gaussian variable ``x \sim \mathcal{N}(\mu_x,\Sigma_x)`` is Gaussian distributed as
```math
p(z) = \mathcal{N} \left(z \,|\, A\mu_x+b, A\Sigma_x A^T \right) \tag{SRG-4a}
```
In fact, after a linear transformation ``z=Ax+b``, no matter how ``x`` is distributed, the mean and variance of ``z`` are always given by ``\mu_z = A\mu_x + b`` and ``\Sigma_z = A\Sigma_x A^T``, respectively (see [probability theory review lesson](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Probability-Theory-Review.ipynb#linear-transformation)). In case ``x`` is not Gaussian, higher order moments may be needed to specify the distribution for ``z``.
"""
# ╔═╡ 0dd87a3a-b7c6-11ef-2bc2-bf2b4969537c
md"""
The **sum of two independent Gaussian variables** is also Gaussian distributed. Specifically, if ``x \sim \mathcal{N} \left(\mu_x, \Sigma_x \right)`` and ``y \sim \mathcal{N} \left(\mu_y, \Sigma_y \right)``, then the PDF for ``z=x+y`` is given by
```math
\begin{align*}
p(z) &= \mathcal{N}(x\,|\,\mu_x,\Sigma_x) \ast \mathcal{N}(y\,|\,\mu_y,\Sigma_y) \\
&= \mathcal{N} \left(z\,|\,\mu_x+\mu_y, \Sigma_x +\Sigma_y \right) \tag{SRG-8}
\end{align*}
```
The sum of two Gaussian *distributions* is NOT a Gaussian distribution. Why not?
"""
# ╔═╡ 0dd88110-b7c6-11ef-0b82-2ffe13a68cad
md"""
### Example: Gaussian Signals in a Linear System
<p style="text-align:center;"><img src="./figures/fig-linear-system.png" width="400px"></p>
Given independent variables
```math
x \sim \mathcal{N}(\mu_x,\sigma_x^2)
```
and ``y \sim \mathcal{N}(\mu_y,\sigma_y^2)``, what is the PDF for ``z = A\cdot(x -y) + b`` ? (for answer, see [Exercises](http://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Exercises-The-Gaussian-Distribution.ipynb))
"""
# ╔═╡ 0dd88a84-b7c6-11ef-133c-3d85f0703c19
md"""
Think about the role of the Gaussian distribution for stochastic linear systems in relation to what sinusoids mean for deterministic linear system analysis.
"""
# ╔═╡ 0dd890ee-b7c6-11ef-04b7-e7671227d8cb
md"""
### Bayesian Inference for the Gaussian
Let's estimate a constant ``\theta`` from one 'noisy' measurement ``x`` about that constant.
We assume the following measurement equations (the tilde ``\sim`` means: 'is distributed as'):
```math
\begin{align*}
x &= \theta + \epsilon \\
\epsilon &\sim \mathcal{N}(0,\sigma^2)
\end{align*}
```
Also, let's assume a Gaussian prior for ``\theta``
```math
\begin{align*}
\theta &\sim \mathcal{N}(\mu_0,\sigma_0^2) \\
\end{align*}
```
"""
# ╔═╡ 0dd89b6e-b7c6-11ef-2525-73ee0242eb91
md"""
##### Model specification
Note that you can rewrite these specifications in probabilistic notation as follows:
```math
\begin{align*}
p(x|\theta) &= \mathcal{N}(x|\theta,\sigma^2) \\
p(\theta) &=\mathcal{N}(\theta|\mu_0,\sigma_0^2)
\end{align*}
```
"""
# ╔═╡ 0dd8b5d6-b7c6-11ef-1eb9-4f4289261e79
md"""
(**Notational convention**). Note that we write ``\epsilon \sim \mathcal{N}(0,\sigma^2)`` but not ``\epsilon \sim \mathcal{N}(\epsilon | 0,\sigma^2)``, and we write ``p(\theta) =\mathcal{N}(\theta|\mu_0,\sigma_0^2)`` but not ``p(\theta) =\mathcal{N}(\mu_0,\sigma_0^2)``.
"""
# ╔═╡ 0dd8c024-b7c6-11ef-3ca4-f9e8286cbb64
md"""
##### Inference
For simplicity, we assume that the variance ``\sigma^2`` is given and will proceed to derive a Bayesian posterior for the mean ``\theta``. The case for Bayesian inference of ``\sigma^2`` with a given mean is [discussed in the optional slides](#inference-for-precision).
"""
# ╔═╡ 0dd8d976-b7c6-11ef-051f-4f6cb3db3d1b
md"""
Let's do Bayes rule for the posterior PDF ``p(\theta|x)``.
```math
\begin{align*}
p(\theta|x) &= \frac{p(x|\theta) p(\theta)}{p(x)} \propto p(x|\theta) p(\theta) \\
&= \mathcal{N}(x|\theta,\sigma^2) \mathcal{N}(\theta|\mu_0,\sigma_0^2) \\
&\propto \exp \left\{ -\frac{(x-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2} \right\} \\
&\propto \exp \left\{ \theta^2 \cdot \left( -\frac{1}{2 \sigma_0^2} - \frac{1}{2\sigma^2} \right) + \theta \cdot \left( \frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right) \right\} \\
&\propto \exp\left\{ -\frac{\sigma_0^2 + \sigma^2}{2 \sigma_0^2 \sigma^2} \left( \theta - \frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2}\right)^2 \right\}
\end{align*}
```
which we recognize as a Gaussian distribution w.r.t. ``\theta``.
"""
# ╔═╡ 0dd8df66-b7c6-11ef-011a-8d90bba8e2cd
md"""
(Just as an aside,) this computational 'trick' for multiplying two Gaussians is called **completing the square**. The procedure makes use of the equality
```math
ax^2+bx+c_1 = a\left(x+\frac{b}{2a}\right)^2+c_2
```
"""
# ╔═╡ 0dd8ea56-b7c6-11ef-0116-691b99023eb5
md"""
In particular, it follows that the posterior for ``\theta`` is
```math
\begin{equation*}
p(\theta|x) = \mathcal{N} (\theta |\, \mu_1, \sigma_1^2)
\end{equation*}
```
where
```math
\begin{align*}
\frac{1}{\sigma_1^2} &= \frac{\sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \\
\mu_1 &= \frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2} = \sigma_1^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \frac{1}{\sigma^2} x \right)
\end{align*}
```
"""
# ╔═╡ 0dd8f1fe-b7c6-11ef-3386-e37f33577577
md"""
### (Multivariate) Gaussian Multiplication
So, multiplication of two Gaussian distributions yields another (unnormalized) Gaussian with
* posterior precision equals **sum of prior precisions**
* posterior precision-weighted mean equals **sum of prior precision-weighted means**
"""
# ╔═╡ 0dd8fbe2-b7c6-11ef-1f78-63dfd48146fd
md"""
As we just saw, a Gaussian prior, combined with a Gaussian likelihood, make Bayesian inference analytically solvable (!):
```math
\begin{equation*}
\underbrace{\text{Gaussian}}_{\text{posterior}}
\propto \underbrace{\text{Gaussian}}_{\text{likelihood}} \times \underbrace{\text{Gaussian}}_{\text{prior}}
\end{equation*}
```
"""
# ╔═╡ 0dd90644-b7c6-11ef-2fcf-2948d45f43bb
md"""
<a id="Gaussian-multiplication"></a>In general, the multiplication of two multi-variate Gaussians over ``x`` yields an (unnormalized) Gaussian over ``x``:
```math
\begin{equation*}
\boxed{\mathcal{N}(x|\mu_a,\Sigma_a) \cdot \mathcal{N}(x|\mu_b,\Sigma_b) = \underbrace{\mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)}_{\text{normalization constant}} \cdot \mathcal{N}(x|\mu_c,\Sigma_c)} \tag{SRG-6}
\end{equation*}
```
where
```math
\begin{align*}
\Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\
\Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b
\end{align*}
```
"""
# ╔═╡ 0dd91b7a-b7c6-11ef-1326-7bbfe5ac16bf
md"""
Check out that normalization constant ``\mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)``. Amazingly, this constant can also be expressed by a Gaussian!
"""
# ╔═╡ 0dd9264e-b7c6-11ef-0fa9-d3e4e5053654
md"""
```math
\Rightarrow
```
Note that Bayesian inference is trivial in the [*canonical* parameterization of the Gaussian](#natural-parameterization), where we would get
```math
\begin{align*}
\Lambda_c &= \Lambda_a + \Lambda_b \quad &&\text{(precisions add)}\\
\eta_c &= \eta_a + \eta_b \quad &&\text{(precision-weighted means add)}
\end{align*}
```
This property is an important reason why the canonical parameterization of the Gaussian distribution is useful in Bayesian data processing.
"""
# ╔═╡ 0dd93204-b7c6-11ef-143e-2b7b182f8be1
md"""
### Code Example: Product of Two Gaussian PDFs
Let's plot the exact product of two Gaussian PDFs as well as the normalized product according to the above derivation.
"""
# ╔═╡ 0dd93236-b7c6-11ef-2656-b914f13c4ecd
using Plots, Distributions, LaTeXStrings
d1 = Normal(0, 1) # μ=0, σ^2=1
d2 = Normal(3, 2) # μ=3, σ^2=4
# Calculate the parameters of the product d1*d2
s2_prod = (d1.σ^-2 + d2.σ^-2)^-1
m_prod = s2_prod * ((d1.σ^-2)*d1.μ + (d2.σ^-2)*d2.μ)
d_prod = Normal(m_prod, sqrt(s2_prod)) # Note that we neglect the normalization constant.
# Plot stuff
x = range(-4, stop=8, length=100)
plot(x, pdf.(d1,x), label=L"\mathcal{N}(0,1)", fill=(0, 0.1)) # Plot the first Gaussian
plot!(x, pdf.(d2,x), label=L"\mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the second Gaussian
plot!(x, pdf.(d1,x) .* pdf.(d2,x), label=L"\mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the exact product
plot!(x, pdf.(d_prod,x), label=L"Z^{-1} \mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the normalized Gaussian product
# ╔═╡ 0dd93f08-b7c6-11ef-3ad5-97d01baafa7c
md"""
### Bayesian Inference with Multiple Observations
Now consider that we measure a data set ``D = \{x_1, x_2, \ldots, x_N\}``, with measurements
```math
\begin{aligned}
x_n &= \theta + \epsilon_n \\
\epsilon_n &\sim \mathcal{N}(0,\sigma^2)
\end{aligned}
```
and the same prior for ``\theta``:
```math
\theta \sim \mathcal{N}(\mu_0,\sigma_0^2)
```
Let's derive the distribution ``p(x_{N+1}|D)`` for the next sample.
"""
# ╔═╡ 0dd94cb4-b7c6-11ef-0d42-5f5f3b071afa
md"""
##### Inference
First, we derive the posterior for ``\theta``:
```math
\begin{align*}
p(\theta|D) \propto \underbrace{\mathcal{N}(\theta|\mu_0,\sigma_0^2)}_{\text{prior}} \cdot \underbrace{\prod_{n=1}^N \mathcal{N}(x_n|\theta,\sigma^2)}_{\text{likelihood}}
\end{align*}
```
which is a product of ``N+1`` Gaussians and is therefore also Gaussian distributed.
"""
# ╔═╡ 0dd96092-b7c6-11ef-08b6-99348eca8529
md"""
Using the property that precisions and precision-weighted means add when Gaussians are multiplied, we can immediately write the posterior
```math
p(\theta|D) = \mathcal{N} (\theta |\, \mu_N, \sigma_N^2)
```
as
```math
\begin{align*}
\frac{1}{\sigma_N^2} &= \frac{1}{\sigma_0^2} + \sum_n \frac{1}{\sigma^2} \qquad &\text{(B-2.142)} \\
\mu_N &= \sigma_N^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \sum_n \frac{1}{\sigma^2} x_n \right) \qquad &\text{(B-2.141)}
\end{align*}
```
"""
# ╔═╡ 0dd992ee-b7c6-11ef-3add-cdf7452bc514
md"""
##### Application: prediction of a future sample
We now have a posterior for the model parameters. Let's write down what we know about the next sample ``x_{N+1}``.
```math
\begin{align*}
p(x_{N+1}|D) &= \int p(x_{N+1}|\theta) p(\theta|D)\mathrm{d}\theta \\
&= \int \mathcal{N}(x_{N+1}|\theta,\sigma^2) \mathcal{N}(\theta|\mu_N,\sigma^2_N) \mathrm{d}\theta \\
&= \int \mathcal{N}(\theta|x_{N+1},\sigma^2) \mathcal{N}(\theta|\mu_N,\sigma^2_N) \mathrm{d}\theta \\
&= \int \mathcal{N}(x_{N+1}|\mu_N, \sigma^2_N +\sigma^2 ) \mathcal{N}(\theta|\cdot,\cdot)\mathrm{d}\theta \tag{use SRG-6} \\
&= \mathcal{N}(x_{N+1}|\mu_N, \sigma^2_N +\sigma^2 ) \underbrace{\int \mathcal{N}(\theta|\cdot,\cdot)\mathrm{d}\theta}_{=1} \\
&=\mathcal{N}(x_{N+1}|\mu_N, \sigma^2_N +\sigma^2 )
\end{align*}
```
"""
# ╔═╡ 0dd9a40a-b7c6-11ef-2864-8318d8f3d827
md"""
The uncertainty about ``x_{N+1}`` involves both the uncertainty about the parameter (``\sigma_N^2``) and the observation noise (``\sigma^2``).
"""
# ╔═╡ 0dd9b71a-b7c6-11ef-2c4a-a3f9e7f2bc87
md"""
### Maximum Likelihood Estimation for the Gaussian
In order to determine the *maximum likelihood* estimate of ``\theta``, we let ``\sigma_0^2 \rightarrow \infty`` (which leads to a uniform prior for ``\theta``), yielding ``\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2}`` and consequently
```math
\begin{align*}
\mu_{\text{ML}} = \left.\mu_N\right\vert_{\sigma_0^2 \rightarrow \infty} = \sigma_N^2 \, \left( \frac{1}{\sigma^2}\sum_n x_n \right) = \frac{1}{N} \sum_{n=1}^N x_n
\end{align*}
```
"""
# ╔═╡ 0dd9ccfa-b7c6-11ef-2379-2967a0b4ad07
md"""
With an expression for the maximum likelihood estimate in hand, we can now rewrite the (Bayesian) posterior mean for ``\theta`` as
```math
\begin{align*}
\underbrace{\mu_N}_{\text{posterior}} &= \sigma_N^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \sum_n \frac{1}{\sigma^2} x_n \right) \\
&= \frac{\sigma_0^2 \sigma^2}{N\sigma_0^2 + \sigma^2} \, \left( \frac{1}{\sigma_0^2} \mu_0 + \sum_n \frac{1}{\sigma^2} x_n \right) \\
&= \frac{ \sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N \sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{\text{ML}} \\
&= \underbrace{\mu_0}_{\text{prior}} + \underbrace{\underbrace{\frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2}}_{\text{gain}}\cdot \underbrace{\left(\mu_{\text{ML}} - \mu_0 \right)}_{\text{prediction error}}}_{\text{correction}}\tag{B-2.141}
\end{align*}
```
"""
# ╔═╡ 0dd9db78-b7c6-11ef-1005-73e5d7a4fc4b
md"""
Hence, the posterior mean always lies somewhere between the prior mean ``\mu_0`` and the maximum likelihood estimate (the "data" mean) ``\mu_{\text{ML}}``.
"""
# ╔═╡ 0dd9ed22-b7c6-11ef-19e5-038711d75259
md"""
### Conditioning and Marginalization of a Gaussian
Let ``z = \begin{bmatrix} x \\ y \end{bmatrix}`` be jointly normally distributed as
```math
\begin{align*}
p(z) &= \mathcal{N}(z | \mu, \Sigma)
=\mathcal{N} \left( \begin{bmatrix} x \\ y \end{bmatrix} \left| \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} \right. \right)
\end{align*}
```
"""
# ╔═╡ 0dd9fb08-b7c6-11ef-0350-c529776149da
md"""
Since covariance matrices are by definition symmetric, it follows that ``\Sigma_x`` and ``\Sigma_y`` are symmetric and ``\Sigma_{xy} = \Sigma_{yx}^T``.
"""
# ╔═╡ 0dda09f4-b7c6-11ef-2429-377131c95b8e
md"""
Let's factorize ``p(z) = p(x,y)`` as ``p(x,y) = p(y|x) p(x)`` through conditioning and marginalization.
"""
# ╔═╡ 0dda16ce-b7c6-11ef-3b84-056673f08e89
md"""
```math
\begin{equation*}
\text{conditioning: }\boxed{ p(y|x) = \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right)}
\end{equation*}
```
"""
# ╔═╡ 0dda22f4-b7c6-11ef-05ec-ef5e23c533a1
md"""
```math
\begin{equation*}
\text{marginalization: } \boxed{ p(x) = \mathcal{N}\left( x|\mu_x, \Sigma_x \right)}
\end{equation*}
```
"""
# ╔═╡ 0dda301e-b7c6-11ef-0188-0d6a9782abfa
md"""
**proof**: in Bishop pp.87-89
"""
# ╔═╡ 0dda3d8e-b7c6-11ef-0e2e-9942afc06c32
md"""
Hence, conditioning and marginalization in Gaussians leads to Gaussians again. This is very useful for applications to Bayesian inference in jointly Gaussian systems.
"""
# ╔═╡ 0dda4b3a-b7c6-11ef-17c2-5f5ccd912eee
md"""
With a natural parameterization of the Gaussian ``p(z) = \mathcal{N}_c(z|\eta,\Lambda)`` with precision matrix ``\Lambda = \Sigma^{-1} = \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{yx} & \Lambda_y \end{bmatrix}``, the conditioning operation takes a simpler form; see Bishop pg.90, eqs. 2.96 and 2.97.
"""
# ╔═╡ 0dda6b2e-b7c6-11ef-14ee-25d9a3acaf11
md"""
As an exercise, interpret the formula for the conditional mean (``\mathbb{E}[y|x]=\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x)``) as a prediction-correction operation.
"""
# ╔═╡ 0dda770e-b7c6-11ef-2988-397f0085c3a3
md"""
### Code Example: Joint, Marginal, and Conditional Gaussian Distributions
Let's plot the joint, marginal, and conditional distributions.
"""
# ╔═╡ 0dda774a-b7c6-11ef-2750-4960eef0932b
using Plots, LaTeXStrings, Distributions
# Define the joint distribution p(x,y)
μ = [1.0; 2.0]
Σ = [0.3 0.7;
0.7 2.0]
joint = MvNormal(μ,Σ)
# Define the marginal distribution p(x)
marginal_x = Normal(μ[1], sqrt(Σ[1,1]))
# Plot p(x,y)
x_range = y_range = range(-2,stop=5,length=1000)
joint_pdf = [ pdf(joint, [x_range[i];y_range[j]]) for j=1:length(y_range), i=1:length(x_range)]
plot_1 = heatmap(x_range, y_range, joint_pdf, title = L"p(x, y)")
# Plot p(x)
plot_2 = plot(range(-2,stop=5,length=1000), pdf.(marginal_x, range(-2,stop=5,length=1000)), title = L"p(x)", label="", fill=(0, 0.1))
# Plot p(y|x = 0.1)
x = 0.1
conditional_y_m = μ[2]+Σ[2,1]*inv(Σ[1,1])*(x-μ[1])
conditional_y_s2 = Σ[2,2] - Σ[2,1]*inv(Σ[1,1])*Σ[1,2]
conditional_y = Normal(conditional_y_m, sqrt.(conditional_y_s2))
plot_3 = plot(range(-2,stop=5,length=1000), pdf.(conditional_y, range(-2,stop=5,length=1000)), title = L"p(y|x = %$x)", label="", fill=(0, 0.1))
plot(plot_1, plot_2, plot_3, layout=(1,3), size=(1200,300))
# ╔═╡ 0dda842e-b7c6-11ef-24b6-19e2fad91333
md"""
As is clear from the plots, the conditional distribution is a renormalized slice from the joint distribution.
"""
# ╔═╡ 0dda9086-b7c6-11ef-2455-732cd6d69407
md"""
### Example: Conditioning of Gaussian
Consider (again) the system
```math
\begin{align*}
p(x\,|\,\theta) &= \mathcal{N}(x\,|\,\theta,\sigma^2) \\
p(\theta) &= \mathcal{N}(\theta\,|\,\mu_0,\sigma_0^2)
\end{align*}
```
"""
# ╔═╡ 0dda9d36-b7c6-11ef-1ab4-7b341b8cfcdf
md"""
Let ``z = \begin{bmatrix} x \\ \theta \end{bmatrix}``. The distribution for ``z`` is then given by (Exercise)
```math
p(z) = p\left(\begin{bmatrix} x \\ \theta \end{bmatrix}\right) = \mathcal{N} \left( \begin{bmatrix} x\\
\theta \end{bmatrix}
\,\left|\, \begin{bmatrix} \mu_0\\
\mu_0\end{bmatrix},
\begin{bmatrix} \sigma_0^2+\sigma^2 & \sigma_0^2\\
\sigma_0^2 &\sigma_0^2
\end{bmatrix}
\right. \right)
```
"""
# ╔═╡ 0ddaa9f4-b7c6-11ef-01a0-a78e551e6414
md"""
Direct substitution of the rule for Gaussian conditioning leads to the <a id="precision-weighted-update">posterior</a> (derivation as an Exercise):
```math
\begin{align*}
p(\theta|x) &= \mathcal{N} \left( \theta\,|\,\mu_1, \sigma_1^2 \right)\,,
\end{align*}
```
with
```math
\begin{align*}
K &= \frac{\sigma_0^2}{\sigma_0^2+\sigma^2} \qquad \text{($K$ is called the Kalman gain)}\\
\mu_1 &= \mu_0 + K \cdot (x-\mu_0)\\
\sigma_1^2 &= \left( 1-K \right) \sigma_0^2
\end{align*}
```
"""
# ╔═╡ 0ddab62e-b7c6-11ef-1b65-df9e3d1087d6
md"""
```math
\Rightarrow
```
Moral: For jointly Gaussian systems, we can do inference simply in one step by using the formulas for conditioning and marginalization.
"""
# ╔═╡ 0ddae00e-b7c6-11ef-33f2-b565ce8fc3ba
md"""
### Recursive Bayesian Estimation for Adaptive Signal Processing
Consider the signal ``x_t=\theta+\epsilon_t``, where ``D_t= \left\{x_1,\ldots,x_t\right\}`` is observed *sequentially* (over time).
**Problem**: Derive a recursive algorithm for ``p(\theta|D_t)``, i.e., an update rule for (posterior) ``p(\theta|D_t)`` based on (prior) ``p(\theta|D_{t-1})`` and (new observation) ``x_t``.
"""
# ╔═╡ 0ddafb7a-b7c6-11ef-3c3f-c9fa7af39c92
md"""
##### Model specification
Let's define the estimate after ``t`` observations (i.e., our *solution* ) as ``p(\theta|D_t) = \mathcal{N}(\theta\,|\,\mu_t,\sigma_t^2)``.
We define the joint distribution for ``\theta`` and ``x_t``, given background ``D_{t-1}``, by
```math
\begin{align*} p(x_t,\theta \,|\, D_{t-1}) &= p(x_t|\theta) \, p(\theta|D_{t-1}) \\
&= \underbrace{\mathcal{N}(x_t\,|\, \theta,\sigma^2)}_{\text{likelihood}} \, \underbrace{\mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2)}_{\text{prior}}
\end{align*}
```
"""
# ╔═╡ 0ddb085c-b7c6-11ef-34fd-6b1b18a95ff1
md"""
##### Inference
Use Bayes rule,
```math
\begin{align*}
p(\theta|D_t) &= p(\theta|x_t,D_{t-1}) \\
&\propto p(x_t,\theta | D_{t-1}) \\
&= p(x_t|\theta) \, p(\theta|D_{t-1}) \\
&= \mathcal{N}(x_t|\theta,\sigma^2) \, \mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2) \\
&= \mathcal{N}(\theta|x_t,\sigma^2) \, \mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2) \;\;\text{(note this trick)}\\
&= \mathcal{N}(\theta|\mu_t,\sigma_t^2) \;\;\text{(use Gaussian multiplication formula SRG-6)}
\end{align*}
```
with
```math
\begin{align*}
K_t &= \frac{\sigma_{t-1}^2}{\sigma_{t-1}^2+\sigma^2} \qquad \text{(Kalman gain)}\\
\mu_t &= \mu_{t-1} + K_t \cdot (x_t-\mu_{t-1})\\
\sigma_t^2 &= \left( 1-K_t \right) \sigma_{t-1}^2
\end{align*}
```
"""
# ╔═╡ 0ddb163a-b7c6-11ef-2b06-a1d6677b7191
md"""
This linear *sequential* estimator of mean and variance in Gaussian observations is called a **Kalman Filter**.
The new observation ``x_t`` 'updates' the old estimate ``\mu_{t-1}`` by a quantity that is proportional to the *innovation* (or *residual*) ``\left( x_t - \mu_{t-1} \right)``.
"""
# ╔═╡ 0ddb2302-b7c6-11ef-1f50-27711dbe4d33
md"""
The so-called Kalman gain ``K_t`` serves as a "learning rate" (step size) in the parameter update equation ``\mu_t = \mu_{t-1} + K_t \cdot (x_t-\mu_{t-1})``. Note that *you* don't need to choose the learning rate. Bayesian inference computes its own (optimal) learning rates.
"""
# ╔═╡ 0ddb2fa0-b7c6-11ef-3ac5-8979f2a0a00c
md"""
Note that the uncertainty about ``\theta`` decreases over time (since ``0<(1-K_t)<1``). If we assume that the statistics of the system do not change (stationarity), each new sample provides new information about the process, so the uncertainty decreases.
"""
# ╔═╡ 0ddb3c34-b7c6-11ef-2a77-895cbc5796f3
md"""
Recursive Bayesian estimation as discussed here is the basis for **adaptive signal processing** algorithms such as Least Mean Squares (LMS) and Recursive Least Squares (RLS). Both RLS and LMS are special cases of Recursive Bayesian estimation.
"""
# ╔═╡ 0ddb4b54-b7c6-11ef-121d-5d00e547debd
md"""
### Code Example: Kalman Filter
Let's implement the Kalman filter described above. We'll use it to recursively estimate the value of ``\theta`` based on noisy observations.
"""
# ╔═╡ 0ddb4bb4-b7c6-11ef-373a-ab345190363a
using Plots, Distributions
n = 100 # specify number of observations
θ = 2.0 # true value of the parameter we would like to estimate
noise_σ2 = 0.3 # variance of observation noise
observations = sqrt(noise_σ2) * randn(n) .+ θ # scale by the standard deviation (randn has unit variance), so the noise variance is noise_σ2
function perform_kalman_step(prior :: Normal, x :: Float64, noise_σ2 :: Float64)
    prior_σ2 = var(prior)                            # variance of the prior distribution
    K = prior_σ2 / (noise_σ2 + prior_σ2)             # compute the Kalman gain
    posterior_μ = mean(prior) + K*(x - mean(prior))  # update the posterior mean
    posterior_σ2 = prior_σ2 * (1.0 - K)              # update the posterior variance
    return Normal(posterior_μ, sqrt(posterior_σ2))   # return the posterior distribution (Normal is parameterized by the standard deviation)
end
post_μ = fill!(Vector{Float64}(undef,n + 1), NaN) # means of p(θ|D) over time
post_σ2 = fill!(Vector{Float64}(undef,n + 1), NaN) # variances of p(θ|D) over time
prior = Normal(0, 1) # specify the prior distribution (you can play with the parameterization of this to get a feeling of how the Kalman filter converges)
post_μ[1] = mean(prior) # save prior mean and variance to show these in plot
post_σ2[1] = var(prior)
for (i, x) in enumerate(observations) # note that this loop demonstrates Bayesian learning on streaming data; we update the prior distribution using observation(s), after which this posterior becomes the new prior for future observations
    posterior = perform_kalman_step(prior, x, noise_σ2) # compute the posterior distribution given the observation
    post_μ[i + 1] = mean(posterior) # save the mean of the posterior distribution
    post_σ2[i + 1] = var(posterior) # save the variance of the posterior distribution
    prior = posterior # the posterior becomes the prior for future observations
end
obs_scale = collect(2:n+1)
scatter(obs_scale, observations, label=L"D") # scatter the observations
post_scale = collect(1:n+1)
plot!(post_scale, post_μ, ribbon=sqrt.(post_σ2), linewidth=3, label=L"p(θ | D_t)") # plot the estimated means of the intermediate posteriors with a ±1 standard deviation ribbon
plot!(post_scale, θ*ones(n + 1), linewidth=2, label=L"θ") # plot the true value of θ
# ╔═╡ 0ddb7294-b7c6-11ef-0585-3f1a218aeb42
md"""
The shaded area spans two standard deviations (``\mu_t \pm \sigma_t``) of the posterior ``p(\theta|D_t)``. The variance of the posterior is guaranteed to decrease monotonically for the standard Kalman filter.
"""
# ╔═╡ 0ddb9904-b7c6-11ef-3808-35b8ee37dd04
md"""
### <a id="product-of-gaussians">Product of Normally Distributed Variables</a>
(We've seen that) the sum of two Gaussian-distributed variables is also Gaussian distributed.
"""
# ╔═╡ 0ddba9ee-b7c6-11ef-3148-9db5fbb13d77
md"""
Does the *product* of two Gaussian-distributed variables also have a Gaussian distribution?
"""
# ╔═╡ 0ddbba2e-b7c6-11ef-04cf-1119024af1d1
md"""
**No**! In general this is a difficult computation. As an example, let's compute ``p(z)`` for ``Z=XY`` for the special case that ``X\sim \mathcal{N}(0,1)`` and ``Y\sim \mathcal{N}(0,1)``.
```math
\begin{align*}
p(z) &= \int_{X,Y} p(z|x,y)\,p(x,y)\,\mathrm{d}x\mathrm{d}y \\
&= \frac{1}{2 \pi}\int \delta(z-xy) \, e^{-(x^2+y^2)/2} \, \mathrm{d}x\mathrm{d}y \\
&= \frac{1}{\pi} \int_0^\infty \frac{1}{x} e^{-(x^2+z^2/x^2)/2} \, \mathrm{d}x \\
&= \frac{1}{\pi} \mathrm{K}_0( \lvert z\rvert )\,.
\end{align*}
```
where ``\mathrm{K}_n(z)`` is a [modified Bessel function of the second kind](http://mathworld.wolfram.com/ModifiedBesselFunctionoftheSecondKind.html).
"""
# ╔═╡ 0ddbc78a-b7c6-11ef-2ce4-f76fa4153e4b
md"""
### Code Example: Product of Gaussian Distributions
We plot ``p(Z=XY)`` and ``p(X)p(Y)`` for ``X\sim\mathcal{N}(0,1)`` and ``Y \sim \mathcal{N}(0,1)`` to give an idea of how these distributions differ.
"""
# ╔═╡ 0ddbc7c8-b7c6-11ef-004f-8bfaa5f29eba
using Plots, Distributions, SpecialFunctions, LaTeXStrings
X = Normal(0,1)
Y = Normal(0,1)
pdf_product_std_normals(z::Vector) = (besselk.(0, abs.(z))./π)
range1 = collect(range(-4,stop=4,length=100))
plot(range1, pdf.(X, range1), label=L"p(X)=p(Y)=\mathcal{N}(0,1)", fill=(0, 0.1))
plot!(range1, pdf.(X,range1).*pdf.(Y,range1), label=L"p(X)*p(Y)", fill=(0, 0.1))
plot!(range1, pdf_product_std_normals(range1), label=L"p(Z=X*Y)", fill=(0, 0.1))
# ╔═╡ 0ddbd3ce-b7c6-11ef-20e1-070d736f7b95
md"""
In short, Gaussian-distributed variables remain Gaussian in linear systems, but this is not the case in non-linear systems.
"""
# ╔═╡ 0ddbf246-b7c6-11ef-16a5-bbf396f80915
md"""
### Solution to Example Problem
We apply maximum likelihood estimation to fit a 2-dimensional Gaussian model (``m``) to data set ``D``. Next, we evaluate ``p(x_\bullet \in S | m)`` by (numerical) integration of the Gaussian pdf over ``S``: ``p(x_\bullet \in S | m) = \int_S p(x|m) \mathrm{d}x``.
"""
# ╔═╡ 0ddbf278-b7c6-11ef-20f5-7ffd3163b14f
using HCubature, LinearAlgebra, Plots, Distributions # HCubature provides the numerical integration routine
# Maximum likelihood estimation of 2D Gaussian
N = size(D, 2) # number of observations (columns of D)
μ = 1/N * sum(D,dims=2)[:,1] # ML estimate of the mean
D_min_μ = D - repeat(μ, 1, N) # center the data
Σ = Hermitian(1/N * D_min_μ*D_min_μ') # ML estimate of the covariance matrix
m = MvNormal(μ, convert(Matrix, Σ));
contour(range(-3, 4, length=100), range(-3, 4, length=100), (x, y) -> pdf(m, [x, y]))
# Numerical integration of p(x|m) over S:
(val,err) = hcubature((x)->pdf(m,x), [0., 1.], [2., 2.])
println("p(x⋅∈S|m) ≈ $(val)")
scatter!(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray, label=L"S")
# ╔═╡ 0ddc02d6-b7c6-11ef-284e-018c7895536e
md"""
### Summary
A **linear transformation** ``z=Ax+b`` of a Gaussian variable ``x \sim \mathcal{N}(\mu_x,\Sigma_x)`` is Gaussian distributed as
```math
p(z) = \mathcal{N} \left(z \,|\, A\mu_x+b, A\Sigma_x A^T \right)
```
Bayesian inference with a Gaussian prior and Gaussian likelihood leads to an analytically computable Gaussian posterior, because of the **multiplication rule for Gaussians**:
```math
\begin{equation*}
\mathcal{N}(x|\mu_a,\Sigma_a) \cdot \mathcal{N}(x|\mu_b,\Sigma_b) = \underbrace{\mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)}_{\text{normalization constant}} \cdot \mathcal{N}(x|\mu_c,\Sigma_c)
\end{equation*}
```
where
```math
\begin{align*}
\Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\
\Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b
\end{align*}
```
**Conditioning and marginalization** of a multivariate Gaussian distribution yields Gaussian distributions. In particular, the joint distribution
```math
\mathcal{N} \left( \begin{bmatrix} x \\ y \end{bmatrix} \left| \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} \right. \right)
```
can be decomposed as
```math
\begin{align*}
p(y|x) &= \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right) \\
p(x) &= \mathcal{N}\left( x|\mu_x, \Sigma_x \right)
\end{align*}
```
Here's a nice [summary of Gaussian calculations](https://github.com/bertdv/AIP-5SSB0/raw/master/lessons/notebooks/files/RoweisS-gaussian_formulas.pdf) by Sam Roweis.
"""
# ╔═╡ 0ddc1028-b7c6-11ef-1eec-6d72e52f4431
md"""
## <center> OPTIONAL SLIDES</center>
"""
# ╔═╡ 0ddc1c2e-b7c6-11ef-00b6-e98913a96420
md"""
### <a id="inference-for-precision">Inference for the Precision Parameter of the Gaussian</a>
Again, we consider an observed data set ``D = \{x_1, x_2, \ldots, x_N\}`` and try to explain these data by a Gaussian distribution.
"""
# ╔═╡ 0ddc287e-b7c6-11ef-1f72-910e6e7b06bb
md"""
We discussed earlier Bayesian inference for the mean with a given variance. Now we will derive a posterior for the variance if the mean is given. (Technically, we will do the derivation for a precision parameter ``\lambda = \sigma^{-2}``, since the discussion is a bit more straightforward for the precision parameter).
"""
# ╔═╡ 0ddc367a-b7c6-11ef-38f9-09fb462987dc
md"""
##### Model specification
The likelihood for the precision parameter is
```math
\begin{align*}
p(D|\lambda) &= \prod_{n=1}^N \mathcal{N}\left(x_n \,|\, \mu, \lambda^{-1} \right) \\
&\propto \lambda^{N/2} \exp\left\{ -\frac{\lambda}{2}\sum_{n=1}^N \left(x_n - \mu \right)^2\right\} \tag{B-2.145}
\end{align*}
```
"""
# ╔═╡ 0ddc4796-b7c6-11ef-2156-8b3d6899a8c0
md"""
The conjugate distribution for this function of ``\lambda`` is the [*Gamma* distribution](https://en.wikipedia.org/wiki/Gamma_distribution), given by
```math
p(\lambda\,|\,a,b) = \mathrm{Gam}\left( \lambda\,|\,a,b \right) \triangleq \frac{1}{\Gamma(a)} b^{a} \lambda^{a-1} \exp\left\{ -b \lambda\right\}\,, \tag{B-2.146}
```
where ``a>0`` and ``b>0`` are known as the *shape* and *rate* parameters, respectively.
<img src="./figures/B-fig-2.13.png" width="600px">
(Bishop fig.2.13). Plots of the Gamma distribution ``\mathrm{Gam}\left( \lambda\,|\,a,b \right)`` for different values of ``a`` and ``b``.