Model Assessment
========================================================
In this activity we will use a series of visualizations and statistical measures to assess the performance of our Super Simple Ecosystem Model at the Metolius site.
Let's start by loading the ensemble output from the previous lab and the observed flux data for the site.
```{r}
## load libraries
library("plotrix")
library(rpart)
library(randomForest)
library(scoringRules)
## load SSEM output
load("data/Ex10.output.RData")
load("data/Ex10.output.pf.RData")
## load flux tower data
L4 = read.csv("data/AMF_USMe2_2005_L4_h_V002.txt",header=TRUE,na.strings="-9999")
L4[L4==-9999] = NA
```
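
Before going further, a quick optional check that the objects saved in the previous exercise loaded as expected; the object names below (`NEE.ens`, `NEE.pf`, and later `inputs`) are assumed to be those written out by Exercise 10.

```{r}
## quick sanity check on the loaded objects (names assumed from the Exercise 10 output)
dim(NEE.ens)   ## time steps x ensemble members
dim(NEE.pf)    ## should have the same number of rows (time steps)
nrow(L4)       ## number of half-hourly flux records; should match the rows above
```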
Model vs. Data
--------------
In the following section we will begin with some basic diagnostic plots and statistics assessing the NEE predicted by our simple ecosystem model. Specifically, we will calculate the Root Mean Square Error (RMSE), bias, correlation coefficient, and regression slope of the relationship between observed and predicted NEE, for both the original ensemble and the particle filter. We will also generate scatter plots of predicted vs. observed values.
```{r}
## Calculate ensemble means & apply QAQC
qaqc = (L4$qf_NEE_st == 0)
NEE.ens.bar = -apply(NEE.ens[,],1,mean)
NEE.pf.bar = -apply(NEE.pf[,],1,mean)
E = NEE.ens.bar[qaqc]
P = NEE.pf.bar[qaqc]
O = L4$NEE_st_fMDS[qaqc]
## Model vs obs regressions
NEE.ens.fit = lm(O ~ E)
NEE.pf.fit = lm(O ~ P)
## performance stats
stats = as.data.frame(matrix(NA,4,2))
rownames(stats) <- c("RMSE","Bias","cor","slope")
colnames(stats) <- c("ens","pf")
stats["RMSE",'ens'] = sqrt(mean((E-O)^2))
stats["RMSE",'pf'] = sqrt(mean((P-O)^2))
stats['Bias','ens'] = mean(E-O)
stats['Bias','pf'] = mean(P-O)
stats['cor','ens'] = cor(E,O)
stats['cor','pf'] = cor(P,O)
stats['slope','ens'] = coef(NEE.ens.fit)[2]
stats['slope','pf'] = coef(NEE.pf.fit)[2]
knitr::kable(stats)
## predicted-observed
plot(E,O,pch=".",xlab="ensemble",ylab='observed',main='NEE (umol/m2/sec)')
abline(0,1,col=2,lwd=2)
abline(NEE.ens.fit,col=3,lwd=3,lty=2)
legend("bottomright",legend=c('obs','1:1','reg'),col=1:3,lwd=3)
plot(P,O,pch=".",xlab="particle filter",ylab='observed',main='NEE (umol/m2/sec)')
abline(0,1,col=2,lwd=2)
abline(NEE.pf.fit,col=3,lwd=3,lty=2)
legend("bottomright",legend=c('obs','1:1','reg'),col=1:3,lwd=3)
```
Question 1 [A]: Which version of the model performed better? Do the statistics or plots give any indication about what parameters might need to be fixed, or processes refined, in the model?

Question 2 [B]: Plot a time series of observed and modeled **daily mean** NEE. Make sure to use the gap-filled NEE estimates, since flux data are not missing at random. Hint: you can use `tapply` and the day-of-year column in the data.
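
As a starting point for Question 2, here is a hedged sketch of the `tapply` pattern for a daily mean of one of the half-hourly series (flooring `DoY` to an integer day is one reasonable choice, not the only one); repeating it for the ensemble and/or particle filter means and plotting both together is left to you.

```{r}
## hedged sketch: daily mean of the gap-filled observations using tapply
doy.int = floor(L4$DoY)                                  ## integer day of year
NEE.obs.daily = tapply(L4$NEE_st_fMDS, doy.int, mean)    ## daily mean observed NEE
plot(as.numeric(names(NEE.obs.daily)), NEE.obs.daily,
     type="l", xlab="Day of Year", ylab="NEE (umol/m2/sec)")
```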
Taylor diagram
--------------
Next, let's use a Taylor diagram to pull our summary statistics together into one plot. One of the advantages of the Taylor diagram is that it makes it simpler to visually diagnose relative differences in model performance, especially when comparing multiple models or different versions of the same model. In the figure below we'll begin by plotting the ensemble mean and the particle filter mean. While not common, the Taylor diagram also provides a way of expressing model and data uncertainty by plotting ensemble estimates of both. Below we add all 200 members of the model ensemble, as well as a Monte Carlo estimate of the observation error in the flux data. The latter is based on the work of Richardson et al. (2006), who showed that eddy covariance measurement errors follow a non-symmetric, heteroskedastic Laplace distribution. The non-symmetric part refers to the fact that there is greater error in positive fluxes (= respiration, typically nocturnal measurements) than in negative ones; for example, with the parameters used below, an observed flux of +5 umol/m2/sec has a Laplace scale of 0.62 + 0.63*5 ≈ 3.8, while a flux of -5 has a scale of 1.42 + 0.19*5 ≈ 2.4.
```{r}
## Taylor diagrams: Ensemble
taylor.diagram(ref=O,model=E,normalize=TRUE,ref.sd=TRUE)
## add full ensemble
for(i in 1:ncol(NEE.ens)){
taylor.diagram(ref=O,model=-NEE.ens[qaqc,i],col=2,pch=".",add=TRUE,normalize=TRUE)
}
# Taylor Diagram: particle filter
taylor.diagram(ref=O,model=P,add=TRUE,normalize=TRUE,col=3)
## add data uncertainty
rlaplace = function(n,mu,b){
return(mu + ifelse(rbinom(n,1,0.5),1,-1)*rexp(n,b))
}
beta = ifelse(O > 0,0.62+0.63*O,1.42-0.19*O) #Heteroskedasticity, parameters from Richardson et al 2006
for(i in 1:200){
x = rlaplace(length(O),O,beta)
taylor.diagram(ref=O,model=x,col=5,add=TRUE,normalize=TRUE)
}
legend("topright",legend=c("ens","PF","obsUncert"),col=2:5,pch=20,cex=0.7)
```
Question 3 [A]: What did you learn about model performance from the Taylor diagram?

Question 4 [B]: How do our simple models compare to the ensemble of ecosystem models in Figure 7 of Schwalm et al. (2010), "A model-data intercomparison of CO2 exchange across North America: results from the North American Carbon Program site synthesis", J. Geophys. Res.?
Bayesian p-values
-----------------
Next, let's look at the patterns in the Bayesian 'p-values', which are essentially the quantiles of the observed values relative to the predictive distributions. The ideal distribution of quantiles is flat (values show up as frequently as predicted); overcalibrated models tend to have too many observations near 50%, while poorly calibrated models will tend to produce a lot of 0's and 1's (consistently under- or over-predicting).
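
To build intuition before looking at our own model, here is a small purely synthetic illustration (not part of the assessment itself): a calibrated forecast gives roughly uniform quantiles, an overly wide (overcalibrated) forecast piles observations up near 0.5, and a biased forecast piles them up near 0 or 1.

```{r}
## synthetic illustration of quantile ("Bayesian p-value") diagnostics
nobs = 1000; nens = 500
y = rnorm(nobs)                                           ## "observations" ~ N(0,1)
ens.good   = matrix(rnorm(nobs*nens, 0, 1), nobs, nens)   ## calibrated forecast
ens.wide   = matrix(rnorm(nobs*nens, 0, 3), nobs, nens)   ## overly wide (overcalibrated) forecast
ens.biased = matrix(rnorm(nobs*nens, 2, 1), nobs, nens)   ## biased forecast
quant = function(ens) sapply(seq_len(nobs), function(i) mean(y[i] > ens[i,]))
par(mfrow=c(1,3))
hist(quant(ens.good),   main="calibrated", xlab="quantile")
hist(quant(ens.wide),   main="too wide",   xlab="quantile")
hist(quant(ens.biased), main="biased",     xlab="quantile")
par(mfrow=c(1,1))
```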
```{r}
O = L4$NEE_st_fMDS ## observed
pval.pf = numeric(nrow(NEE.pf))
for(i in 1:nrow(NEE.pf)){
  pval.pf[i] = sum(O[i] > -NEE.pf[i,])/ncol(NEE.pf) ## quantile of obs i within the PF predictive distribution
}
plot(pval.pf) ## quantile 'residuals'
hist(pval.pf,probability=TRUE) ## quantile distribution (should be flat)
```
Question 5 [A]: How do **both** the ensemble and particle filter perform in terms of the predicted quantiles?
CRPS
----
Building on our ideas about model skill scores (e.g. accuracy vs. precision) and about assessing probabilistic forecasts (e.g. Bayesian p-values) is the idea of skill scores for probabilistic forecasts. A number of such scores exist, with one of the more popular being the Continuous Ranked Probability Score (CRPS).

$$CRPS = {1 \over m} \sum_{i=1}^m |X_i - y| - {1 \over {2m^2}} \sum_{i=1}^m \sum_{j=1}^m |X_i - X_j|$$

where $y$ is the data and the $X_i$ are the ensemble members within the prediction. The first term in the CRPS is the mean absolute error of the ensemble; the second is a correction based on the magnitude of the ensemble spread. To start, let's apply the CRPS to the particle filter forecast, which returns a score for every data point. To summarize and visualize this we then calculate the average CRPS, plot CRPS over time, and plot a histogram of its distribution.
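
Since the formula above is simple, it can help to compute it directly for a toy example and check it against `scoringRules::crps_sample`; this is just a sanity-check sketch, and the exercise itself uses the package function.

```{r}
## hedged sketch: CRPS computed directly from the formula for one observation y and one ensemble X
crps.manual = function(y, X){
  m = length(X)
  mean(abs(X - y)) - sum(abs(outer(X, X, "-"))) / (2 * m^2)
}
X = rnorm(100)                                          ## toy ensemble
crps.manual(0.5, X)                                     ## by hand
scoringRules::crps_sample(y = 0.5, dat = matrix(X, 1))  ## should agree with the package's sample-based estimator
```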
```{r}
crps = scoringRules::crps_sample(y = O, dat = -NEE.pf) ## model output sign-flipped to the observed NEE convention, as above
mean(crps)
plot(crps)
hist(crps)
```
It's not particularly easy to see how the model is doing from the time series, so we also calculate the mean diurnal cycle of the score and the relationship between model skill and the observed flux.
```{r}
### diurnal cycle of model skill
crps.diurnal = tapply(crps,L4$Hour,mean)
time.of.day = as.numeric(names(crps.diurnal))
plot(time.of.day,crps.diurnal)
## CRPS skill as a function of observed NEE
plot(O,crps)
```
Question 6 [A]: Compare the CRPS for the ensemble and PF models. How do they perform overall (mean) in terms of their ability to produce the correct seasonal and diurnal cycles? Explain the pattern of the relationship between NEE and CRPS.
Mining the Residuals
--------------------
In this section we'll use a few off-the-shelf data mining approaches to look at the model residuals and ask what parts of our input space are associated with the largest model error. Note that we are not limited to examining the effects of the model inputs; we might also look at potential drivers that are not included in our model, such as soil moisture, to ask whether model error is associated with our failure to include this (or other) drivers. Alternatively, we could look at other factors such as the time of day, or even at other model variables (e.g. is model error higher when LAI is large or small?).

Of the many algorithms out there we'll look at two: the Classification and Regression Tree (CART) model and the Random Forest model. For both we'll define our error metric as $(E-O)/\beta$, where $\beta$ is the scale parameter of the Laplace observation error introduced above (the heteroskedastic analogue of a standard deviation). In other words, we use the heteroskedastic observation error to reweight the residuals, to account for the fact that large residuals at times of high flux are likely due in part to high measurement error. The rescaled errors can thus be interpreted roughly as numbers of standard deviations.

The CART model builds a tree that discretely partitions the driver space into regions where the model has high or low error.

The Random Forest model is more like a response surface. It produces 'partial dependence' plots, which show the effect of each factor across its range, as well as an overall estimate of each factor's importance in explaining the model error.

The key thing to remember in all of these plots is that we're modeling the RESIDUALS in order to diagnose errors, not modeling the NEE itself.
```{r}
## define the error metric (response) and candidate drivers (predictors)
O = L4$NEE_st_fMDS[qaqc]
err = (E-O)/beta
x = cbind(inputs$PAR[qaqc],inputs$temp[qaqc])
colnames(x) = c("PAR","temp")
smp = sample.int(length(err),1000) ## take a sample of the data since some alg. are slow
### Classification tree
rpb = rpart(err ~ x) ## bias
plot(rpb)
text(rpb)
e2 = err^2
rpe = rpart(e2 ~ x) ## sq error
plot(rpe)
text(rpe)
## Random Forest
rfe = randomForest(x[smp,],abs(err[smp]))
rfe$importance
partialPlot(rfe,x[smp,],"PAR")
partialPlot(rfe,x[smp,],"temp")
```
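
As a hedged aside illustrating the point above about drivers that are not in the model, one could add time of day (the `Hour` column from the flux file, assumed to line up with the qaqc-filtered rows) to the candidate driver matrix and ask whether it explains additional error.

```{r}
## hedged sketch: add time of day as a candidate driver of model error
x2 = cbind(x, Hour = L4$Hour[qaqc])
rfe2 = randomForest(x2[smp,], abs(err[smp]))
rfe2$importance                      ## does Hour matter beyond PAR and temp?
partialPlot(rfe2, x2[smp,], "Hour")
```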
Question 7 [A]: Overall, which driver is most important in explaining model error? What conditions are most associated with model success? With model failure? Where do these results reinforce conclusions we reached earlier and where do they shine light on new patterns you may have missed earlier?
Functional Responses
--------------------
In this section we look at how well the model performed by assessing the modeled relationships between inputs and outputs and comparing them to the same relationships in the data. The raw relationships are very noisy, since many covariates are changing beyond just the single input variable we are evaluating, so in addition we calculate binned means for both the model and the data.
```{r}
## raw
plot(inputs$temp[qaqc],O,pch=".",ylab="NEE")
points(inputs$temp[qaqc],E,pch=".",col=2)
## binned
nbin = 25
#PAR = inputs$PAR[qaqc]
#x = seq(min(PAR),max(PAR),length=nbin)
Tair = inputs$temp[qaqc]
xd = seq(min(Tair),max(Tair),length=nbin)
xmid = xd[-length(xd)] + diff(xd)/2  ## bin midpoints
bin = cut(Tair,xd)
Obar = tapply(O,bin,mean,na.rm=TRUE)
Ose = tapply(O,bin,std.error,na.rm=TRUE)
Ebar = tapply(E,bin,mean,na.rm=TRUE)
Ese = tapply(E,bin,std.error,na.rm=TRUE)
OCI = -cbind(Obar-1.96*Ose,Obar,Obar+1.96*Ose)
ECI = -cbind(Ebar-1.96*Ese,Ebar,Ebar+1.96*Ese)
rng = range(rbind(OCI,ECI))
col2=ecoforecastR::col.alpha("darkgrey",0.9)
col1=ecoforecastR::col.alpha("lightgrey",0.6)
plot(xmid,Obar,ylim=rng,type='n',xlab="Air Temperature (C)",ylab="NEP (umol/m2/s)",cex.lab=1.3)
ecoforecastR::ciEnvelope(xmid,ECI[,1],ECI[,3],col=col2)
lines(xmid,ECI[,2],col="white",lwd=4)
ecoforecastR::ciEnvelope(xmid,OCI[,1],OCI[,3],col=col1)
lines(xmid,OCI[,2],col="lightgrey",lwd=4)
legend("bottom",legend=c("Model","Data"),lwd=10,col=c(col2,col1),lty=1,cex=1.7)
```
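
The binning pattern above is easy to wrap in a small helper so that it can be repeated for other drivers (e.g. PAR, which you will need for Question 8 below); this is just a hedged convenience sketch mirroring the code above, not a required part of the exercise.

```{r}
## convenience sketch: binned mean +/- 1.96 SE of y across nbin bins of a driver x
bin.ci = function(x, y, nbin = 25){
  xd   = seq(min(x), max(x), length = nbin)
  xmid = xd[-length(xd)] + diff(xd)/2          ## bin midpoints
  bin  = cut(x, xd)
  ybar = tapply(y, bin, mean, na.rm = TRUE)
  yse  = tapply(y, bin, plotrix::std.error, na.rm = TRUE)
  list(x = xmid, ci = cbind(ybar - 1.96*yse, ybar, ybar + 1.96*yse))
}
## e.g. obs.par = bin.ci(inputs$PAR[qaqc], O); mod.par = bin.ci(inputs$PAR[qaqc], E)
```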
Question 8 [A]: Evaluate the model's ability to capture functional responses to both Temperature and PAR.
Overall
-------
Below is a final summary figure of the model's performance on a daily timescale that combines many of the previous assessments.
```{r}
### other summary figures to go in multi-panel
par(mfrow=c(2,2))
## Time-series visualization, daily means
DoY = floor(L4$DoY-0.02)
uDoY = sort(unique(DoY))
ci.pf = apply(apply(NEE.pf[,],2,tapply,DoY,mean),1,mean)
NEE = -L4$NEE_st_fMDS
NEEd = tapply(NEE,DoY,mean)
plot(uDoY,ci.pf,xlab="time",ylab="NEE",type='l',ylim=range(c(ci.pf,NEEd)),cex.lab=1.3)
points(uDoY,NEEd,col=2,pch="+")
legend("topright",legend=c("Model","Data"),lty=c(1,NA),pch=c(NA,"+"),col=1:2,cex=1.3)
## predicted vs observed
plot(NEEd,ci.pf,xlab="Data",ylab="Model",cex.lab=1.3)
abline(0,1,lty=2,lwd=4)
abline(lm(ci.pf ~ NEEd),col=2,lwd=3,lty=3)
legend("topleft",legend=c("1:1","Reg"),lty=2:3,lwd=4,col=1:2,cex=1.3)
## Functional response
plot(xmid,Obar,ylim=rng,type='n',xlab="Air Temperature (C)",ylab="NEP (umol/m2/s)",cex.lab=1.3)
ecoforecastR::ciEnvelope(xmid,ECI[,1],ECI[,3],col=col2)
lines(xmid,ECI[,2],col="white",lwd=4)
ecoforecastR::ciEnvelope(xmid,OCI[,1],OCI[,3],col=col1)
lines(xmid,OCI[,2],col="lightgrey",lwd=4)
legend("bottom",legend=c("Model","Data"),lwd=10,col=c(col2,col1),lty=1,cex=1.3)
### Classification tree
par(mar=c(0,0,0,0))
rpe = rpart(e2 ~ PAR+temp,as.data.frame(x),method="anova") ## sq error
plot(rpe,margin=0.1)
text(rpe,cex=1.5)
```
# Extra Material
Comparison to flux "climatology"
-------------------------------
In the section below we calculate the long-term average NEE for each 30-minute period in the year, excluding the year we modeled (2005), as an alternative model against which to judge our process model. We then update our summary statistics and predicted vs. observed plots.
```{r}
## flux "climatology"
fluxfiles = dir("data",pattern="AMF")
fluxfiles = fluxfiles[grep("txt",fluxfiles)]
fluxfiles = fluxfiles[-grep("2005",fluxfiles)]
clim.NEE = clim.doy = NULL
for(f in fluxfiles){
ff = read.csv(file.path("data",f),header=TRUE,na.strings="-9999")
ff[ff == -9999] = NA
clim.NEE = c(clim.NEE,ff$NEE_st_fMDS)
clim.doy = c(clim.doy,ff$DoY)
}
NEE.clim=tapply(clim.NEE,clim.doy,mean,na.rm=TRUE)[1:length(qaqc)]
C = NEE.clim[qaqc]
NEE.clim.fit = lm(O ~ C)
summary(NEE.clim.fit)
stats["RMSE",3] = sqrt(mean((C-O)^2))
stats['Bias',3] = mean(C-O)
stats['cor',3] = cor(C,O)
stats['slope',3] = coef(NEE.clim.fit)[2]
colnames(stats)[3] <- "clim"
knitr::kable(stats)
plot(C,O,pch=".",xlab="climatology",ylab='observed',main='NEE (umol/m2/sec)')
abline(0,1,col=2,lwd=2)
abline(NEE.clim.fit,col=3,lwd=3,lty=2)
legend("bottomright",legend=c('obs','1:1','reg'),col=1:3,lwd=3)
## example cycle
plot(L4$DoY,-L4$NEE_st_fMDS,xlim=c(200,210),type='l',lwd=2,ylim=c(-10,20),xlab="Day of Year",ylab="NEE")
lines(L4$DoY,-NEE.clim,col=4,lwd=2,lty=2)
legend("topright",legend=c("Obs","clim"),lty=1:2,col=c(1,4),lwd=2)
```
Question 9 [B]: How does the process model perform relative to the average flux data? Which statistics showed the largest differences between the model and climatology?
Time-scales
-----------
Many ecological processes operate at multiple time scales. For example, carbon flux data responds to the diurnal cycle of light and temperature, meso-scale variability due to weather fronts, seasonal variability, and inter-annual variability driven by longer-term climate modes, as well as disturbance and succession.
In the next section we look at the average diurnal cycle of the data and models.
```{r}
## diurnal cycle
NEE.ens.diurnal = tapply(E,L4$Hour[qaqc],mean)
NEE.pf.diurnal = tapply(P,L4$Hour[qaqc],mean)
NEE.clim.diurnal = tapply(C,L4$Hour[qaqc],mean)
NEE.obs.diurnal = tapply(O,L4$Hour[qaqc],mean)
ylim=range(c(NEE.ens.diurnal,NEE.pf.diurnal,NEE.clim.diurnal,NEE.obs.diurnal),na.rm=TRUE) ## include all series plotted below
tod = sort(unique(L4$Hour))
plot(tod,NEE.ens.diurnal,ylim=ylim,col=2,xlab="Time of Day",ylab='NEE',main="Diurnal Cycle",type='l',lwd=3)
lines(tod,NEE.pf.diurnal,col=3,lwd=3)
lines(tod,NEE.clim.diurnal,col=4,lwd=3)
lines(tod,NEE.obs.diurnal,lwd=3)
legend("bottomright",legend=c("obs","ens","PF","clim"),col=1:4,pch=20,cex=0.75)
```
Question 10 [C]: What time of day has the largest uncertainty? What does this suggest about which parameter(s) need to be modified in the model, in what direction, and by approximately how much? In providing this answer, recall the structure of the model, as well as the fact that the particle filter has assimilated LAI, so we can assume that term is unbiased in that case.
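
As a final aside on time scales: the same `tapply` pattern used above aggregates the comparison to coarser scales as well. Below is a hedged sketch of a weekly-mean comparison, using only the `DoY` column.

```{r}
## hedged sketch: weekly-mean comparison of observed and particle-filter NEE
week = floor(L4$DoY / 7)
NEE.obs.week = tapply(L4$NEE_st_fMDS, week, mean, na.rm = TRUE)
NEE.pf.week  = tapply(-apply(NEE.pf, 1, mean), week, mean)  ## model sign flipped to the observed convention
plot(as.numeric(names(NEE.obs.week)) * 7, NEE.obs.week, type = "l", lwd = 2,
     xlab = "Day of Year", ylab = "NEE (umol/m2/sec)", main = "Weekly means")
lines(as.numeric(names(NEE.pf.week)) * 7, NEE.pf.week, col = 3, lwd = 2)
legend("topright", legend = c("obs", "PF"), col = c(1, 3), lwd = 2)
```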