---
editor: visual
title: THE VARIABILITY CLIMATE CHANGE IS RESPONSIBLE FOR IN VEGETATION LOSS IN GHANA
subtitle: Quantifying The Status of Galamsey With Time Series Analysis
date: "`r Sys.Date()`"
author:
  - name: Kalong Boniface
  - name: Fugah Seletey Mitchell
university: UNIVERSITY OF ENERGY AND NATURAL RESOURCES, SUNYANI
university-logo: Images/uenrlogo.png
university-logo-width: 5cm
format:
  pdf:
    # template: templates/templatex.tex
    documentclass: report
    classoption: ["onepage", "openany"]
    number-sections: true
    template-partials:
      - "before-body.tex"
      - "_titlepage.tex"
      - "graphics.tex"
    include-in-header:
      - "in-header.tex"
    toc: true
    lof: true
    lot: true
    toc-depth: 3
    toc-title: Table of Contents
    code-block-bg: lightgray
geometry: left=1.4cm, top=.8cm, right=1.4cm, bottom=1.8cm, footskip=.5cm
titlepage-geometry:
  - top=3in
  - bottom=1in
  - right=1in
  - left=1in
highlight-style: pygments
bibliography: references.bib
citation: true
citecolor: blue
reference-location: block
link-citations: yes
tbl-colwidths: auto
tbl-cap-location: top
df-print: kable
---
# CHAPTER ONE
## INTRODUCTION
One would anticipate that the majority of emerging nations, which are still in the early stages of economic development and growth, would have high forest cover and little deforestation. This, however, has not been the case. Ghana is a lower-middle-income nation still working toward middle-income classification, yet it has already begun to see a deforestation rate comparable to that of middle-income countries. Rapid population expansion, the clearing of land for Galamsey operations, and increased domestic demand for wood for fuel, furniture, construction, and timber exports have all contributed to this trend; bush fires in the 1980s, climate change, and lax law enforcement have also had an impact.
The purpose of this paper is to establish an understanding of time series analysis on remotely sensed data. It introduces the fundamentals of time series modelling, including decomposition and autocorrelation, and models historical changes in **Vegetation Loss** in Ghana as a result of **Galamsey Operations** and the **Variability Climate Is Responsible For**, along with the causes, dangers, and environmental impact.
Galamsey, also known as "*gather them and sell*" [@owusu-nimo2018], is the term given by local Ghanaians to illegal small-scale gold mining in Ghana. The major cause of Galamsey is unemployment among the youth in Ghana [@gracia2018]. Young university graduates rarely find work, and when they do it hardly sustains them. The result is that these youth go the extra mile to earn a living for themselves and their families.
Another factor is the lack of job security. On November 13, 2009, a collapse occurred at an illegal, privately owned mine in Dompoase, in the Ashanti Region of Ghana. At least 18 workers were killed, including 13 women who worked as porters for the miners. Officials described the disaster as the worst mine collapse in Ghanaian history [@womendi2009].
Illegal mining causes damage to the land and water supply [@ansah2017]. In March 2017, the Minister of Lands and Natural Resources, Mr. John Peter Amewu, gave the Galamsey operators/illegal miners a three-week ultimatum to stop their activities or be prepared to face the law [@allotey2017]. The activities of Galamseyers have depleted Ghana's forest cover and caused water pollution, due to the crude and unregulated nature of the mining process [@gyekye].
Under the current Ghanaian constitution, it is illegal to operate as a Galamseyer, that is, to dig on land granted to mining companies as concessions or licenses, or on any other land, in search of gold. In some cases, Galamseyers are the first to discover and work extensive gold deposits before mining companies find out and take over. Galamseyers are the main indicator of the presence of gold in free metallic dust form, or they process oxide or sulfide gold ore using liquid mercury.
Between 20,000 and 50,000 people, including thousands from China, are believed to be engaged in Galamsey in Ghana, though according to the Information Minister between 200,000 and nearly 3 million people are now involved in Galamsey operations and rely on them for their livelihoods [@goldgu2017]. Their operations are mostly in the southern part of Ghana, which is believed to have substantial reserves of gold deposits, usually within the areas of large mining companies [@barenblitt2021]. As a group, they are economically disadvantaged. Galamsey settlements are usually poorer than neighboring agricultural villages. They have high rates of accidents and are exposed to mercury poisoning from their crude processing methods. Many women are among the workers, acting mostly as porters for the miners.
## Background of The Study
As Galamsey is considered an illegal activity, its operations are hidden from the eyes of the authorities, so locating them is quite tricky; but with satellite imagery, it is now possible to locate their operations and put an end to them. One of the features of Google Earth Engine is the ability to access years of satellite imagery without needing to download, organize, store, and process this information. For instance, within its satellite image collections it is now possible to access imagery back to the 1990s, allowing us to look at areas of interest on the map to visualize and quantify how much things have changed over time. With Earth Engine, Google maintains the data and offers its computing power for processing. Users can access hundreds of time series images and analyze changes across decades using GIS and R or other programming languages.
### Problem Statement
The footprint of Galamsey is spreading at a very fast rate, causing vegetation loss. Other factors accounting for vegetation loss may largely include climate change, urban and exurban development, and bush fires. But not much research has been done to tell the extent to which Galamsey causes vegetation loss versus the **Variability Climate Change Is Responsible For**. This research attempts to segregate the variability climate is responsible for in vegetation loss so as to attribute the residual variability to Galamsey and other related activities such as bush fires.
### Research Questions
To address the challenge of vegetation variability in this work, the following questions were formed:
- Are there any changes in vegetation caused by Galamsey and climate change in Ghana?
- Is there any relationship between vegetation and land surface temperature in Ghana?
### Research Objectives
The purpose is to establish an understanding of time series analysis on remotely sensed data. We will be introduced to the fundamentals of time series modeling, including decomposition, autocorrelation, and modeling historical changes.
- Perform time series analysis on satellite-derived vegetation indices.
- Estimate the extent to which Galamsey causes vegetation loss in Ghana.
- Dissociate, or single out, the variability climate is responsible for in vegetation loss.
### Significance Of The Study
There have been significant changes in vegetation cover in Ghana over the past 30 years, and these dynamics are related strongly to climatic factors such as temperature and other factors. In this study, we want to examine the effects of climatic change on Ghana's vegetation during these thirty years.
This study allows us to explore climatic differences and climate-related drivers. Additionally, it offers a chance to research how climatic variability affects the ecosystem and human health. By merging climate and vegetation variation using NDVI, LST, and EVI data to understand the relationship between vegetation and climate change under tropical climate conditions, it closes research gaps in Ghana. This study explores historical and projected vegetation and climate data by sector, impacts, key vulnerabilities, and what adaptation measures can be taken. It also provides a general overview of how climate change is affecting **Ghana**.
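The NDVI mentioned above is computed from the red and near-infrared reflectance bands as $(NIR - Red)/(NIR + Red)$; dense vegetation scores close to +1, bare or mined-out ground close to 0. A minimal sketch (in Python here, though the study itself uses R) with purely illustrative reflectance values:

```python
# NDVI = (NIR - Red) / (NIR + Red); values near +1 indicate dense
# vegetation, values near 0 indicate bare soil or mined-out land.
# The band reflectances below are illustrative, not real Ghana data.

def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    if nir + red == 0:
        return 0.0          # avoid division by zero on empty pixels
    return (nir - red) / (nir + red)

# Dense forest pixel: high NIR reflectance, low red reflectance.
forest = ndvi(nir=0.50, red=0.05)
# Bare, mined-out pixel: NIR and red reflectances are similar.
bare = ndvi(nir=0.25, red=0.20)

print(round(forest, 3))  # 0.818
print(round(bare, 3))    # 0.111
```

Tracking this index per pixel over time is what turns a stack of satellite images into the time series analyzed in later chapters.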
### Scope of The Study
### Limitation Of The Study
The goal of time series modeling is to employ the simplest feasible model to account for as much of the data as possible, developing an explanatory model of the data that does not over-fit the problem set.
Remote sensing data has additional limits that make this more difficult when dividing time series data into component pieces. It is almost certain that remotely sensed data will not provide the same level of precision.
Additionally, atmospheric factors (fog, ground moisture, cloud cover) can distort the visual findings, causing the vegetation's apparent color to shift dramatically from image to image.
### Organization of The Study
# CHAPTER TWO
## LITERATURE REVIEW
The distribution of plant species, the richness and composition of plant communities, the structure of the vegetation (such as biomass and leaf area), and how the ecosystem uses water, nutrients, and carbon are all predicted to change as a result of climate change. These plant responses to climate change will be the outcome of numerous lower-level plant reactions, such as adjustments in net plant carbon uptake, adjustments in plant water usage, adjustments in plant growth and biomass allocation, competitive interactions, and reactions to disturbances. It is challenging to predict prospective plant reactions to future climatic changes based only on theory or on laboratory and field studies due to the complexity of climatic impacts on vegetation and the length of time it takes for the responses to become obvious.
To project vegetation responses to changing climates, computer simulation models that integrate theory and experimental results are frequently used. The following are some studies done previously, where we reclassify the drivers into human activity and climate change for the empirical review, followed by non-parametric tests and time series analysis for the theoretical review.
### Empirical Review
According to studies, there is more significant change in vegetation on the earth now than there was thirty years ago, and it is distributed differently.
More than half of the changes found are attributed to the consequences of a warmer climate, with people only being responsible for about a third. Perhaps surprisingly, approximately 10% of the changes cannot be definitively linked to either the climate or to us [@alex2013].
Several models and hypotheses have been established in the environmental literature to explain the relationship between human behaviour and environmental (forest) deforestation or depletion. Recent environmental and energy economics literature focuses on the energy consumption choices made by businesses and people in developing countries [@gertler2016]. Africa's energy supply is made up mainly of fuel wood and charcoal, to a degree of about 58% [@specht2015]. Before other demands for forest goods like furniture and paper, the need for fuel wood for cooking and heating is frequently identified as the main driver of deforestation.
The causes of tropical forest decline are unclear, according to @defries2010. However, scientists were able to pinpoint the two primary causes of deforestation in the 21st century using information from satellite-based estimations in 41 different countries. The authors found a positive association between forest loss and increases in urban population as well as agricultural exports, using two methods of regression analysis. The same proof, however, was not discovered in the case of the increase in rural population. This suggests that forest loss is unavoidable in regions with high levels of human activity.
### Theoretical Review
This literature review will follow a narrative approach to gain insight into the research topics. A time series is a set of observations, each being recorded at a particular time, and the collection of such observations is referred to as time series data. The data is analysed to extract statistical information and characteristics of the data and to predict future output. As the data tends to follow a pattern over time, a generic machine learning model finds it difficult to predict appropriately; hence time series analysis and its approaches have made prediction simpler.
**What is Time Series Analysis?**
A time series in mathematics is a collection of data points that have been listed, graphed, or indexed according to time. Most commonly, it is a sequence recorded at successive, evenly spaced intervals of time. Time series are utilized in many areas of applied research that use temporal data, including statistics, signal processing, pattern identification, econometrics, mathematical finance, weather forecasting, and earthquake prediction. Time series analysis refers to techniques for deriving useful statistics and other aspects of time series data through analysis. Time series forecasting is the process of using a model to forecast future values based on values that have already been observed. Regression analysis is frequently used to assess correlations between one or more different time series; however, this method of analysis is not without its limitations.
In 1987, Makridakis and Hibon, time series analysis experts, held the M-Competition, where participants could submit their forecasts on 1001 time series drawn from economics, industry, and demographics [@confiden]. The competition revealed four key findings:
- Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
- The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
- The accuracy when various methods are being combined outperformed, on average, the individual methods being combined and does very well in comparison to other methods.
- The accuracy of the various methods depends upon the length of the forecasting horizon involved.
The time series data is visualized and analyzed to find out mainly three things: trend, seasonality, and heteroscedasticity.
**Trend:** It can be characterized as a persistently rising or falling pattern observed over time. While in a general time series the mean is an arbitrary function of time, in a stationary time series the mean of the data must be constant across time.
**Seasonality:** This term describes a cycle of events: a pattern that keeps recurring after a fixed period of time.
**Heteroscedasticity:** It is also referred to as level, and it is described as the non-constant variance from the mean computed over time.
Few approaches perform well when trends are present in the data, and even fewer perform well when the data is also seasonal. In order to choose the optimal statistical method for forecasting, trend, seasonality, and heteroscedasticity must all be taken into account.
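The decomposition into trend and seasonal components discussed above can be sketched in a few lines. This is a minimal additive decomposition in Python (the study itself uses R) on a synthetic series with an assumed period of 4; the centered moving average recovers the trend, and per-slot averages of the detrended values recover the seasonal pattern:

```python
# Minimal additive decomposition (trend + seasonality + remainder)
# on a synthetic "vegetation index" series. Illustrative data only.
period = 4                                   # assumed season length
season = [0.1, -0.05, -0.1, 0.05]            # assumed seasonal pattern
series = [0.5 + 0.01 * i + season[i % period] for i in range(24)]

# 1. Trend: centered moving average spanning one full season.
half = period // 2
trend = [None] * len(series)
for i in range(half, len(series) - half):
    window = series[i - half:i + half + 1]
    # weight the two edge points by 1/2 so the window stays centered
    trend[i] = (window[0] / 2 + sum(window[1:-1]) + window[-1] / 2) / period

# 2. Seasonal component: average the detrended values per season slot.
slots = [[] for _ in range(period)]
for i, x in enumerate(series):
    if trend[i] is not None:
        slots[i % period].append(x - trend[i])
seasonal = [sum(s) / len(s) for s in slots]

print([round(s, 3) for s in seasonal])  # recovers [0.1, -0.05, -0.1, 0.05]
```

Whatever is left after subtracting trend and seasonal estimates is the remainder, whose non-constant variance is the heteroscedasticity described above.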
**Time Series Forecasting Using Stochastic Models**
The selection of a proper model is extremely important as it reflects the underlying structure of the series and this fitted model in turn is used for future forecasting. A time series model is said to be linear or non-linear depending on whether the current value of the series is a linear or non-linear function of past observations.
In general, models for time series data can have many forms and represent different stochastic processes. There are two widely used linear time series models in the literature:
*Autoregressive (AR)* and *Moving Average (MA)* models; combining these two, the *Autoregressive Moving Average (ARMA)* and *Autoregressive Integrated Moving Average (ARIMA)* models have been proposed in much of the literature. The *Autoregressive Fractionally Integrated Moving Average (ARFIMA)* model generalizes the ARMA and ARIMA models. For seasonal time series forecasting, a variation of ARIMA, the *Seasonal Autoregressive Integrated Moving Average (SARIMA)* model, is used.
The ARIMA model and its different variations are based on the famous Box-Jenkins principle [@hipel1994], and so these are also broadly known as the Box-Jenkins models.
Linear models have drawn much attention due to their relative simplicity in understanding and implementation. However, many practical time series show non-linear patterns; for example, as mentioned by R. Parrelli, non-linear models are appropriate for predicting volatility changes in economic and financial time series. Considering these facts, various non-linear models have been suggested in the literature. Some of them are the famous Autoregressive Conditional Heteroskedasticity (ARCH) model and its variations like Generalized ARCH (GARCH) and Exponential Generalized ARCH (EGARCH), the Threshold Autoregressive (TAR) model, the Non-linear Autoregressive (NAR) model, and the Non-linear Moving Average (NMA) model. All of these methods consider trend, seasonality, or heteroscedasticity to predict future output; based on the findings from data analysis, the time series must be decomposed into trend and seasonality [@zhang1998].
**Exponential Smoothing Models:**
Time-series data relies on the assumption that the observation at a certain point in time depends on previous observations. The previous observations are given weights as they contribute to the future prediction. The weighting process uses a parameter called '**theta**'. To find the best possible value for theta, we minimize the sum of squared errors between the actual and predicted values of the previous observations. Using this process we can predict the next value, but to predict more than one value ahead this process does not contribute much, as every further prediction will be the same as the previous one.
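The procedure described above can be sketched directly: smooth the series with weight theta, score each candidate theta by its one-step-ahead sum of squared errors, and forecast every future step with the final smoothed level. A Python sketch with synthetic, illustrative data:

```python
# Simple exponential smoothing: the smoothing weight ("theta" in the
# text above) is chosen by minimizing the sum of squared one-step-ahead
# errors; all multi-step forecasts equal the last smoothed level.

def ses_sse(series, theta):
    """One-step-ahead sum of squared errors for a given weight."""
    level = series[0]                 # initialize at the first value
    sse = 0.0
    for x in series[1:]:
        sse += (x - level) ** 2       # predict the current level
        level = theta * x + (1 - theta) * level
    return sse

series = [3.0, 3.4, 3.1, 3.6, 3.8, 3.5, 3.9, 4.1, 3.8, 4.2]

# Crude grid search over theta in (0, 1].
best_theta = min((t / 100 for t in range(1, 101)),
                 key=lambda t: ses_sse(series, t))

# The forecast for every future step is the final smoothed level.
level = series[0]
for x in series[1:]:
    level = best_theta * x + (1 - best_theta) * level
print(best_theta, round(level, 3))
```

Note that the flat multi-step forecast is exactly the limitation the paragraph points out: the method cannot extrapolate beyond its last level.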
To understand the methods and to evaluate different models, a few concepts like *stationarity* and *differencing* must be understood. Both concepts help in making the core ideas of the methods easy to interpret.
**Stationarity:**
Stationarity refers to a random process that creates a time series whose mean and distribution are constant through time; the distribution depends only on the length of a time window, not on its location in time (Manuca, R. and Savit, R., 1996). If the distribution is the same over different time windows, we have strong stationarity; if only the mean and variance are similar, it is weak stationarity. Irrespective of strong or weak, stationarity helps build classes of models such as Autoregression (AR), Moving Average (MA), and ARIMA (Witt, A., Kurths, J. and Pikovsky, A., 1998).
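A crude check of weak stationarity, as defined above, is to split the series in half and compare the means: a stationary series shows almost no shift, a trending one a large shift. A Python sketch on synthetic data:

```python
# Split-half mean comparison as a rough weak-stationarity check.
# White noise is stationary; adding a deterministic trend is not.
import random
random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

white_noise = [random.gauss(0, 1) for _ in range(400)]       # stationary
trended = [0.02 * i + e for i, e in enumerate(white_noise)]  # not

noise_shift = abs(mean(white_noise[200:]) - mean(white_noise[:200]))
trend_shift = abs(mean(trended[200:]) - mean(trended[:200]))
print(round(noise_shift, 2), round(trend_shift, 2))
```

In practice formal tests (e.g. unit-root tests) are used, but the split-half comparison makes the definition concrete.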
An MA(q) process is always stationary, irrespective of the values of the MA parameters \[23\]. The conditions regarding stationarity and invertibility of AR and MA processes also hold for an ARMA process. An ARMA(p, q) process is stationary if all the roots of the characteristic equation $\phi (L) = 0$ lie outside the unit circle. Similarly, if all the roots of the lag equation $\theta (L) = 0$ lie outside the unit circle, then the ARMA(p, q) process is invertible and can be expressed as a pure AR process.
**Differencing:**
This concept is used to make trending and seasonal data stationary. Differencing is the subtraction of the previous observation from the current observation. It helps in making the mean constant (Dickey, D.A. and Pantula, S.G., 1987).
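The subtraction just described is a one-liner; a Python sketch showing that first differencing turns a deterministic linear trend into a constant series:

```python
# First differencing: subtract the previous observation from the
# current one. A linear trend differences to a constant.
def difference(series, lag=1):
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [2.0 + 0.5 * i for i in range(8)]   # deterministic upward trend
print(difference(trend))                     # [0.5, 0.5, ..., 0.5]
```

The series also shortens by `lag` observations each time it is differenced, which is why the order of differencing is kept as low as possible.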
**Autoregressive models (AR):**
AR models work on a concept called lags: the forecast of a series is based solely on the past values of the series (Cryer, J.D., 1986). The formula for an autoregression of order one, AR(1), is $\displaystyle y_{t} = \omega + \phi_{1} Y_{t-1} + e_{t}$, which is stationary when $|\phi_{1}| < 1$, with constant mean $\displaystyle \mu = \frac{\omega}{1-\phi_{1}}$ and constant variance $\displaystyle \gamma_{0} = \frac{\sigma^{2}}{1-\phi_{1}^{2}}$, where $y_{t}$ = target, $\omega$ = intercept, $\phi_{1}$ = coefficient, $Y_{t-1}$ = lagged target, and $e_{t}$ = error.
It depends only on one lag in the past and is also called an AR model of order one (Shibata, R., 1976). Autoregressive models are also known as long-memory models, as they keep the memory of all lags back to the initial starting point when calculating their value. If there was any shock in the past that led to fluctuations in the data, it will have an effect on the present value, which makes the model quite sensitive to shocks (Shibata, R., 1976).
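The stationary mean formula quoted above can be checked by simulation; a Python sketch with illustrative parameters ($\omega = 1$, $\phi_{1} = 0.6$, so $\mu = 2.5$):

```python
# Simulating y_t = omega + phi * y_{t-1} + e_t and checking the
# stationary mean mu = omega / (1 - phi). Illustrative parameters.
import random
random.seed(42)

omega, phi, sigma = 1.0, 0.6, 1.0
mu = omega / (1 - phi)                       # theoretical mean = 2.5

y, draws = mu, []                            # start at the mean
for _ in range(20000):
    y = omega + phi * y + random.gauss(0, sigma)
    draws.append(y)

sample_mean = sum(draws) / len(draws)
print(round(mu, 2), round(sample_mean, 2))
```

With 20,000 draws the sample mean lands very close to the theoretical 2.5, illustrating that the process fluctuates around a constant level when $|\phi_{1}| < 1$.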
**Moving Average (MA):**
The moving average model forecasts a series based on the past errors in the series, called error lags. The formula for the moving average method, MA(1), is given as: $y_{t} = \omega + \theta_{1} e_{t-1} + e_{t}$
Here all the abbreviations are the same as in the AR model formula, except $e_{t-1}$ = previous error.
A question arises: since this method uses the error of the previous value, there will be no previous value when it reaches the first point. To overcome this, the average of the series is taken as the value before the starting point. These are short-memory models, as errors in the distant past will not have much effect on the future value (Hunter, J.S., 1986).
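The short-memory claim can be checked numerically: for an MA(1) process the autocorrelation is $\theta_{1}/(1+\theta_{1}^{2})$ at lag 1 and essentially zero from lag 2 onward. A Python sketch with an illustrative $\theta_{1} = 0.8$:

```python
# MA(1): y_t = omega + theta * e_{t-1} + e_t. The theoretical ACF is
# theta / (1 + theta**2) at lag 1 and 0 beyond. Illustrative parameters.
import random
random.seed(7)

omega, theta = 0.0, 0.8
e_prev, ys = random.gauss(0, 1), []
for _ in range(50000):
    e = random.gauss(0, 1)
    ys.append(omega + theta * e_prev + e)
    e_prev = e

def acf(xs, k):
    """Sample autocorrelation at lag k."""
    m = sum(xs) / len(xs)
    c0 = sum((x - m) ** 2 for x in xs)
    ck = sum((xs[i] - m) * (xs[i + k] - m) for i in range(len(xs) - k))
    return ck / c0

print(round(theta / (1 + theta**2), 3))   # theoretical lag-1 ACF: 0.488
print(round(acf(ys, 1), 3), round(acf(ys, 2), 3))
```

The sample ACF matches the theoretical value at lag 1 and drops to near zero at lag 2, which is exactly the "short memory" behavior contrasted with the AR model above.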
**Comparing AR method with MA method:**
Let us focus on the two methods used in the early years of time series forecasting and compare the performance of each model on a particular task. (Godfrey, L.G., 1978) tested against general autoregressive and moving average error models where the regressors include lagged dependent variables, explaining the order of the error process under the alternate hypothesis using the Lagrange multiplier test (Silvey, S.D., 1959). As per the tests, the errors of both models were similar, but the different constraints under which the tests were performed must also be considered. The paper concludes that the outcome of the models' performance depends on which estimate is chosen as the null or alternate hypothesis.
In addition, a paper by (Baltagi, B.H. and Li, Q., 1995) demonstrates the comparison of AR and MA models using the Burke, Godfrey, and Termayne test applied to the error component model. They explain that this test was chosen because it is simple to implement, requiring only within residuals or OLS residuals (Baltagi, B.H. and Li, Q., 1995). The outcome of the experiment was that when the test used within residuals the AR model performed well but had problems, while with OLS residuals the MA model's performance was good. They conclude that MA will perform much better when the parameters are changed.
The findings of the two papers were quite different, but one cannot prove either model to be better, as the performance depends on the parameters used in the model. Each model is unique to its use case, and it is up to the user to choose accordingly based on the data.
**Autocorrelation and Partial Autocorrelation Functions (ACF and PACF)**
To determine a proper model for a given time series, it is necessary to carry out ACF and PACF analysis. These statistical measures reflect how the observations in a time series are related to each other. For modeling and forecasting purposes it is often useful to plot the ACF and PACF against consecutive time lags; these plots help in determining the order of the AR and MA terms. For a time series $\{x_{t}, t = 0, 1, 2, \ldots\}$ the autocovariance \[21, 23\] at lag $k$ is defined as:

$$\gamma_{k} = \operatorname{Cov}(x_{t}, x_{t+k}) = E\left[(x_{t}-\mu)(x_{t+k}-\mu)\right],$$

with the autocorrelation coefficient at lag $k$ given by $\rho_{k} = \gamma_{k}/\gamma_{0}$, where $\mu$ is the mean of the time series, i.e. $\mu = E\left[x_{t}\right]$. The autocovariance at lag zero, $\gamma_{0}$, is the variance of the time series. From the definition it is clear that the autocorrelation coefficient $\rho_{k}$ is dimensionless and so is independent of the scale of measurement. Also, clearly $-1 \leq \rho_{k} \leq 1$. Statisticians Box and Jenkins \[6\] termed $\gamma_{k}$ the theoretical Autocovariance Function (ACVF) and $\rho_{k}$ the theoretical Autocorrelation Function (ACF).
Another measure, known as the Partial Autocorrelation Function (PACF), is used to measure the correlation between an observation $k$ periods ago and the current observation, after controlling for observations at intermediate lags (i.e. at lags $< k$) \[12\]. At lag 1, PACF(1) is the same as ACF(1). The detailed formulae for calculating the PACF are given in \[6, 23\].
Normally, the stochastic process governing a time series is unknown and so it is not possible to determine the actual or theoretical ACF and PACF values. Rather these values are to be estimated from the training data, i.e. the known time series at hand. The estimated ACF and PACF values from the training data are respectively termed as sample ACF and PACF \[6, 23\].
As given in \[23\], the most appropriate sample estimate for the ACVF at lag $k$ is

$$c_{k} = \frac{1}{N}\sum_{t=1}^{N-k}\left(x_{t}-\bar{x}\right)\left(x_{t+k}-\bar{x}\right),$$

with the sample ACF given by $r_{k} = c_{k}/c_{0}$. The ACF plot is useful in determining the type of model to fit to a time series of length $N$. Since the ACF is symmetrical about lag zero, it is only required to plot the sample ACF for positive lags, from lag one onward to a maximum lag of about $N/4$. The sample PACF plot helps in identifying the maximum order of an AR process.
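The identification logic above can be demonstrated numerically: on an AR(1) series the sample ACF decays geometrically, while the PACF (computed here with the Durbin-Levinson recursion) cuts off after lag 1. A Python sketch with an illustrative $\phi = 0.7$:

```python
# Sample ACF, and PACF via the Durbin-Levinson recursion, on a
# simulated AR(1) series: ACF decays geometrically, PACF cuts off
# after lag 1 -- which is how these plots identify model order.
import random
random.seed(1)

phi = 0.7
y, ys = 0.0, []
for _ in range(50000):
    y = phi * y + random.gauss(0, 1)
    ys.append(y)

m = sum(ys) / len(ys)
c0 = sum((x - m) ** 2 for x in ys)

def rho(k):
    """Sample autocorrelation at lag k."""
    return sum((ys[i] - m) * (ys[i + k] - m)
               for i in range(len(ys) - k)) / c0

def pacf(max_lag):
    """Durbin-Levinson: returns [phi_11, phi_22, ...]."""
    out, prev = [], []
    for k in range(1, max_lag + 1):
        num = rho(k) - sum(prev[j] * rho(k - 1 - j) for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho(j + 1) for j in range(k - 1))
        pkk = num / den
        prev = [prev[j] - pkk * prev[k - 2 - j] for j in range(k - 1)] + [pkk]
        out.append(pkk)
    return out

print([round(rho(k), 2) for k in (1, 2, 3)])   # ~[0.70, 0.49, 0.34]
print([round(p, 2) for p in pacf(3)])          # ~[0.70, 0.00, 0.00]
```

Reading the two plots together (geometric ACF decay plus a PACF spike only at lag 1) is exactly the Box-Jenkins signature of an AR(1) process.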
**Autoregressive Moving Average (ARMA) model:**
The ARMA model is a combination of the AR and MA models. The AR model of order one, when traced back to its starting point, has an infinite moving average representation (Choi, B., 2012). In an ARMA model, p and q have to be defined, where p (the AR order) corresponds to the number of significant terms in the PACF and q (the MA order) to the number of significant terms in the ACF.
To determine the optimal values for p and q there are two ways:
- Plotting patterns in correlation
- Automatic selection techniques
**Plotting patterns in correlation:**
[Autocorrelation function (ACF):]{.underline}
It is the correlation between the observations at the current time stamp and observations at a previous time stamp (Hagan, M.T. and Behr, S.M., 1987).
[Partial autocorrelation function (PACF):]{.underline}
The correlation between the observations at two different time stamps, after accounting for the correlation both have with the observations at the intermediate time stamps (Hagan, M.T. and Behr, S.M., 1987).
[Automatic selection techniques:]{.underline}
There are three commonly used techniques for the automatic selection of a time series model:
- *Minimum information criterion (MINIC):* This builds multiple combinations of models across a grid search of AR and MA terms. It then finds the model with the lowest Bayesian information criterion (Stadnytska, T., Braun, S. and Werner, J., 2008).
- *Squared canonical correlations (SCAN):* This looks at the correlation matrix of the data and compares it with its lags. It then examines the eigenvalues of the correlation matrix to find the combination of AR and MA terms for which the canonical correlation is probably 0, choosing the pair where convergence is quickest (Stadnytska, T., Braun, S. and Werner, J., 2008).
- *The extended sample autocorrelation function (ESACF):* As is known, the AR and MA parts are related. Essentially this filters out the AR terms until only the MA piece is left; the process is repeated until the fewest AR terms and the maximum MA terms remain (Stadnytska, T., Braun, S. and Werner, J., 2008).
It is up to the individual to choose whichever of these methods helps them find the optimal values of p and q for better model performance.
**Autoregressive Integrated Moving average (ARIMA):**
To understand the ARIMA model, we need to understand the ARMA model, as ARIMA is just an extension of ARMA: the data must be made stationary before being fed to the model, which is done through differencing. ARIMA models are mathematically written as ***ARIMA(p,d,q)***, where p and q are the same as in the ARMA model and ***d*** = the number of first differences (Yu, G. and Zhang, C., 2004).
**Seasonal Autoregressive Integrated Moving Average (SARIMA):**
SARIMA models were introduced to handle seasonality in the data. Seasonality is different from stationarity; it can be handled through differencing up to some extent, but seasonal correlations cannot be eliminated completely. SARIMA models are mathematically written as SARIMA$(p,d,q)(P,D,Q)^{s}$.
where P = number of seasonal AR terms, D = number of seasonal differences, Q = number of seasonal MA terms, and s = length of the season.
Removing seasonality will help the model perform better, but getting rid of seasonality in data is a difficult task.
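Seasonal differencing, the D term above, subtracts the observation one season back, $y_{t} - y_{t-s}$. A Python sketch on synthetic data with an assumed season length of s = 4:

```python
# Seasonal differencing y_t - y_{t-s} removes a repeating seasonal
# pattern; a linear trend is left as a constant drift. Synthetic data.
s = 4
season = [2.0, -1.0, -2.0, 1.0]
series = [10 + 0.5 * t + season[t % s] for t in range(12)]

sdiff = [series[t] - series[t - s] for t in range(s, len(series))]
print(sdiff)   # constant 2.0: seasonal pattern removed, trend now drift
```

One seasonal difference removed the seasonal pattern entirely here because the pattern was exactly periodic; real series usually need regular differencing as well, which is why SARIMA carries both d and D.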
**Comparing ARIMA method with SARIMA method:**
@library2015 investigated long-term runoff forecasting in the United States, contrasting ARIMA and SARIMA. The outcomes demonstrated that SARIMA models outperformed ARIMA models. However, it was discovered that SARIMA models were extremely sensitive, and even a small change in a parameter would have a negative impact on the model's performance. ARIMA and SARIMA models have also been applied by (Wang, S., Li, C. and Lim, A., 2019) from the perspectives of linear system analysis, spectral analysis, and digital filtering. The researchers were obliged to go beyond these models for greater performance after it was established that ARIMA and SARIMA both performed poorly. Although they claimed their ARMA-SIN model was superior to the ARIMA and SARIMA models, they also acknowledged that its ideas were more challenging to study and comprehend.
The results of @library2015 demonstrated that SARIMA is superior; however, this assertion is incongruent with the results of (Wang, S., Li, C. and Lim, A., 2019).
The choice of approach must be based on the data: if analysis shows a trend, ARIMA should be used; if the data are seasonal, SARIMA is the better choice.
[[**ADVANTAGES AND DISADVANTAGES OF TIME SERIES FORECASTING**]{.underline}]{.smallcaps}
**Advantages of time series forecasting:**
- Time series forecasting offers high accuracy combined with simplicity.
- It can be used to analyze how changes in the chosen data point correlate with changes in other variables over the same time span.
- Statistical techniques have been developed to analyze time series in such a way that the factors influencing the fluctuation of the series may be identified and handled.
- It can give good output with few variables. Where regression models fail with few variables, time series models still work effectively.
**Disadvantages of time series forecasting:**
- Time series models can easily be overfitted, which leads to misleading results.
- It works well for short-term forecasting but not for long-term forecasting.
- It is sensitive to outliers; if outliers are not handled properly, predictions can be wrong.
- The different elements that drive the fluctuations of a series cannot be fully accounted for by time series analysis.
**Forecast Performance Measures**
While applying a particular model to a real or simulated time series, the raw data is first divided into two parts (**Training Set and Test Set**). The observations in the training set are used for constructing the desired model. Often a small sub-part of the training set is held back for validation purposes and is known as the **Validation Set**. Sometimes the data is preprocessed by normalizing it or by taking logarithmic or other transforms; one famous technique is the Box-Cox transformation \[23\]. Once a model is constructed, it is used for generating forecasts. The test set observations are kept for verifying how accurately the fitted model forecasts these values. If necessary, an inverse transformation is applied to the forecast values to convert them back to the original scale. To judge the forecasting accuracy of a particular model, or to evaluate and compare different models, their relative performance on the test dataset is considered.
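The train/test split and a variance-stabilizing transform described above can be sketched as follows; the built-in `AirPassengers` series and the 24-month holdout are illustrative assumptions:

```r
n_test <- 24                               # hold out the last two years
n      <- length(AirPassengers)
train  <- head(AirPassengers, n - n_test)  # used to fit the model
test   <- tail(AirPassengers, n_test)      # used only for accuracy checks
# Log transform (Box-Cox with lambda = 0) stabilizes the growing variance;
# forecasts are mapped back to the original scale with exp().
log_train <- log(train)
```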
Due to the fundamental importance of time series forecasting in many practical situations, proper care should be taken while selecting a particular model. For this reason, various performance measures are proposed in literature \[3, 7, 8, 9, 24, 27\] to estimate forecast accuracy and to compare different models. These are also known as performance metrics \[24\]. Each of these measures is a function of the actual and forecast values of the time series.
**Description of Various Forecast Performance Measures**
In each of the forthcoming definitions, $y_{t}$ is the actual value, $f_{t}$ is the forecast value, $e_{t} = y_{t} - f_{t}$ is the forecast error and $n$ is the size of the test set. Also, $\displaystyle \bar{y} = \frac{1}{n}\sum_{t=1}^{n}y_{t}$ is the test mean and $\displaystyle \sigma^{2} = \frac{1}{n-1}\sum_{t=1}^{n}(y_{t}-\bar{y})^{2}$ is the test variance.
**The Mean Forecast Error** $\displaystyle MFE = \frac{1}{n}\sum_{t=1}^{n}e_{t}$
- It is a measure of the average deviation of forecast values from actual ones.
- It shows the direction of error and is thus also termed the Forecast Bias.
- In MFE, the effects of positive and negative errors cancel out, so there is no way to know their exact magnitudes.
- A zero MFE does not mean that forecasts are perfect, i.e. contain no error; it only indicates that forecasts are on proper target on average.
- MFE does not penalize extreme errors.
- It depends on the scale of measurement and is also affected by data transformations.
- For a good forecast, i.e. minimum bias, the MFE should be as close to zero as possible.
**The Mean Absolute Error** $\displaystyle MAE = \frac{1}{n}\sum_{t=1}^{n}|e_{t}|$

- It measures the average absolute deviation of forecast values from the original ones.
- It is also termed the Mean Absolute Deviation (MAD).
- It shows the magnitude of the overall error incurred by forecasting.
- In MAE, the effects of positive and negative errors do not cancel out.
- Unlike MFE, MAE does not provide any idea about the direction of errors.
- For a good forecast, the obtained MAE should be as small as possible.
- Like MFE, MAE depends on the scale of measurement and on data transformations.
- Extreme forecast errors are not penalized by MAE.
**The Mean Squared Error** $\displaystyle MSE = \frac{1}{n}\sum_{t=1}^{n}e^{2}_{t}$

- It is a measure of the average squared deviation of forecast values from actual ones.
- As opposite-signed errors do not offset one another here, MSE gives an overall idea of the error incurred during forecasting.
- It penalizes extreme errors incurred while forecasting.
- MSE emphasizes that the total forecast error is strongly affected by large individual errors, i.e. large errors are far more expensive than small errors.
- MSE does not provide any idea about the direction of the overall error.
- MSE is sensitive to changes of scale and to data transformations.
- Although MSE is a good measure of overall forecast error, it is not as intuitive and easily interpretable as the measures discussed before.
**The Root Mean Squared Error** $\displaystyle RMSE = \sqrt{MSE} = \sqrt {\frac{1}{n}\sum_{t=1}^{n}e^{2}_{t}}$

- RMSE is simply the square root of the MSE.
- All the properties of MSE hold for RMSE as well.
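The four measures defined above can be computed directly in base R; the actual and forecast values below are made-up illustrative numbers, chosen so that the MFE is exactly zero while the other measures are not:

```r
y <- c(10, 12, 14, 13, 15)   # actual test-set values (illustrative)
f <- c(11, 11, 15, 13, 14)   # forecast values (illustrative)
e <- y - f                   # forecast errors e_t = y_t - f_t
MFE  <- mean(e)              # 0: opposite-signed errors cancel out
MAE  <- mean(abs(e))         # 0.8
MSE  <- mean(e^2)            # 0.8
RMSE <- sqrt(MSE)            # about 0.894
c(MFE = MFE, MAE = MAE, MSE = MSE, RMSE = RMSE)
```

Note how a zero MFE coexists with nonzero MAE and RMSE, exactly as the bullet on "a zero MFE does not mean that forecasts are perfect" states.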
```{r,warning=FALSE,message=FALSE,include=FALSE}
#| label: load-pkgs
#| code-summary: "Packages"
#| message: false
library(openintro) # for data
library(tidyverse) # for data wrangling and visualization
library(knitr) # for tables
library(broom) # for model summary
library(imputeTS)
library(dplyr)
library(kableExtra)
library(forecast)
library(psych)
library(viridis)
library(ggridges)
library('sf')
library(tibble)
library(lubridate)
# if(!require("pacman")){install.packages("pacman")}
# pacman::p_load(char = c('rgee','reticulate','raster','tidyverse',
# 'dplyr','sf','forcats','reticulate',
# 'rgee', 'tibble', 'st', 'lubridate', 'imputeTS','leaflet', 'ggplot2'),
# install = F, update = F, character.only = T)
```
# CHAPTER THREE
## METHODOLOGY
A time series is a set of observations made in a particular order over a period of time. Because time series data points are gathered at close intervals, there is a chance of correlation between observations. To help machine learning classifiers work with time series data, we provide several new tools. We first contend that local features or patterns in a time series can be found and combined to address time-series classification problems. We then suggest a method to discover patterns that are helpful for classification, and combine these patterns into computable classification rules. To mask low-quality pixels, we will first collect data from Google Earth Engine in order to select NDVI and EVI values and climate change data.
Instead of analyzing the imagery directly, we will summarize the mean NDVI and EVI values. This shortens the analysis time while still producing an attractive and useful map. We will apply a smoothing strategy based on an ARIMA model to handle cells that lack NDVI and EVI values for a particular month. Once NA values have been eliminated, the time series will be decomposed to remove seasonality before the normalized data is fitted with a linear model. After extracting the linear trend, we will classify our data and map it.
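The NA-filling step can be sketched with the imputeTS package already loaded in this document; the series and the positions of the missing cells below are invented for illustration:

```r
library(imputeTS)
# Monthly EVI-like values with two masked (NA) cells.
x <- ts(c(0.42, 0.45, NA, 0.51, 0.49, NA, 0.44), frequency = 12)
# Kalman smoothing on the state-space form of an ARIMA model fitted by
# auto.arima(); this is the "smoothing strategy using an ARIMA function".
x_filled <- na_kalman(x, model = "auto.arima")
anyNA(x_filled)  # FALSE once imputation succeeds
```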
## Research Design
This study uses a quantitative approach. Findings and conclusions rely heavily on statistical methods and reliable time series models rather than subjective judgment.
### Data Representation
The Republic of Ghana, a nation in West Africa, will serve as the location for the experimental plots of this study. It shares borders with the Ivory Coast in the west, Burkina Faso in the north, and Togo in the east. To the south it borders the Gulf of Guinea and the Atlantic Ocean. Ghana covers 238,535 km² (92,099 sq mi) and is made up of a variety of biomes, from tropical rainforests to coastal savannas. With a population of over 31 million, Ghana is the second-most populous nation in West Africa, behind Nigeria. The nation's capital and largest city is Accra; other important cities include Kumasi, Tamale, and Sekondi-Takoradi.
### Assumptions
```{r,warning=FALSE,message=FALSE}
#| label: tbl-data-frame
#| tbl-cap: "Collected from Google Earth Engine"
Data_Frame <- read.csv("Data/Time_Series.csv")
Time_Serie <- read.csv("Data/Time_Series.csv") %>%
  select(year, NDVI, EVI, Precipitation, MinTemperature, MaxTemperature) %>%
  group_by(year) %>%
  summarise(across(everything(), median))  # summarise_each()/funs() are deprecated
kable(Time_Serie,longtable = T, booktabs = T)%>%
add_header_above(c(" ","Vegetation Indices" = 2,"Climate Change"= 3))%>%
kable_styling(latex_options = c("repeat_header"))
```
### Exploratory Data Analysis (Summary statistics)
```{r,message=FALSE,warning=FALSE}
#| label: tbl-summary-statistics
#| tbl-cap: "Summary statistics for Climate Data and Vegetation Loss in Ghana"
Describe <- describe(Time_Serie%>%select(-year))
kable(Describe,longtable = T, booktabs = T)%>%
kable_styling(latex_options = "scale_down")
```
```{r}
#| label: fig-pairs-plot
#| fig-cap: "Correlation Between The Variables"
pairs(Time_Serie,bg = c("red", "green", "blue"),pch = 21)
```
```{r}
summary(lm(EVI~Precipitation+MinTemperature+MaxTemperature,Time_Serie))
```
```{r}
#| label: tbl-anova
#| tbl-cap: "ANOVA Table for Climate Data and Vegetation Loss in Ghana"
fit_lm <- lm(EVI ~ Precipitation + MinTemperature + MaxTemperature, Time_Serie)  # avoid masking stats::lm
kable(anova(fit_lm), booktabs = TRUE) %>%
kable_styling(latex_options = c("repeat_header"))
```
### Non-Parametric Analysis
### **Time-series analysis**
[**Steps involved in Box-Jenkins approach**]{.underline}
![FlowChart](Images/Steps.png "Steps involved in the Box-Jenkins approach"){alt="FlowChart" fig-align="center"}
# CHAPTER FOUR
## Analysis and Finding
```{r,include=FALSE}
#| label: fig-time-series-decomposition
#| fig-cap: "Time Series and Decomposition"
# Convert data to time series.
Time_Series <- ts(data = Time_Serie$EVI, start = c(2001, 1), end = c(2019, 11), frequency = 12)
plot(Time_Series)
tdx.dcp <- stl(Time_Series, s.window = 'periodic')
plot(tdx.dcp)
Tt <- trendcycle(tdx.dcp)
St <- seasonal(tdx.dcp)
Rt <- remainder(tdx.dcp)
plot(Rt)
```
Before building an ARIMA model we checked whether the series is stationary; that is, we needed to determine that the mean and variance of the time series are constant and do not depend on time. Here we look at a couple of methods for checking stationarity. If the time series exhibits seasonality, a trend, or a change point in the mean or variance, these influences need to be removed or accounted for. The Augmented Dickey-Fuller (ADF) t-statistic test checks whether the series has a unit root (a series with a trend will have a unit root and produce a large p-value).
```{r}
#| label: fig-ACF
#| fig-cap: "ACF Plot and PACF plot analysis for sample between 2000 and 2020:"
#| fig-subcap:
#| - "Stationary Signal"
#| - "Trend Signal"
#| layout-ncol: 2
#| column: page-right
# The Stationary Signal and ACF
plot(Rt,col= "red", main = "Stationary Signal")
acf(Rt, lag.max = length(Rt),
xlab = "lag", ylab = 'ACF', main = '')
# The Trend Signal and ACF
plot(Tt,col= "red",main = "Trend Signal")
acf(Tt, lag.max = length(Tt),
xlab = "lag", ylab = "ACF", main = '')
```
**Discuss:** The initial ACF plot shows that almost all lags before lag 25 are significant, so the series needs to be differenced before any further analysis. The seasonality is clearly visible even in the ACF plot.
**Dickey-Fuller Test and Plot**
```{r,warning=FALSE,message=FALSE}
tseries::adf.test(Tt)
```
**Discuss:** The ADF test confirms that the series is stationary, as the p-value \< 0.05, so it can be used for further analysis. This is after double differencing. It is noteworthy that the stationary signal generates few significant lags beyond the ACF's confidence interval (blue dotted line), whereas practically all lags of the time series with a trend exceed that interval. Qualitatively, we can infer from the ACFs that the stationary signal's lags die out while the trended signal's later lags exceed the confidence interval, so the latter is not stationary.
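As a sketch of what the double differencing mentioned above looks like in R (`Tt` is the trend component extracted earlier; the choice of two differences follows the text):

```r
# Difference the trend component twice, then re-test for a unit root.
Tt2 <- diff(Tt, differences = 2)
tseries::adf.test(Tt2)  # a small p-value rejects the unit-root null
```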
### Specification of the Model
We can specify the SARMA model as SARMA(0,0,0) based on the previous study.
If no differencing has been applied, the middle parameter can be set to zero. The first component of the multiplication is the non-seasonal part, with the AR order read from the PACF and the MA order from the ACF.
The seasonal component of the SARMA model follows a similar approach. Since this model has a larger AIC value than the prior SARMA model, it cannot be used. Because the seasonality pattern is not guaranteed, we can omit the seasonal element and test whether the AIC value improves with an ARMA model combined with a GARCH model. GARCH is short for Generalized Auto-Regressive Conditional Heteroskedasticity. GARCH models are commonly used to estimate returns for stocks and other financial instruments where trends are unknown; in our case study we test them in order to improve the AIC values. Using the rugarch package, we will apply the seasonal ARMA-GARCH model.\
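A hedged sketch of the ARMA-GARCH specification with rugarch; the ARMA(2,3) order is carried over from the discussion of the fitted ARMA model, and the GARCH(1,1) order is a common default rather than a tuned choice:

```r
library(rugarch)
spec <- ugarchspec(
  mean.model     = list(armaOrder = c(2, 3), include.mean = TRUE),  # ARMA(2,3) mean
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1))     # GARCH(1,1) variance
)
fit <- ugarchfit(spec = spec, data = as.numeric(Time_Series))
infocriteria(fit)  # first row is Akaike; compare with the ARMA-only AIC
```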
Since the data is stationary, we proceed to find the p and q values from the ACF and PACF plots, or use auto.arima() in R.
```{r}
auto.arima(Time_Series)
```
**Discuss:** As we are not differencing, we can consider ARMA(2,0,3) as the best model; these are also the best *p* and *q* values found from the ACF and PACF plots.
**Residual Analysis**
### **Modeling and Parameter estimation**
\
The **ARIMA(p, d, q)** model, where *p* is read from the PACF, *d* is the number of differences, and *q* is read from the ACF, has the format below for its parameters. Coefficients for various models:
**Discuss:** Based on the different models, ARIMA(2,2,5) had the lowest AIC value and the smallest sigma^2 and is therefore the best model for the given time series. The time series plot of the residuals is shown below.
**Residual Analysis**
**Residual Plot**
**Shapiro Test**
**Ljung-Box**
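The diagnostics named above can be sketched as follows; `fit` is assumed to be the model returned by `auto.arima()` earlier, and the lag of 12 in the Ljung-Box test is an assumption suited to monthly data:

```r
fit <- auto.arima(Time_Series)
res <- residuals(fit)
plot(res, main = "Residuals")                 # residual plot: look for structure
shapiro.test(res)                             # H0: residuals are normally distributed
Box.test(res, lag = 12, type = "Ljung-Box",   # H0: residuals are uncorrelated
         fitdf = length(coef(fit)))           # degrees-of-freedom correction
```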
**Time-series Forecasting**
**Discuss:** The plot shows the forecast for the next 20 values, indicated by the blue region.
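A minimal sketch of producing such a forecast; the 20-step horizon matches the plot described above, while the model itself is whatever `auto.arima()` selects:

```r
fit <- auto.arima(Time_Series)
fc  <- forecast(fit, h = 20)  # point forecasts plus 80% and 95% intervals
plot(fc)                      # the shaded region is the prediction interval
head(fc$mean)                 # the first few point forecasts
```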
```{r}
#| label: tbl-lm
#| tbl-cap: "Linear regression model for predicting EVI from Time"
tdx.ns <- data.frame(time = c(1:length(Time_Series)), trend = Time_Series - tdx.dcp$time.series[,1])
summary <- summary(lm(formula = trend ~ time, data = tdx.ns))
summary
```
```{r}
plot(tdx.ns)
abline(a = summary$coefficients[1,1], b = summary$coefficients[2,1], col = 'blue')
```
```{r,warning=FALSE,include=FALSE}
library(ggpubr)
ggdensity(Time_Series,fill = "#0073C2FF",color ="#0073C2FF",add = "mean",rug = TRUE)
```
```{r}
plot(evi.hw <- forecast::hw(y = Time_Series, h = 12, damped = T))
```
VAR
```{r,warning=FALSE}
require(tidyverse)
require(tidymodels)
require(data.table)
require(tidyposterior)
require(tsibble) #tsibble for time series based on tidy principles
require(fable) #for forecasting based on tidy principles
require(ggfortify) #for plotting timeseries
require(forecast) #for forecast function
require(tseries)
require(chron)
require(lubridate)
require(directlabels)
require(zoo)
require(lmtest)
require(TTR) #for smoothing the time series
require(MTS)
require(vars)
require(fUnitRoots)
require(lattice)
require(grid)
```
```{r}
# Converting the data into a time series object
Var_ts <- ts(
Time_Serie
)
head(Var_ts)
```
```{r}
plot(Var_ts)
```
```{r}
theme_set(theme_bw())
autoplot(Var_ts) +
ggtitle("Time Series Plot of the Var_ts Time Series") +
theme(plot.title = element_text(hjust = 0.5)) #for centering the text
```
```{r}
plot.ts(Var_ts)
```
```{r}
# Lag order identification
# Lag order identification: vars::VARselect reports several information
# criteria for candidate lag orders and is a convenient way to identify the
# correct lag order for the VAR model.
vars::VARselect(Var_ts,
                type = "none", # no deterministic regressors, because the series was made stationary by differencing above
                lag.max = 10)  # highest lag order to consider
```
```{r}
# Creating a VAR model with vars
var.a <- vars::VAR(Var_ts,
              lag.max = 1, # highest lag order for lag length selection according to the chosen ic
ic = "AIC", #information criterion
type = "none") #type of deterministic regressors to include
summary(var.a)
```
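Once the VAR is fitted, a residual check and a forecast follow naturally; both functions below are from the vars package loaded above, and the 12-step horizon is an assumption:

```r
# Portmanteau test for residual serial correlation (H0: no autocorrelation).
serial.test(var.a, lags.pt = 12, type = "PT.asymptotic")
# 12-step-ahead forecasts with 95% confidence intervals.
var.fc <- predict(var.a, n.ahead = 12, ci = 0.95)
fanchart(var.fc)  # fan chart of the forecasts
```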
# CHAPTER FIVE
## CONCLUSIONS AND RECOMMENDATIONS
### Summary
Broadly speaking, in this study we have presented a state-of-the-art review of the following popular time series forecasting models and their salient features:
- The Box-Jenkins or ARIMA models for linear time series forecasting.
- Some non-linear stochastic models, such as NMA, ARCH.
- SVM based forecasting models; LS-SVM and DLS-SVM.
### Conclusions
It has been seen that the proper selection of the model orders (in the case of ARIMA) and of the number of inputs, hidden units, outputs and the constant hyper-parameters (in the case of SVM) is extremely crucial for successful forecasting. We have discussed two important criteria, AIC and BIC, which are frequently used for ARIMA model selection.
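As an illustration of AIC/BIC model selection (run on the built-in `lh` series rather than the thesis data; the candidate orders are arbitrary):

```r
# Compare candidate ARIMA orders; lower AIC/BIC indicates a better penalized fit.
for (ord in list(c(1, 0, 0), c(2, 0, 1), c(0, 0, 2))) {
  fit <- arima(lh, order = ord)
  cat(sprintf("ARIMA(%d,%d,%d): AIC = %.2f  BIC = %.2f\n",
              ord[1], ord[2], ord[3], AIC(fit), BIC(fit)))
}
```

Because BIC penalizes extra parameters more heavily than AIC (log n > 2 for n > 7), it tends to favor more parsimonious models.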
We have considered a few important performance measures for evaluating the accuracy of forecasting models. It should be understood that, to obtain reasonable knowledge about the overall forecasting error, more than one measure should be used in practice. The last chapter contains the forecasting results of our experiments, performed on six real time series datasets. Our understanding of the considered forecasting models and their successful implementation can be observed from the five performance measures and the forecast diagrams we obtained for each of the six datasets. However, in some cases significant deviation can be seen between the original observations and our forecast values. In such cases, we suggest that suitable data preprocessing, other than what we have used in our work, may improve forecast performance.
### Recommendations
Time series forecasting is a fast-growing area of research and as such provides much scope for future work. One direction is the combining approach, i.e. combining a number of different and dissimilar methods to improve forecast accuracy. A lot of work has been done in this direction and various combining methods have been proposed in the literature \[8, 14, 15, 16\]. Together with other analyses in time series forecasting, we hope, if possible, to find an efficient combining model in future work, with the aim of further studies in time series modeling and forecasting.
# References