library(readxl)
library(ggplot2)
Notes on statistical inference, made while learning statistics for data scientists
Assumptions:
- Random Sampling from DGP
- Stability in DGP
- If n < 30 => then assume the DGP is normally distributed
Congress <- read_excel("Piracy.xlsx", col_names = TRUE)
mean = 30.7833   # sample mean
sd = 1.7862      # sample standard deviation
size = 12        # sample size
sample_sd = sd/sqrt(size)                  # standard error of the mean
lower = mean + qt(0.05, size-1)*sample_sd  # lower bound of the 90% CI (qt(0.05, df) is negative)
higher = mean - qt(0.05, size-1)*sample_sd # upper bound of the 90% CI
lower
## [1] 29.85729
higher
## [1] 31.70931
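Equivalently, both bounds can be obtained in one expression by passing the two tail probabilities to qt(); a minimal sketch using the same summary statistics as above:
30.7833 + qt(c(0.05, 0.95), df = 12 - 1) * 1.7862/sqrt(12)  # reproduces the lower and upper bounds above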
Assumptions:
- Random Sampling from DGP
- Stability in DGP
poll <- read_excel('Poll.xlsx', col_names = TRUE)
table(poll)
## poll
## 1 2
## 358 407
prop.test can be used for testing the null that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values.
prop.test(358, (358+407), conf.level = .95)
##
## 1-sample proportions test with continuity correction
##
## data: 358 out of (358 + 407), null probability 0.5
## X-squared = 3.0118, df = 1, p-value = 0.08266
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4322160 0.5040576
## sample estimates:
## p
## 0.4679739
binom.test() performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment. Here the hypothesized probability of success is provided as a parameter (0.5 in this case).
binom.test(358, (358+407), 0.5, conf.level = .9)
##
## Exact binomial test
##
## data: 358 and (358 + 407)
## number of successes = 358, number of trials = 765, p-value =
## 0.08259
## alternative hypothesis: true probability of success is not equal to 0.5
## 90 percent confidence interval:
## 0.4377759 0.4983554
## sample estimates:
## probability of success
## 0.4679739
The file WaitTime.xlsx contains data of wait times (in minutes) for a random sample of a bank’s customers. Find a 90% confidence interval for the mean wait time at the bank.
t.test() performs one- and two-sample t-tests on vectors of data and also gives a confidence interval. A one-sample t-test has two hypotheses: the null hypothesis assumes that no difference exists between the true mean (μ) and the comparison value (m0), whereas the alternative hypothesis assumes that some difference exists. The comparison value (m0) is specified with the mu parameter below; the default is 0. The value of mu does not affect the CI, since the CI estimates the true mean from the given sample alone.
waiting = read_excel('WaitTime.xlsx')
t.test(waiting$WaitTime, mu = 0, conf.level = .90)
##
## One Sample t-test
##
## data: waiting$WaitTime
## t = 22.057, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 5.048978 5.871022
## sample estimates:
## mean of x
## 5.46
t.test(waiting$WaitTime, mu = mean(waiting$WaitTime), conf.level = .90)
##
## One Sample t-test
##
## data: waiting$WaitTime
## t = 0, df = 99, p-value = 1
## alternative hypothesis: true mean is not equal to 5.46
## 90 percent confidence interval:
## 5.048978 5.871022
## sample estimates:
## mean of x
## 5.46
An equivalent alternative to t.test() is manual calculation as done below
sample_mean = mean(waiting$WaitTime)
sample_sd = sd(waiting$WaitTime)
se = sample_sd/sqrt(length(waiting$WaitTime))             # standard error of the mean
sample_mean + qt(0.05, length(waiting$WaitTime) - 1)*se   # lower bound; qt(0.05, df) is negative
## [1] 5.048978
sample_mean - qt(0.05, length(waiting$WaitTime) - 1)*se   # upper bound
## [1] 5.871022
- **Interpretation**: We are 90% confident that this interval captures the true mean waiting time; i.e., 90% of intervals constructed this way would contain it.
Type 1
- Incorrectly rejecting H0 when it is actually true
- Hence, you end up accepting Ha when in fact it is not true
- A Type I error is to falsely infer the existence of something that is not there [abandoning the common belief (H0) when it is actually true]
- False positive finding
- It is asserting something that is absent, a false hit
- The probability of making a Type I error is called α, the significance level
- α is the probability of the study rejecting the null hypothesis, given that it is true
- In statistical hypothesis testing, a result has statistical significance when p < α
Type 2
- Failure to reject H0 when it is actually false
- Hence, you end up not accepting Ha when in fact it is true
- A Type II error is to falsely infer the absence of something that is present [retaining the common belief (H0) when it is actually false]
- False negative finding
- It is failing to assert what is present, a miss
- The probability of making a Type II error is called β, and it is related to the power of a test (which equals 1 − β)
- β is the probability of incorrectly concluding no statistical significance
Example:
- H0 of a fire alarm: there is no fire
- Ha of a fire alarm: there is a fire
- Type I -> the fire alarm goes off when in fact there is no fire
- Type II -> a fire breaks out and the fire alarm does not ring
Failing to reject the null hypothesis when the alternative hypothesis is true -> Type 2
The probability of making a Type 2 error is β (a failure to reject Ho when it is false)
The closer the true mean is to the hypothesized mean, the higher the value of β. We can only control the Type 1 error, through the desired significance level; hence, design the hypotheses such that the error we want to control is the Type 1 error.
**Statistician's dodge:** since we cannot control the Type 2 error (i.e., accepting Ho when Ha is true), never draw the conclusion of accepting Ho; instead, conclude that we do not reject Ho, i.e., we failed to reject the null hypothesis.
The p-value is the probability of getting the sample result if Ho were true.
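A small simulation can make α and β concrete. The sketch below is illustrative only: the true means, sd, sample size, and replication count are arbitrary choices, and a t-test of Ho: μ = 5 is assumed.
set.seed(1)
alpha <- 0.05
n <- 30
# Type 1 error rate: Ho (mu = 5) is actually true; count how often we (wrongly) reject
mean(replicate(5000, t.test(rnorm(n, mean = 5, sd = 1), mu = 5)$p.value < alpha))    # should be close to alpha
# Type 2 error rate: Ho is false (true mean 5.3); count how often we fail to reject
mean(replicate(5000, t.test(rnorm(n, mean = 5.3, sd = 1), mu = 5)$p.value >= alpha)) # estimate of beta; power = 1 - beta
Moving the true mean closer to 5 in the second line increases the estimated β, as noted above.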
The one-sample t-test is a statistical procedure used to determine whether a sample of observations could have been generated by a process with a specific mean. Suppose you are interested in determining whether the waiting time at a clinic is less than 5 minutes. To test this hypothesis, you could collect a sample of waiting times and compare the sample mean with the value 5 using a one-sample t-test.
Manually calculating the t value using the one-sample formula $$ t = \frac{\bar{x} - \mu_0}{SE} \\ ~\\ ~\\ SE = \frac{s}{\sqrt{n}} \\ ~\\ ~\\ \text{where, }~~~df = n - 1 $$ The p-value of the t statistic can then be calculated as
se = sd(waiting$WaitTime)/sqrt(length(waiting$WaitTime))  # standard error of the mean
t = (mean(waiting$WaitTime) - 5)/se                       # t statistic for mu0 = 5
df = length(waiting$WaitTime) - 1
pt(t, df)                                                 # P(T <= t), the p-value for the "less than" alternative
## [1] 0.9669472
Assumptions:
- Random Sampling from DGP
- Stability in DGP
- If n < 30 => then assume the DGP is normally distributed
- Null hypothesis is true
Test if the waiting time is less than 5 min.
Null hypothesis -> true mean is greater than or equal to 5
Alt. hypothesis -> true mean is less than 5
t.test(waiting$WaitTime, alternative = 'less', mu=5)
##
## One Sample t-test
##
## data: waiting$WaitTime
## t = 1.8582, df = 99, p-value = 0.9669
## alternative hypothesis: true mean is less than 5
## 95 percent confidence interval:
## -Inf 5.871022
## sample estimates:
## mean of x
## 5.46
The likelihood of getting a sample statistic as extreme as 1.8582 is 0.9669 if the true mean waiting time is greater than or equal to 5.
The observed significance level (the p-value) is the probability of obtaining a result as extreme as, or more extreme than, the result actually obtained, when the null hypothesis is true.
The significance level and the confidence level are complementary: α = 1 − confidence level.
Test normality for generated numbers
Assumptions:
- Random Sampling from DGP
- Stability in DGP
seq <- rnorm(50, 5, 1)  # 50 draws from Normal(mean = 5, sd = 1); note this masks base::seq
hist(seq)
qqnorm(seq)
qqline(seq, col = "blue")
Null hypothesis -> distribution is normal
Alt. hypothesis -> distribution is not normal
shapiro.test(seq)
##
## Shapiro-Wilk normality test
##
## data: seq
## W = 0.97832, p-value = 0.4833
Since the p-value is high, we fail to reject the null hypothesis; hence, our assumption that the distribution is normal is not rejected.
Definition of P-Value: Given a null hypothesis and sample evidence of size n, the p-value is the probability of getting a sample evidence that is equally or more unfavorable to the null hypothesis while the null hypothesis is actually true.
The Chi Square distribution is the distribution of the sum of squared standard normal deviates. The degrees of freedom of the distribution is equal to the number of standard normal deviates being summed
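This definition can be checked by simulation; a minimal sketch (the replication counts and the evaluation point 5 are arbitrary):
set.seed(1)
# one squared standard normal deviate follows a chi-square distribution with 1 df
mean(rnorm(10000)^2 <= 5)                     # simulated P(X <= 5); compare with pchisq(5, 1)
# a sum of 3 squared standard normal deviates follows a chi-square distribution with 3 df
mean(replicate(10000, sum(rnorm(3)^2)) <= 5)  # compare with pchisq(5, 3)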
The probability of getting a chi-square value less than or equal to 5 (with 1 df) is
pchisq(5,1)
## [1] 0.9746527
The chi-square value associated with a cumulative probability of 0.9746527 is
qchisq(0.9746527,1)
## [1] 5.000001
The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
The Chi-Square goodness-of-fit test is a non-parametric test used to find out how significantly the observed values of a given phenomenon differ from the expected values. In the Chi-Square goodness-of-fit test, the term goodness of fit refers to comparing the observed sample distribution with the expected probability distribution.
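For a pure goodness-of-fit test, chisq.test() takes a vector of observed counts and the hypothesized probabilities via the p argument. A sketch with made-up counts (a die rolled 120 times, testing whether all faces are equally likely):
observed <- c(25, 17, 15, 23, 24, 16)  # hypothetical counts for faces 1 to 6
chisq.test(observed, p = rep(1/6, 6))  # H0: observed counts follow the uniform expected distribution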
Null hypothesis -> There is no significant difference between the observed and the expected value.
Alt. hypothesis -> There is a significant difference for at least one category
Assumptions:
- Random Sampling from DGP
- Stability in DGP
- Expected value for each group is at least 5
gas<- read_excel("Gasoline.xlsx", col_names=TRUE)
head(gas)
## # A tibble: 6 x 3
## Owner `Second-last` Last
## <dbl> <dbl> <dbl>
## 1 1 3 3
## 2 2 4 3
## 3 3 2 2
## 4 4 3 3
## 5 5 4 3
## 6 6 2 4
table(gas$`Second-last`, gas$Last)
##
## 1 2 3 4
## 1 47 14 27 14
## 2 18 63 33 13
## 3 26 48 59 16
## 4 30 23 29 53
result <- chisq.test(table(gas$`Second-last`, gas$Last))
result
##
## Pearson's Chi-squared test
##
## data: table(gas$`Second-last`, gas$Last)
## X-squared = 114.12, df = 9, p-value < 2.2e-16
Check that the expected values are greater than or equal to 5
result$expected
##
## 1 2 3 4
## 1 24.05848 29.42690 29.42690 19.08772
## 2 29.95517 36.63938 36.63938 23.76608
## 3 35.14425 42.98635 42.98635 27.88304
## 4 31.84211 38.94737 38.94737 25.26316
Cell-by-cell contributions to observed Chi-Square value
(result$residuals)^2
##
## 1 2 3 4
## 1 21.8764183 8.0874729 0.2001518 1.3561017
## 2 4.7713302 18.9654562 0.3614979 4.8770563
## 3 2.3792598 0.5847585 5.9655403 5.0642490
## 4 0.1065681 6.5298009 2.5406117 30.4527412
Null hypothesis -> No Relationship
Alt. hypothesis -> Relationship
Looking at relationships between variables to understand and predict what is going on.
Response | Predictor |
---|---|
Nominal | Nominal |
Ho:No relationship exists; probabilistic independence
Ha: Relationship exists; probabilistic dependence
Ex. Transit Railroads is interested in the relationship between travel distance and the ticket class purchased. A random sample of 200 passengers is taken.
transit<-read_excel("Transit.xlsx", na="NA", col_names = TRUE)
table(transit)
## Distance
## Class 1-100 101-200 201-300 301-400 401-500
## First 6 8 15 21 10
## Second 14 16 17 14 6
## Third 21 18 16 12 6
result <- chisq.test(table(transit))
result
##
## Pearson's Chi-squared test
##
## data: table(transit)
## X-squared = 15.923, df = 8, p-value = 0.0435
# cell-by-cell contributions to observed Chi-Square value: ((O-E)^2)/E
result$residuals
## Distance
## Class 1-100 101-200 201-300 301-400 401-500
## First -1.7963377 -1.2959032 0.1581139 1.8375516 1.3234482
## Second 0.0715042 0.5145295 0.2294271 -0.4397685 -0.5046460
## Third 1.5600514 0.6819306 -0.3631420 -1.2446101 -0.7163714
Response | Predictor |
---|---|
Interval | Nominal |
F - Distribution: Generally arises from a statistic that involves a ratio of variances. pf & qf functions in R for handling F-distribution
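A quick sketch of these two functions (the value 3 and the degrees of freedom are arbitrary):
pf(3, df1 = 2, df2 = 15)             # P(F <= 3) for an F(2, 15) distribution
qf(pf(3, 2, 15), df1 = 2, df2 = 15)  # inverts back to 3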
F statistic is the value we receive when we run an ANOVA test on different groups to understand the differences between them. The F statistic is given by the ratio of between group variability to within group variability
One-way ANOVA is a test of the relationship between an interval-level response variable and a nominal-level predictor variable; it can also be expressed as a test of multiple means. If there are two predictors, it becomes a two-way ANOVA.
Assumptions:
- Random Sampling from DGP
- Stability in DGP
- For each level of the predictor variable, if n < 30 then we assume a normal distribution
- Standard deviation is the same for all the groups (evaluate with Levene's test)
Ho: No relationship exists; distribution is identical across all categories
Ha: Relationship exists; difference in means across at least two categories
chol<- read_excel("Cholesterol.xlsx", col_names=TRUE)
head(chol)
## # A tibble: 6 x 2
## Drug CholReduction
## <chr> <dbl>
## 1 A 22
## 2 A 31
## 3 A 19
## 4 A 27
## 5 A 25
## 6 A 18
fit <- aov(CholReduction ~ Drug, data=chol)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Drug 2 2152.1 1076.1 40.79 8.59e-07 ***
## Residuals 15 395.7 26.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
DF = (k-1), (n-k) = 2, 15. Mean Sq gives MST (above) and MSE (below); Sum Sq gives SST (above) and SSE (below). The probability of getting a sample statistic as or more extreme than 40.79 is the p-value, which is extremely low. Hence, there is a significant difference in means between at least two drugs.
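The p-value in the table can be reproduced directly from the F distribution using the degrees of freedom above:
pf(40.79, df1 = 2, df2 = 15, lower.tail = FALSE)  # ~8.6e-07, matching Pr(>F) above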
boxplot(CholReduction ~ Drug, data=chol)
Use the Tukey HSD test to determine which pairs of drugs differ significantly.
Studentised range distribution is used to test each pairwise difference
Assumptions: Same as ANOVA
- Random Sampling from DGP
- Stability in DGP
- For each level of the predictor variable, if n < 30 then we assume a normal distribution
- Standard deviation is the same for all the groups (evaluate with Levene's test)
Ho:No relationship exists: μi = μj for each i, j being checked
Ha: Relationship exists: μi ≠ μj
TukeyHSD(fit, conf.level = .9)
## Tukey multiple comparisons of means
## 90% family-wise confidence level
##
## Fit: aov(formula = CholReduction ~ Drug, data = chol)
##
## $Drug
## diff lwr upr p adj
## B-A 15.50000 8.916894 22.08311 0.0002841
## C-A -11.16667 -17.749773 -4.58356 0.0049961
## C-B -26.66667 -33.249773 -20.08356 0.0000006
Ex. here, B is greater than A by 15.5. The probability of getting a sample statistic as or more extreme than 15.5, if Ho were true, is 0.00028. Hence, all three drugs differ significantly from each other.
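The adjusted p-values come from the studentized range distribution. A rough check of the B-A comparison, assuming a balanced design with 6 observations per drug (k = 3 groups with 15 residual df implies n = 6 per group) and using the rounded values from the ANOVA table:
mse <- 395.7/15                                     # mean squared error from the ANOVA table
q <- 15.5/sqrt(mse/6)                               # studentized range statistic for the B-A difference
ptukey(q, nmeans = 3, df = 15, lower.tail = FALSE)  # roughly reproduces the adjusted p-value of 0.0002841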
Levene's test is used whenever we wish to determine whether the variances are equal across different groups.
Assumptions:
- Random Sampling from DGP
- Stability in DGP
Ho: σ1 = σ2 = … = σk ; k independent samples/sets of responses
Ha: At least one pair is not equal
library('car')
## Loading required package: carData
leveneTest(chol$CholReduction, chol$Drug)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.0882 0.9161
## 15
Response | Predictor |
---|---|
Interval | At least one Interval |
Assumptions:
- Random Sampling from DGP
- Stability in DGP
- εi ~ Normal (Mean 0, SD σ) constant, over all possible values of the predictors. We can use the following to evaluate this assumption
- Standardized Residuals are normally distributed
- Mean 0, SD fixed across predictors. Plot St. residual plots for individual predictors (Xi) and predicted values (Y hat)
We are interested in predicting the amount of carbohydrates (in grams) for a menu item based on its calorie content (measured in 100s).
starbucks <- read_excel('starbucks.xlsx')
head(starbucks)
## # A tibble: 6 x 7
## item calories fat carb fiber protein type
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 8-Grain Roll 3.5 8 67 5 10 bakery
## 2 Apple Bran Muffin 3.5 9 64 7 6 bakery
## 3 Apple Fritter 4.2 20 59 0 5 bakery
## 4 Banana Nut Loaf 4.9 19 75 4 7 bakery
## 5 Birthday Cake Mini Doughnut 1.3 6 17 0 0 bakery
## 6 Blueberry Oat Bar 3.7 14 47 5 6 bakery
attach(starbucks)
linefit <- lm(carb ~ calories, data = starbucks)
plot(starbucks$calories, starbucks$carb)
abline(linefit,lty=2, col="blue")
Ho:No relationship exists; β1 = β2 =…=βk = 0
Ha: Relationship exists; at least one predictor βi ≠ 0
ANOVA is used to test this
anova(linefit)
## Analysis of Variance Table
##
## Response: carb
## Df Sum Sq Mean Sq F value Pr(>F)
## calories 1 9486.4 9486.4 62.772 1.673e-11 ***
## Residuals 75 11334.3 151.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ho:No relationship exists; βi = 0, 1 ≤ i ≤ k
Ha: Relationship exists; βi ≠ 0
For each predictor two tailed T - test is used to test this
summary(linefit)
##
## Call:
## lm(formula = carb ~ calories, data = starbucks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.477 -7.476 -1.029 10.127 28.644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.944 4.746 1.884 0.0634 .
## calories 10.603 1.338 7.923 1.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.29 on 75 degrees of freedom
## Multiple R-squared: 0.4556, Adjusted R-squared: 0.4484
## F-statistic: 62.77 on 1 and 75 DF, p-value: 1.673e-11
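Multiple R-squared can be recovered from the ANOVA table above, as the share of total variation explained by calories:
9486.4/(9486.4 + 11334.3)  # SSR/SST = 0.4556, the Multiple R-squared reported above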
Standardized residuals plot
linefit.stres <- rstandard(linefit)
plot(starbucks$calories, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Calories", ylab = "Standardized Residuals")
abline(0,0, lty=2, col="blue")
Normal probability plot (QQPlot)
qqnorm(linefit.stres, main = "Normal Probability Plot", xlab = "Normal Scores", ylab = "Standardized Residuals")
qqline(linefit.stres, col = "blue")
Shapiro Wilk Normality test
shapiro.test(linefit.stres)
##
## Shapiro-Wilk normality test
##
## data: linefit.stres
## W = 0.99032, p-value = 0.832
plot(linefit$fitted.values, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Fitted Carbs", ylab = "Standardized Residuals")
abline(0,0, lty=2, col="red")
plot(starbucks$calories, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Fitted Chol", ylab = "Standardized Residuals")
abline(0,0, lty=2, col="red")
confint(linefit, level=.9)
## 5 % 95 %
## (Intercept) 1.039446 16.84767
## calories 8.374277 12.83190
predict(linefit,data.frame(calories=4.5), interval ="confidence", level = .80)
## fit lwr upr
## 1 56.65746 54.01528 59.29964
predict(linefit, data.frame(calories=4.5), interval="predict", level = .80)
## fit lwr upr
## 1 56.65746 40.5449 72.77002
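Both calls give the same fitted value, which is just the estimated regression equation evaluated at calories = 4.5; the "confidence" interval is for the mean carb content at that calorie level, while the wider "prediction" interval is for an individual menu item.
8.944 + 10.603*4.5  # about 56.66, the fitted value shown in both intervals above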