-
Notifications
You must be signed in to change notification settings - Fork 1
/
stas_asst.Rmd
615 lines (397 loc) · 19.1 KB
/
stas_asst.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
---
title: "Probability & Statistic Final Assignment"
author:
- Tuan Hoang, HE141540
- Lương Văn Hải, HE141667
- Lương Thế Hải, HE140512
- Hoàng Đức Hùng, HE140928
- Vũ Thế Long, HE141751
date: "03/18/2022"
output:
html_document:
fig_height: 4
highlight: pygments
theme: spacelab
editor_options:
markdown:
wrap: 72
---
## Setup
### Load packages
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r load-packages, message = FALSE}
library(ggplot2)
library(dplyr)
library(gridExtra)
library(gssr)
```
### Load data
```{r load-data, message=FALSE}
data(gss_doc)
gss21 <- gss_get_yr(2021)
```
* * *
## Section 1: Data
### Overview
The General Survey (also known as GSS) has been conducted since 1972 to cater to statisticians and data scientists discovering more about public opinion trends in the US.
### Sampling Design
The collecting data method in 2021 was implemented significant methodological changes to ensure the safety of both respondents and interview. For instances, moving from in-person to address-based sampling and online methodology.
Let's go through the details
In order to conduct the GSS, agents obtain samples of web-based and address-based survey
<br>
<br>
<ul>
<li><b>The mail-based Sample</b><br>
<p>
Used DSS - Disproportionate stratified sampling since 2006. They got addresses from strata that were based on NORC’s master sample and their charateristics such as: the the density of the household which is from the America Community Survey. Samples are taken from the north-west corner inward until quotas are met.
In that design, addresses were classified into 2 kinds of stratum that either was metropolitan or rura and within it new primary sampling units (PSUs) are being used to yield addresses for surveying.
</p>
</li>
<li><b>The Web-based Sample</b><br>
<p>
As an online questionnaire. In this survey, some of the questions that had the same theme were listed on the screen together as a grid so that they can answer them in one go.
Before being published, the data were weighted based on the proportions to equal the 2019 American Community Survey Census Bureau to ensure unbiased characteristics.
</p>
</li>
</ul>
### Scope of inference
As mentioned above, GSS use stratified sampling design based on the NORC master sample to select a representative sample from each state residents.
Thus, each state began with a single stratum to provide adequate sample sizes which is defined populations of interest but many states sample were disproportionate in their strata that correspond to sub-state or non-metropolitant regions.
#### Generalizability
The data are collected from multiple areas. Moreover, in each cluster group is sampled, the measure is taken at higher levels and by percentages to ensure the distribution of key variables (sex, gender, etc).
Therefore, it can be generalized to the US population.
#### Causality
Making causal conclusions based on observational data is not recommended. Observational studies are only sufficient to show correlations.
### Summary
The data collection is based on a stratified sampling strategy. The population is divided into groups (Region) called strata. Then a random sampling (characteristics), has been employed within each stratum.It can be generalized to the US population
<b>The data and further information were used in this project from the following sources:</b>
General Information: [2021 Survey Data
Information](http://gss.norc.org/)
* * *
## Section 2: Research questions
We would like to focus on the behaviors, demographics, and opinions
**Research question 1:**
Do women oftentimes feel uneasy strolling alone at night in their neighborhood?
Most people often hold the common perception that the girls are more afraid of walking alone at night than the males do.
We can further dive into it by splitting the data by geographical locations
* Explanatory variable: `sex`
```{r}
gss_doc %>% filter(id == "sex") %>%
select(description)
```
* Response variable: `fear`
```{r}
gss_doc %>% filter(id == "fear") %>%
select(description)
```
**Research question 2:**
Do women or men feel more comfortable when strolling alone at night in their
neighborhood?
- Explanatory variable: `sex`
```{r}
gss_doc %>% filter(id == "sex") %>%
select(description)
```
- Response variable: `fear`
```{r}
gss_doc %>% filter(id == "fear") %>%
select(description)
```
**Research question 3:**
Can we predict how much time users spend using email based on the internet hours and age?
- Explanatory variable: `age` and `wwhr`
```{r}
gss_doc %>% filter(id == "age") %>%
select(description)
gss_doc %>% filter(id == "wwwhr") %>%
select(description)
```
- Response variable: `emailhr`
```{r}
gss_doc %>% filter(id == "emailhr") %>%
select(description)
```
------------------------------------------------------------------------
## Section 3: EDA - Exploratory data analysis
**Research question 1:**
***Do women oftentimes feel uneasy strolling alone at night in their neighborhood?***
First, we need to get the data and clean to get the grasp of the data
```{r load-ffear-data}
gss21_sex <-
gss21 %>% filter(!is.na(gss21$sex), is.numeric(gss21$sex)) %>%
mutate_at(vars(matches("sex")) , ~ifelse(sex == 1 , "Male", "Female"))
ffear <-
gss21_sex %>% filter(sex == "Female", !is.na(gss21_sex$fear), is.numeric(gss21_sex$fear)) %>%
mutate_at(vars(matches("fear")), ~ifelse(fear == 1, "Yes", "No"))
ffear %>% group_by(sex, fear) %>% summarise(count = n())
```
What I did there is filter out the "NA" entries and convert fields to human-readable format.
As depicted from the table, there are a significant amount of observations that women are comfortable strolling at night. This also indicates that most girls are all right to take a walk in dark.
Let's further investigate the statement
```{r plot-ffear-sample}
ffear$fear <- as.factor(ffear$fear)
ggplot(ffear, aes(x = fear)) +
geom_bar(fill = "#8FDEE1") +
xlab("Fear") +
ylab("Number of people") +
ggtitle('Sample Distribution')
```
As is evident from the graph, the number of girls who are not afraid
when walking at night outweighed the scared ones.
**Research question 2:**
***Do women or men feel more comfortable when strolling alone at night in their
neighborhood?***
In the above we have data about female, in this part we need to get more data of male about whether they feel comfortable when strolling alone at night in their neighborhood?
```{r load-mfear-data}
fear <- gss21_sex %>% filter(!is.na(gss21_sex$fear), is.numeric(gss21_sex$fear)) %>%
mutate_at(vars(matches("fear")), ~ifelse(fear == 1, "Yes", "No"))
fear$fear <- as.factor(fear$fear)
fear$sex <- as.factor(fear$sex)
mfear <-
fear %>% filter(sex == "Male")
fear %>% group_by(sex, fear) %>% summarise(count = n())
```
Below is a chart showing the above data
```{r plot-fear-sample}
ggplot(fear, aes(x=fear, fill = fear)) +
geom_bar(position="dodge") +
facet_wrap( ~ sex, ncol=2) +
xlab("Fear") +
ylab("Number of people") +
ggtitle('Sample Distribution')
```
From the graph we can see that the number of people who feel comfortable walking at night in their
neighborhood is in the majority of both men and women.
**Research question 3:**
Firstly, Let's filter out outliners from the data using IQR technique
```{r summary-q3-data}
gss_q3 <-
gss21 %>% filter(!is.na(gss21$age), is.numeric(gss21$age), !is.na(gss21$wwwhr), is.numeric(gss21$wwwhr), !is.na(gss21$emailhr), is.numeric(gss21$emailhr), ) %>%
mutate(age_grp = case_when(
age <= 30 ~ 'Adult ',
age <= 60 ~ 'Middle Age',
age > 60 ~ 'Elder'
))
www_Q1 <- quantile(gss_q3$wwwhr, .25)
www_Q3 <- quantile(gss_q3$wwwhr, .75)
www_IQR <- IQR(gss_q3$wwwhr)
em_Q1 <- quantile(gss_q3$emailhr, .25)
em_Q3 <- quantile(gss_q3$emailhr, .75)
em_IQR <- IQR(gss_q3$emailhr)
gss_q3 <- gss_q3 %>% filter((gss_q3$wwwhr > (www_Q1 - 1.5*www_IQR) & gss_q3$wwwhr < (www_Q1 + 1.5*www_IQR)),
(gss_q3$emailhr > (em_Q1 - 1.5*em_IQR) & gss_q3$emailhr < (em_Q1 + 1.5*em_IQR))
)
gss_q3$age_grp <- as.factor(gss_q3$age_grp)
summary(gss_q3[c('age_grp', 'wwwhr', 'emailhr')])
```
```{r plot_box}
ggplot(gss_q3, aes(x = age_grp, y = wwwhr)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_hline(yintercept = median(gss_q3$wwwhr))
```
As what we see on the plot, the time adults spend time on surfing the
internet is much longer than other 2 age groups and they are also using
email much more longer than the other ones and people in Elder age group
spend time on the internet and email least.
```{r histogram}
q3h_www <- ggplot(gss_q3, aes(x = wwwhr)) +
geom_histogram(binwidth = 1) +
labs("Internet Time")
q3h_email <- ggplot(gss_q3, aes(x = emailhr)) +
geom_histogram(binwidth = 1)+
labs("Email Time")
grid.arrange(q3h_www, q3h_email, nrow = 2)
```
The plot shows to us that most of the people who did this survey spend
less than 10 hours a week for the internet and less than 6 hours a week
for email
```{r qualify_cc}
gss_q3 %>%
summarise(cor(wwwhr, emailhr))
```
The scope value of 2 variables was almost zero, we will look into this by representing it in scatter graph.
``` {r }
ggplot(data = gss_q3, aes(x = wwwhr, y = emailhr )) + geom_jitter() + geom_smooth(method = "lm", se = FALSE)
```
Here is the regression line for predicting the time people spend time
for email by the time they spend for internet
------------------------------------------------------------------------
## Section 4: Inference
**Research question 1:**
***Confidence Interval:***
*Conditions for central limit theorem:*
1. Independence Check:
The observations are collected from a simple random sample. Also, it consists of fewer observations than 10% of the US population, which we could safely assume to be independent.
2. Sample size / skew check (success-failure check):
The sample size `n` of our dataset was 2,197. From which 1,239 success (not-afraid) cases and 958 failure (scare) cases.
Both conditions passed, thus the distribution is nearly normal as defined by the central limit theorem.
We will construct a 95% CI for the proportion of girls that are not afraid to walk alone in the evening in their locals
```{r compute-ffear-ci}
ffear_len <- length(ffear$fear)
ffear_p_hat <- sum(ffear$fear == "No") / ffear_len
ffear_conf_level <- 0.95
ffear_perc_crit_value <- ffear_conf_level + ((1 - ffear_conf_level) / 2)
ffear_zstar <- qnorm(ffear_perc_crit_value)
ffear_se <- sqrt(ffear_p_hat * (1 - ffear_p_hat) / ffear_len)
ffear_me <- ffear_zstar * ffear_se
ffear_ci <- ffear_p_hat + c(-1,1) * ffear_me
cat(paste0("n = ", ffear_len, ", p-hat = ", round(ffear_p_hat, 4), "\n",
ffear_conf_level, "% CI: (", round(ffear_ci[1], 4), " , ", round(ffear_ci[2], 4), ")\n"))
```
We are `r ffear_conf_level * 100`% confident that `r round(ffear_ci[1], 4)`% to `r round(ffear_ci[2], 4)`% of all American females are ok with night strolls.
***Hypothesis testing:***
Let's continue the question from above:
We have 2 hypothesizes.
*H0:* Girls are terrified of walking without accompany at night in their near vicinity (p = 0.5)
*H1:* They walk at night in their neighborhood without hesitant (p > 0.5)
*Conditions for central limit theorem:*
1. Independence Check:
As mentioned above, we can safely assume it independence
2. Sample size / skew check (success-failure check):
From the data, we see that the observed success rate (p-hat) is `r ffear_p_hat` <br/>
The test proportion `p` used in our hypotheses will be 0.5 <br/>
We also had np = `r ffear_len` * 0.5 = `r ffear_len * 0.5`
Those gave us more than 10 cases for both successes and failures
Because all the tests passed, the distribution is nearly normal as defined by the central limit theorem.
Let's look into this.
```{r compute-ffear-ht}
ffear_null = 0.5
ffear_zstats <- (ffear_p_hat - ffear_null) / ffear_se
ffear_x_min <- ffear_p_hat
ffear_x_max <- Inf
ffear_p_value <- pnorm(ffear_zstats, lower.tail = FALSE)
ffear_p_val_to_print <- ifelse(round(ffear_p_value, 4) == 0, "< 0.0001", round(ffear_p_value, 4))
cat(paste0("n = ", ffear_len, ", p-hat = ", round(ffear_p_hat, 4), "\n",
"H0: p = ", ffear_null, "\n" ,
"H1: p >", " ", ffear_null, "\n",
"z = ", round(ffear_zstats, 4), "\n",
"p_value = ", ffear_p_val_to_print
))
```
The hypothesis test yields `r ffear_zstats`, making the p-value negligible (`r ffear_p_val_to_print`).
Let's construct the null sample graph to make it more transparent
```{r plot-ffear-null}
ggplot(data.frame(x = c(ffear_null - 4 * ffear_se, ffear_null + 4 * ffear_se)), aes_string(x= 'x')) +
stat_function(fun = dnorm, args = list(mean = ffear_null, sd = ffear_se), color = "#999999") +
annotate("rect", xmin = ffear_x_min, xmax= ffear_x_max, ymin = 0, ymax = Inf, alpha = .3, fill= "#FABAB8") +
ggtitle("Null distribution") +
xlab("") +
ylab("") +
geom_vline(xintercept = ffear_p_hat, color ="#F57670", lwd = 1.7)
```
By observing the results, we can see that the probability of the null hypothesis is almost 0. Therefore, we reject the null hypothesis.
**Research question 2:**
***Confidence Interval:***
*Conditions for central limit theorem:*
Here we test with two probabilities p_hat_Male and p_hat_Female
with `ffear_p_hat` the probability (success)(not afraid) with the female gender(p^2)(calculated above)<br>
and `mfear_p_hat` probability (success)(not afraid) for male gender(p^1)
estimate for difference`fear_p_hat_diff` (p^)
We will construct a 95% CI for the proportion of women and men that are not afraid to walk alone in the evening in their locals
```{r compute-fear-ci}
mfear_len = length(mfear$fear)
mfear_p_hat <- sum(mfear$fear == "No") / mfear_len
fear_p_hat_diff <- mfear_p_hat - ffear_p_hat
fear_conf_level <- 0.95
fear_perc_crit_value <- fear_conf_level + ((1 - fear_conf_level) / 2)
fear_zstar <- qnorm(fear_perc_crit_value)
fear_se <- sqrt((mfear_p_hat * (1 - mfear_p_hat) / mfear_len) + (ffear_p_hat * (1 - ffear_p_hat) / ffear_len))
fear_me <- fear_zstar * fear_se
fear_ci <- fear_p_hat_diff + c(-1,1) * fear_me
cat(paste0("Response variable: categorical (", length(levels(fear$fear)), " levels, success: ", "No", ")\n"))
cat(paste0("Explanatory variable: categorical (", length(levels(fear$sex)), " levels) \n"))
fear_grp1 = "Male";
fear_grp2 = "Female";
cat(paste0("n_", fear_grp1, " = ", mfear_len, ", p_hat_", fear_grp1, " = ", round(mfear_p_hat, 4), "\n",
"n_", fear_grp2, " = ", ffear_len, ", p_hat_", fear_grp2, " = ", round(ffear_p_hat, 4), "\n",
fear_conf_level, "% CI (", fear_grp1 ," - ", fear_grp2,"): (", round(fear_ci[1], 4), " , ", round(fear_ci[2], 4), ")\n"))
```
We are `r fear_conf_level * 100`% confident that `r round(fear_ci[1], 4)`% to `r round(fear_ci[2], 4)`% of all American females and males are ok with night strolls.
The graph clearly shows the probability of being afraid and not afraid of the two sexes
```{r}
ggplot(data = fear, aes(x = fear, fill = sex), environment = environment()) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("#1FBEC3", "#8FDEE1")) +
xlab("Fear") +
ylab("") +
ggtitle("Sample Distribution") +
guides(fill = guide_legend(title = "Sex"))
```
***Hypothesis testing:***
Let's continue the question from above:
We have 2 hypothesizes.
*H0:* Percentage of girls and boys who are afraid to walk unaccompanied at night in their vicinity is the same (p_Male = p_Female)
*H1:* The proportion of girls and boys who are afraid to walk unaccompanied at night in their vicinity is not the same ( p_ Male != p_ Female )
*Conditions for central limit theorem:*
1. Independence Check:
As mentioned above, we can safely assume it independence
2. Sample size / skew check (success-failure check):
From the data, we see that the observed success rate (p_hat_diff) is `r fear_p_hat_diff` <br/>
The test proportion `p` used in our hypotheses will be 0.5 <br/>
Those gave us more than 10 cases for both successes and failures
Because all the tests passed, the distribution is nearly normal as defined by the central limit theorem.
Let's look into this.
```{r compute-fear-ht}
fear_null = 0.2
fear_zstats <- (fear_p_hat_diff - fear_null) / fear_se
fear_suc_tot <- sum(ffear$fear == "No") + sum(mfear$fear == "No")
fear_n_tot <- length(fear$fear)
fear_p_pool <- fear_suc_tot / fear_n_tot
fear_x_min <- if(fear_p_hat_diff >= fear_null) c(fear_null - (fear_p_hat_diff - fear_null), fear_p_hat_diff) else c(fear_p_hat_diff, fear_null + (fear_null - fear_p_hat_diff))
fear_x_max <- c(-Inf, Inf)
fear_p_value <- 2 * pnorm(abs(fear_zstats), lower.tail = FALSE)
fear_p_val_print <- ifelse(round(fear_p_value, 4) == 0, "< 0.0001", round(fear_p_value, 4))
cat(paste0("H0: p_", fear_grp1, " = p_", fear_grp2, "\n"),
"H1: p_", fear_grp1, " ", "!=", " p_", fear_grp2, "\n",
"z = ", round(fear_zstats, 4), "\n",
"p_value = ", fear_p_val_print)
```
From the results we see that z = -0.2712 is not in the specified range. So we reject H0, accept H1
See the chart below for more details
```{r plot-fear-null}
ggplot(data.frame(x = c(fear_null - 4 * fear_se, fear_null + 4 * fear_se))) +
stat_function(fun = dnorm, args = list(mean = fear_null, sd = fear_se), color = "#999999") +
annotate("rect", xmin = fear_x_min, xmax = fear_x_max, ymin = 0, ymax = Inf,
alpha = 0.3, fill = "#FABAB8") +
ggtitle("Null Distribution") +
xlab("") +
ylab("") +
geom_vline(xintercept = fear_p_hat_diff, color = "#F57670", lwd = 1.5)
```
------------------------------------------------------------------------
## Section 5: Prediction
**Research question 3:**
Can we predict how much time users spend using
email based on the internet hours and age?
``` {r}
m1 <- lm(wwwhr ~ emailhr + age_grp , data = gss_q3)
summary(m1)$adj.r.squared
m2 <- lm(wwwhr ~ emailhr , data = gss_q3)
summary(m2)$adj.r.squared
```
Both model had
R-square ~ 0. Thus, two variables has a weak correlation. Therefore, we cannot use this model to lm to predict
The best model has analysis of variance such as:
```{r}
anova(m1)
```
------------------------------------------------------------------------
## Section 6: Conlusion
**Research question 1:**
From our hypothesis testing results, we can accept the alternative
hypothesis. That is, that females are afraid of walking alone at night
in their local neighborhood. As mentioned previously, further research
could be done to analyze this response by geographical location to see
if this result varies across different regions in America.
**Research question 2:**
From our hypothesis testing results, we can accept the alternative hypothesis. The proportion of girls and boys who are afraid to walk unaccompanied at night in their vicinity is not the same . <br>
As mentioned previously, further research could be done to analyze this response by geographical location to see if this result varies across different regions in America.
**Research question 3:**
We cannot draw any conclusion from the model.
* * *
## Section 7: Citations & References
Healy K (2019). gssr: General Social Survey data for use in R. R package version 0.2.0, http://kjhealy.github.io/gssr.
GSS website: http://gss.norc.org/ (11/01/2021)