-
Notifications
You must be signed in to change notification settings - Fork 0
/
FirstMarkdown.Rmd
269 lines (189 loc) · 12.4 KB
/
FirstMarkdown.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
---
title: "Murder Data"
author: "Sonia Baron, Elise Gonzalez, Zim Ouilette"
date: "March 26, 2018"
output:
html_document: default
word_document: default
---
#Introduction.
The current project is an analysis of five selected cities across the years. These cities were selected based on a follow-up article of our original FiveThirtyEight data. Published in july 2017, this article provided the cities with the highest and the lowest changes from 2016 to mid-year 2017. We wondered whether these cities presented a sudder change, or if they were slowly changing into such results.
In a follow up article, the murder rates from 2017 across the US are compared to the murder rates from the 1990s. The article explains that while the murder rates have seen an overall increase over the past few years (since 2014), they are not nearly as high as in the 1990s. Another interesting point that was made is that there is evidence that shows that when a city's murder rate increases, the national average of murder rates seems to counteract or account for this by showing a decrease.
Big cities also tend to skew the average murder rates in the US.
The second half of the year, murders
install.packages(c("maps", "mapdata"))
install.packages("reshape2")
install.packages("mapdata")
install.packages("ggmaps")
## loading libraries
```{r setup, include=FALSE}
#Loading our libraries for running later executions.
knitr::opts_chunk$set(echo = TRUE)
library("knitr") # for kable and knitting results into a report
library("broom")
library("tidyverse")
library("modelr")
library("UsingR")
library("GGally")
library("dplyr")
library("plyr")
library("reshape2")
library("maps")
library("mapdata")
```
## Load data sets
```{r FiveThirtyEight data}
#In this chunk we load the initial dataset and clean it. It took us a few minutes to figure out, but the data set is fairly basic.
# Data sets from FiveThirtyEight
murder15A <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/murder_2016/murder_2015_final.csv")
murder16A <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/murder_2016/murder_2016_prelim.csv")
#A items are the uncleaned versions, murder15 and murder16, etc. are cleaned.
murder15 <- subset(murder15A, select = -c(change))
head(murder15)
murder16 <- subset(murder16A, select = -c(source,as_of,change))
head(murder16)
# Last dates data for 2016 was collected
muder16Dates <- subset(murder16A, select = -c(source,change))
# need to organize these two by state
?arrange
murder15 <- murder15 %>%
arrange(murder15$state)
murder16 <- murder16 %>%
arrange(murder16$state)
```
```{r, FBI Data}
#Loads the cleaned FBI data we retrieved off the FBI website. Source at bottom of this doc.
FBI2000 <- read.csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2000.csv")
FBI2005 <- read.csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2005.csv")
FBI2010 <- read.csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2010.csv")
FBI2015 <- read.csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2015.csv")
FBI2016 <- read.csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2016.csv")
```
# Understanding the FiveThirtyEight Data
We run a multiple linear regression matrix to get a better sense of the data
```{r EDA with ggpairs}
#An analysis using ggpairs. The amount of data in here is staggering, just look at it all!
data(murder15)
ggpairs (murder15,
lower = list(
continuous = "smooth"),
axisLabels ="none",
switch = 'both',
cardinality_threshold = 100)
data(murder16)
#str(murder15)
ggpairs (murder16,
lower = list(
continuous = "smooth"),
axisLabels ="none",
switch = 'both',
cardinality_threshold = 100)
```
```{r heat map}
```
```{r state values, echo = FALSE}
FBI2016$STATE
FBI2015$STATE
FBI2010$STATE
FBI2005$STATE
FBI2000$STATE
#We used this to display the list of states (rowss) and then We went through and wrote down the row numbers of the states that we are going to work with. The row numbers vary between the data sets we are using which is why it is important that we select the correct row numbers for each data set.
```
```{r exclusion}
#we are only using certain states and we are removing the repeating "State" columb from each dataframe.
FBI2000inc<-FBI2000[c(19,21,26,34,39),]
FBI2000dec<-FBI2000[c(11,17,31,33,45),]
FBI2005inc<-FBI2005[c(17,19,24,32,37),-1]
FBI2005dec<-FBI2005[c( 9,15,29,31,42),-1]
FBI2010inc<-FBI2010[c(18,20,25,33,38),-1]
FBI2010dec<-FBI2010[c(10,16,30,32,43),-1]
FBI2015inc<-FBI2015[c(18,20,25,33,38),-1]
FBI2015dec<-FBI2015[c(10,16,30,32,43),-1]
FBI2016inc<-FBI2016[c(18,20,25,33,38),-1]
FBI2016dec<-FBI2016[c(10,16,30,32,43),-1]
```
```{r bind data frames}
#here, we are combining the data frames from above to create two dataframes to show just the states that showed an increase in murder in 2017 and those that showed a decrease in murder, respectively.
cleanFBIinc<-cbind.data.frame(FBI2000inc,FBI2005inc,FBI2010inc,FBI2015inc,FBI2016inc)
cleanFBIdec<-cbind.data.frame(FBI2000dec,FBI2005dec,FBI2010dec,FBI2015dec,FBI2016dec)
cleanFBIinc
cleanFBIdec
```
we could not find the data by city across years, so we plot the states of each of the cities
```{r plotting increasing with data frame in R}
#Creating dataframe for ggplot graph. We also set up some numbers to use during the graphing.
Statesinc <- c("Pennsylvania", "Pennsylvania", "Pennsylvania", "Pennsylvania", "Pennsylvania", "North Carolina", "North Carolina", "North Carolina", "North Carolina", "North Carolina", "Missouri", "Missouri", "Missouri", "Missouri", "Missouri", "Maryland","Maryland","Maryland","Maryland","Maryland", "Louisiana","Louisiana","Louisiana","Louisiana","Loisiana")
year<-c('2000','2005','2010','2015','2016','2000','2005','2010','2015','2016','2000',' 2005', '2010','2015', '2016','2000','2005','2010','2015','2016','2000','2005','2010','2015','2016')
murder<-c(602, 734, 646,651, 655, 560,566,445,506,595,347,401,419,499,535,430,551,424,372,430,560,399,437,474,543)
(FBIinc <- data.frame(Statesinc, year, murder))
ggplot(FBIinc) +
geom_line(aes(x = year, y = murder, group= Statesinc, color = Statesinc), size = 1.5) +
labs(x = "Year", y = "Number of Murders", title = "States Whose Cities Had the Largest Increase in Murder in Mid 2017", caption = "Source: https://ucr.fbi.gov/crime-in-the-u.s/")
```
```{r plotting increasing with the csv file}
FBI2000Graph <- read_csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2000Graph.csv")
head(FBI2000Graph)
ggplot(FBI2000Graph) +
geom_line(aes(x = YEAR, y = MURDERS, group= STATE, color = STATE), size = 1.5) +
labs(x = "Year", y = "Number of Murders", title = "States Whose Cities Had the Largest Increase in Murder in Mid 2017", caption = "Source: https://ucr.fbi.gov/crime-in-the-u.s/")
```
```{r plotting decreasing states with data frame in R}
#Creating dataframe for ggplot graph. We also set up some numbers to use during the graphing.
Statesdec<-c("Georgia", "Georgia", "Georgia", "Georgia", "Georgia", "Kansas", "Kansas", "Kansas", "Kansas", "Kansas", "New Jersey", "New Jersey", "New Jersey", "New Jersey", "New Jersey", "New York","New York","New York","New York","New York", "Texas","Texas","Texas","Texas","Texas")
year<-c('2000','2005','2010','2015','2016','2000','2005','2010','2015','2016','2000',' 2005', '2010','2015', '2016','2000','2005','2010','2015','2016','2000','2005','2010','2015','2016')
murder<-c(651,516,527,565,646,169,95,100,125,96,289,417,363,353,372,952,868,860,611,628,1238,1406,1246,1276,1459)
(FBIdec<-data.frame(Statesdec,year,murder))
ggplot(FBIdec) +
geom_line(aes(x = year, y = murder, group = Statesdec, color = Statesdec), size = 1.5) +
labs(x = "Year", y = "Number of Murders", title = "States Whose Cities Had the Largest Decrease in Murder in Mid 2017", caption = "Source: https://ucr.fbi.gov/crime-in-the-u.s/" )
```
```{r plotting decreasing states with data frame from the CSV file}
#This graphs the states based on cities which had the largest decreases.
FBI2000Graph2 <- read_csv("https://raw.githubusercontent.com/Data-Science-Team-Murder/FauDataSci4real/master/FBI2000Graph2.csv")
head(FBI2000Graph2)
ggplot(FBI2000Graph2) +
geom_line(aes(x = YEAR, y = MURDERS, group= STATE, color = STATE), size = 1.5) +
labs(x = "Year", y = "Number of Murders", title = "States Whose Cities Had the Largest Increase in Murder in Mid 2017", caption = "Source: https://ucr.fbi.gov/crime-in-the-u.s/")
```
```{r plotting state and cities}
#head(FBI2000Graph2)
head(murder15)
ggplot(murder15) +
geom_line(aes(x = state, y = murder15$x2014_murders, group= state, color = state), size = 1.5) +
labs(x = "Year", y = "Number of Murders", title = "States Whose Cities Had the Largest Increase in Murder in Mid 2017", caption = "Source: https://ucr.fbi.gov/crime-in-the-u.s/")
```
```{r truncating and plotting five-thirty-eight data}
#This graphs data from the cities in question based on data from the fivethirtyeight dataset.
#(murder15graph<-murder15[c(31,51,67,54,45,39,71,36,40,56),])
#library(reshape2)
?melt
#xymelt <- melt(murder15$X2014_murders, murder15$x2015_murders, id.vars = "murder2015$x2014_murders")
#xymelt
library(ggplot2)
ggplot(xymelt, aes(x = a, y = value, color = variable)) +
theme_bw() +
geom_line()
ggplot(xymelt, aes(x = a, y = value)) +
theme_bw() +
geom_line() +
facet_wrap(~ variable)
#ggplot(murder15graph) +
# geom_line(aes(x = city, y = murder, group= city, color = city), size = 1.5) +
# labs(x = "Year", y = "Number of Murders", title = "Cities With Largest Change in Murder Mid 2017", caption = "Source: fivethirtyeight.com")
```
## Analysis
## Conclusion
Over the course of this project we came accross many, many challenges, but we overcame in the end. To begin we had to figure out the proper method of importing our datasets and cleaning the data. By reviewing our notes and homework we were able to integrate the data correctly into our document. Finding corroborating data to elaborate on our initial set was one of the largest early challenges. Through careful searching we eventually uncovered the crime reports over decades from the FBI website. Utilizing drop box was a complicating matter for our group. Eventually we decided to only work on the r markdown file on one computer at a time. Oh, also back slashes were a problem. Ggplot gave us some trouble but thanks to the help we received we triumphed in the end! It took us ours of effort to produce an acceptable quality graph, but unfortunately after all that work our dates were disrupted due to unknown errors. We corrected this by using external programs to reorganize our tables for ggplot. It was a kludge, but it got the results we need. Overall we learned a lot about managing data, and delivering results in a timely fashion, all of which might be of great help in the future.
This project really gave our team a chance not only to sharpen our data science skillset, but also enabled us to practice our ability cooperate as a team to overcome unexpected challenges we faced. We learned proper means of work distribution, how to ask for help, and that they had free food in the burrow last night which we ate to celebrate our team's accomplishments. The cake was ok.
In conclusion, the essential value of R data science analysis is its capibilities of allowing the examination of diverse values and variables, as well as its capacity to facilitate constructive group exercise. Because of these virtues, R analyses such as this will have continued value as the field of data science develops.
## Sources
1. https://fivethirtyeight.com/features/a-handful-of-cities-are-driving-2016s-rise-in-murders/
2. https://fivethirtyeight.com/features/murder-is-up-again-in-2017-but-not-as-much-as-last-year/
3. https://ucr.fbi.gov/crime-in-the-u.s/
4. Github account; https://github.com/Data-Science-Team-Murder/FauDataSci4real
Special thanks to Reilly and Stefán for their help!!
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)
4.