---
title: "Artificial Neural Network - Classification of Stars and Simple Stellar Populations applied to PASTA data set"
author: Morgan Camargo, Paula Coelho, Alessandro Ederoclite
output:
html_notebook:
toc: true
---
# The first ANN implementation
An ANN is structured in layers of nodes, such that every node in the $i$-th layer connects to every node in the $(i+1)$-th layer, $i \in \left\{1,...,N-1\right\}$ ($N$ being the number of layers). Each node carries a number, which is an input signal (in the first layer), an activation (in the hidden layers), or a probability (in the output layer). A schematic of our ANN is shown below.
![](ann.png){width=80%}
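Concretely, each node in layer $i+1$ applies an activation function $f$ to a weighted sum of the outputs of layer $i$ plus a bias (this matches the forward propagation code further below):

$$a^{(i+1)}_j = f\!\left(\sum_k W_{kj}\, a^{(i)}_k + b_j\right),$$

where the weights $W$ and biases $b$ are the parameters learned during training.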
This is an ANN classification algorithm written in R. First we implement the
functions that compose the network.
Activation functions convert the input signal of a node into an output
signal that can be used as input to the next layer. Here we use the
sigmoid and softmax functions.
```{r}
# sigmoid activation function: f(x) = 1/(1 + exp(-x));
# with d = TRUE it returns the derivative f'(x) = f(x)*(1 - f(x)),
# assuming x is already a sigmoid output (as used in back_propagation below)
sigmoid = function(x, d = FALSE) {
  if (d == FALSE) {return(1.0/(1 + exp(-x)))}
  else {return(x*(1.0 - x))} }
# softmax activation function, applied row-wise: f(x)_i = exp(x_i)/sum_j exp(x_j)
softmax = function(A) {return(exp(A) / rowSums(exp(A)))}
```
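As a quick sanity check (illustrative values only), the sigmoid maps 0 to 0.5 and every softmax row sums to 1:
```{r}
sigmoid(0)                                         # 0.5
sigmoid(0.5, d = TRUE)                             # 0.25, treating 0.5 as a sigmoid output
rowSums(softmax(matrix(c(1, 2, 3, 4), nrow = 2)))  # 1 1
```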
The next function loads the dataset and splits it into the three samples
we need: training (60%), validation (20%) and test (20%).
```{r}
load_dataset = function(vals, tests) {
dpath = "./catalog/xdata+ssp.txt"
dataset = read.csv(dpath, sep = "")
colours = c('FUV-r', 'NUV-r', 'u-r', 'J378-r', 'J395-r', 'J410-r', 'J430-r', 'g-r', 'J515-r', 'bp-r', 'r-J660', 'r-G','r-i','r-rp', 'r-J861', 'r-z', 'r-J', 'r-H', 'r-Ks', 'r-W1', 'r-W2', 'r-W3', 'r-W4')
sample_percentage = 0.1 # fraction of the final sample that is SSPs
n_ssp = 531 # how many GC SSPs were imported from the data
n_star = as.integer(n_ssp/sample_percentage) - n_ssp # the rest of the sample is randomly drawn stars
colnames(dataset) = c('label', colours)
# sampling the data
ssp = dataset[dataset$label == 1, ]
star = dataset[sample(nrow(dataset[dataset$label == 2, ]), size = n_star),]
dataset = rbind(ssp,star)
dataset = dataset[sample(nrow(dataset)),] # shuffle the rows randomly
X = matrix(unlist(dataset), ncol = 24)
X = X[,2:24]
Y = dataset$label
vals = as.integer(vals*dim(dataset)[1])
tests = as.integer(tests*dim(dataset)[1])
# splitting
X_validation = X[1:vals,]
Y_validation = Y[1:vals]
X_test = X[(vals + 1):(vals + tests),] # parentheses matter: vals + 1:vals + tests would select the wrong rows
Y_test = Y[(vals + 1):(vals + tests)]
X_train = X[(vals + tests + 1):dim(dataset)[1],]
Y_train = Y[(vals + tests + 1):dim(dataset)[1]]
samples = list("X_train" = X_train,
"Y_train" = Y_train,
"X_validation" = X_validation,
"Y_validation" = Y_validation,
"X_test" = X_test,
"Y_test" = Y_test)
return(samples) }
```
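A toy check of the split indexing (illustrative numbers, not the real data). The parenthesized form `(vals + 1):(vals + tests)` matters: in R, the unparenthesized `vals + 1:vals + tests` parses as `vals + (1:vals) + tests`, which would select rows overlapping the training sample.
```{r}
n_toy = 100
v = as.integer(0.2 * n_toy) # validation size: 20
u = as.integer(0.2 * n_toy) # test size: 20
idx_val = 1:v # rows 1..20 -> validation
idx_test = (v + 1):(v + u) # rows 21..40 -> test
idx_train = (v + u + 1):n_toy # rows 41..100 -> training
length(intersect(idx_test, idx_train)) # 0: the samples do not overlap
```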
The dataset is composed of 23 colour features for SSPs (rows 1 to
636) and stars (the remainder). It also has a column of labels indicating
the class: 1 for SSP and 2 for star.
Given the trained model, this routine classifies objects based on the
matrix of probabilities, which holds, for each object, the probability of
it being an SSP or a star. The predicted class is the one with the
higher probability.
```{r}
# note: this masks the base R generic predict(); that is fine inside this notebook
predict = function(model, X) {
  fp = forward_propagation(model, X)
  # classes are labelled 1 and 2, so the index of the row-wise maximum
  # probability is already the predicted class label
  return(apply(fp$probs, 1, which.max)) }
```
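For illustration (toy probabilities, not model output), `which.max` picks the column, i.e. the class, with the higher probability in each row:
```{r}
probs_toy = matrix(c(0.9, 0.1,
                     0.3, 0.7), ncol = 2, byrow = TRUE)
apply(probs_toy, 1, which.max) # 1 2
```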
This routine initializes the Ws (weight matrices) and bias arrays and trains
the network on the training sample, checking its performance on the
validation dataset.
```{r}
build_train_model = function(ann_model) {
# initialize the weights (random values, scaled by 1/sqrt(fan-in)) and biases (= 0)
W1 = matrix(rnorm(ann_model$n_input_dim * ann_model$n_hlayer), ann_model$n_input_dim, ann_model$n_hlayer) / sqrt(ann_model$n_input_dim)
b1 = matrix(0L,1,ann_model$n_hlayer)
W2 = matrix(rnorm(ann_model$n_hlayer * ann_model$n_output_dim), ann_model$n_hlayer, ann_model$n_output_dim) / sqrt(ann_model$n_hlayer)
b2 = matrix(0L,1,ann_model$n_output_dim)
# define the model, which contains the Ws and biases
model = list("W1" = W1, "b1" = b1, "W2" = W2, "b2" = b2)
# loop over the n_passes
for(i in 1:ann_model$n_passes) {
# forward propagation
fp = forward_propagation(model, ann_model$X_train)
probs = fp$probs
a2 = fp$a2
# backpropagation
model = back_propagation(ann_model, probs, a2, model)
p = predict(model, ann_model$X_validation)
s = score(p, ann_model$Y_validation)
if(i%%50 == 0) {print(sprintf("Score after iteration %i: %f", i, s))}
}
return(model) }
```
Let a1 be the array of input features. The forward propagation function gives
us the output-layer signals, which are the classification probabilities.
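In matrix form, the forward pass below computes

$$z_1 = a_1 W_1 + b_1, \qquad a_2 = \mathrm{sigmoid}(z_1), \qquad z_2 = a_2 W_2 + b_2, \qquad \mathrm{probs} = \mathrm{softmax}(z_2),$$

where each bias row vector is added to every row of the corresponding matrix.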
```{r}
forward_propagation = function(model, X){
  W1 = model$W1
  b1 = model$b1
  W2 = model$W2
  b2 = model$b2
  a1 = X
  z1 = sweep(a1 %*% W1, 2, as.vector(b1), "+") # add the bias b1 to every row
  a2 = sigmoid(z1) # hidden layer activation function: sigmoid
  z2 = sweep(a2 %*% W2, 2, as.vector(b2), "+") # add the bias b2 to every row
  probs = softmax(z2) # output layer activation function: softmax
  return(list("probs" = probs, "a2" = a2)) }
```
Then we have the backpropagation function, with the gradient descent
algorithm. It updates the parameter values in the direction that decreases the
loss function, based on its derivatives with respect to the weights and
biases (chain rule).
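As a short sketch of the gradients the code implements (assuming the standard cross-entropy loss, which the text does not name explicitly): for softmax outputs $p$ and a one-hot target $y$,

$$\frac{\partial L}{\partial z_2} = p - y,$$

which is the `error[i, y_i] - 1` step below. The chain rule then gives

$$\delta_1 = (p - y)\,W_2^{\top} \odot a_2(1 - a_2), \qquad \nabla_{W_2} L = a_2^{\top}(p - y), \qquad \nabla_{W_1} L = a_1^{\top}\,\delta_1,$$

with the bias gradients obtained by summing the corresponding deltas over the sample.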
```{r}
back_propagation = function(ann_model, probs, a2, model) {
# loading model
W1 = model$W1
b1 = model$b1
W2 = model$W2
b2 = model$b2
# backpropagating
error = probs
for (i in 1:ann_model$n_train){
error[i, ann_model$Y_train[i]] = error[i, ann_model$Y_train[i]] - 1 } # cross-entropy derivative: p - y (uses ann_model$Y_train, not the global Y_train)
delta1 = error %*% t(W2) * sigmoid(a2, d = TRUE)
#################################
# An intermittent issue has been observed in the next operations: colSums and
# the matrix multiplications sometimes return tables of only NA values, even
# though the input values (a2, error, delta1, X_train) all seem fine. A
# plausible cause is numerical overflow in exp() inside softmax, which turns
# probs into NaN; see the numerically stable variant after this block.
#################################
# weights
dW2 = t(a2) %*% error
dW1 = t(ann_model$X_train) %*% delta1
# bias
db2 = colSums(error)
db1 = colSums(delta1)
#################################
# add regularization terms (the biases b1 and b2 are not regularized)
dW2 = dW2 + ann_model$reg_lambda * W2
dW1 = dW1 + ann_model$reg_lambda * W1
# update parameter (gradient descent)
W1 = W1 - ann_model$epsilon * dW1
b1 = b1 - ann_model$epsilon * db1
W2 = W2 - ann_model$epsilon * dW2
b2 = b2 - ann_model$epsilon * db2
# update parameters to the model
model = list("W1" = W1, "b1" = b1, "W2" = W2, "b2" = b2)
return(model) }
```
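The intermittent NA issue noted in the comments above is consistent with numerical overflow: once any entry of $z_2$ grows large, `exp()` returns `Inf`, the softmax rows become `NaN`, and the `NaN`s propagate through every subsequent product and `colSums`. A common remedy (a sketch, not part of the original code) is the max-subtraction trick, which leaves the result unchanged since $\mathrm{softmax}(z) = \mathrm{softmax}(z - c)$ for any per-row constant $c$:
```{r}
# numerically stable softmax: subtracting the row-wise maximum before exp()
# leaves the result unchanged but keeps exp() from overflowing to Inf
softmax_stable = function(A) {
  A = sweep(A, 1, apply(A, 1, max), "-")
  return(exp(A) / rowSums(exp(A))) }
rowSums(softmax_stable(matrix(c(1000, 0, 0, 1000), nrow = 2))) # 1 1 (plain softmax gives NaN here)
```
Swapping this in for `softmax` inside `forward_propagation` would be a minimal way to test this hypothesis.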
The score function calculates the rate of correctly classified objects.
```{r}
# fraction of objects whose predicted class matches the true label
score = function(class_out, Y) {return(mean(class_out == Y))}
```
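For example, with toy labels where three of the four predictions match the truth:
```{r}
score(c(1, 2, 1, 1), c(1, 2, 2, 1)) # 0.75
```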
Now we run the network. At the end, we compute the final score, which shows
the network's performance on data it has never seen: the test sample.
```{r}
vals = .2 # validation split
tests = .2 # test split
samples = load_dataset(vals,tests) # loading dataset
# training sample
X_train = samples$X_train
Y_train = samples$Y_train
n_train = nrow(X_train)
# validation sample
X_validation = samples$X_validation
Y_validation = samples$Y_validation
n_validation = nrow(X_validation)
# test sample
X_test = samples$X_test
Y_test = samples$Y_test
n_test = nrow(X_test)
# ANN parameters
epsilon = 0.001 # learning rate
reg_lambda = 0.00 # regularization strength
n_hlayer = 10 # number of hidden-layer nodes
n_input_dim = ncol(X_train)
n_passes = 1000 # number of training iterations
n_output_dim = 2 # output dimension (two classes)
ann_model = list("X_train" = X_train,
"Y_train" = Y_train,
"X_validation" = X_validation,
"Y_validation" = Y_validation,
"X_test" = X_test,
"Y_test" = Y_test,
"n_train" = n_train,
"n_validation" = n_validation,
"n_test" = n_test,
"epsilon" = epsilon,
"reg_lambda" = reg_lambda,
"n_hlayer" = n_hlayer,
"n_input_dim" = n_input_dim,
"n_passes" = n_passes,
"n_output_dim" = n_output_dim)
model = build_train_model(ann_model) # building and training ANN model
```
```{r}
score_final = score(predict(model, X_test), Y_test)
print(sprintf("Final Score: %f", score_final))
```
# Motivation for this work
Globular clusters (GCs) are important for our understanding of galaxy evolution. The study of such systems can benefit from current and future surveys (S-PLUS, J-PLUS, J-PAS), to which the Brazilian community has access and which employ innovative photometric systems with broad-band and narrow-band filters. However, separating extra-galactic GCs from faint stars in these images is quite challenging. Recently our group developed a pipeline to detect GC candidates in J-PLUS images (`GCFinder`), but the list of candidates still shows a high fraction of false positives: stars that the pipeline lists as GC candidates. This problem motivated the present project, in which we study the viability of using supervised machine learning to separate faint stars from extra-galactic GCs, exploiting the multitude of colours available in these surveys. To this end we implemented the artificial neural network presented here, trained on simulated star and GC data.
# Taking a look at our data
```{r message=FALSE, warning=FALSE}
# Loading packages
require(tidyverse)
```
Our data is stored in the file `xdata+ssp.txt`.
```{r}
datafile <- "./catalog/xdata+ssp.txt"
datasample <- read.csv(datafile, sep="")
colours = c('FUV-r', 'NUV-r', 'u-r', 'J378-r', 'J395-r', 'J410-r', 'J430-r', 'g-r', 'J515-r', 'bp-r', 'r-J660', 'r-G','r-i','r-rp', 'r-J861', 'r-z', 'r-J', 'r-H', 'r-Ks', 'r-W1', 'r-W2', 'r-W3', 'r-W4')
colnames(datasample) <- c('class',colours)
datasample$class = factor(datasample$class)
```
Let's take a look at the data:
```{r}
glimpse(datasample)
datasample %>%
ggplot(aes(x = `g-r`, fill = class)) +
geom_density(alpha = 0.5)
datasample %>%
ggplot(aes(x = `g-r`, y = `u-r`)) +
geom_point(aes(color = class), alpha = 0.5)
datasample %>%
ggplot(aes(x = `g-r`, y = `r-i`)) +
geom_point(aes(color = class), alpha = 0.5)
```