---
title: "101 Combine with legacy data"
output:
  html_document:
    toc: true
    toc_float: true
editor_options:
  chunk_output_type: inline
---
**Combine legacy data (data until the year before last year) with new data (downloaded in script 100)**
**NOTE:** Before running this script, you must
* run script 802 on your own PC (downloads latest data from Nivabasen)
* this will create a file in the folder 'Files_to_Jupyterhub_2021' (if 2021 was the last year) named something like
- '01_df_2021_notstandard_2022-06-02.rds' (the last part is the date of creation)
* copy the resulting file to Jupyterhub, folder 'Input_data'
**NOTE: Check the results of '12 Data changes since last year'**
**Overview of this script**
1. Load libraries and functions which will be used
2. Data
- Last year's data (produced by script 802 on your PC)
- 'Legacy data', i.e. the data we used last year
3. Reformat last year's data so they conform with the legacy data
- Includes changing parameter names *
4. Pick last year's data
- If year needs to be fixed, do it here
- Also add data read from Excel sheets: NILU data and cod biol. effects
5. Fix units
6. Add parameter sums for PCBs, BDEs etc.
7. Add columns for dry weight and fat percentage (drawn from the data itself)
* NOTE: When changing parameter names, remember that parameter names are found in different places:
- data_legacy - see part 3-d2 in this script (Triphenyl -> TPhT)
- PROREF values - found in "Proref_report_2017.xlsx" and "Proref_paper.xlsx" (both in Input_data)
- Possibly EQS - found in "Input_data/EQS_limits.xlsx"
- Code in a few places (only script 210?)
* The data is checked for duplicates repeatedly, in the following sections:
2a7, 3b3, 3g, 7, 8c, 10d, 11a
- Errors often show up in these checks
- These checks can be sped up using data.table (https://stackoverflow.com/a/7450633),
or possibly dtplyr::lazy_dt
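As a minimal sketch of the data.table speedup suggested above (the key column names are assumed to match the checks further down; the demo data are made up):

```r
library(data.table)

# Sketch: duplicate check via data.table, counting rows per key combination
find_duplicates <- function(df, keys = c("STATION_CODE", "LATIN_NAME", "TISSUE_NAME",
                                         "MYEAR", "SAMPLE_NO2", "PARAM")) {
  dt <- as.data.table(df)
  dt[, n := .N, by = keys]   # add a count of rows sharing all key columns
  dt[n > 1]                  # keep only rows that occur more than once
}

# Made-up demonstration data: the two SAMPLE_NO2 = 1 rows are duplicates
demo <- data.frame(STATION_CODE = "30B", LATIN_NAME = "Gadus morhua",
                   TISSUE_NAME = "Lever", MYEAR = 2022,
                   SAMPLE_NO2 = c(1, 1, 2), PARAM = "HG")
nrow(find_duplicates(demo))
```

The grouped `.N` count replaces the `add_count()` + `filter(n > 1)` pattern used in the checks below, and is typically much faster on large tables.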
## 1. Load libraries and functions
```{r, results='hide', message=FALSE, warning=FALSE}
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(readxl)
library(readr)
# Load self-made functions
source("002_Utility_functions.R")
source("101_Combine_with_legacy_data_functions.R")
```
### Set year
Note: there are still some hard-coded "2019" values in the code
```{r}
lastyear <- 2022
```
## 2. Data
### a1. Recently downloaded data
- Read and reformat data (by default, the most recent dataset made)
- In contrast to the 2019 version, we use only data from NIVAbasen (2c in the 2019 version)
- The file named '01_df_2019_notstandard_<date>' was made on DHJ's PC using script 01 in the project 'Milkys2_pc'
```{r, results='hold'}
filepattern <- paste0("01_df_", lastyear, "_notstandard_") # file name except date and extension
filepattern_with_extension <- paste0(filepattern, ".+.rds")
filenumber <- 1 # filenumber = 1 means "read the newest file"
# Get available files, sorted from newest to oldest
files <- dir("Input_data", pattern = filepattern_with_extension) %>% rev()
if (length(files) == 0){
stop("No files found for year ", lastyear)
} else {
# Info for user
cat("Reading file number ", filenumber, ", sorted from newest to oldest files:", sep = "")
cat("\n", files[filenumber])
cat("\n\n")
cat("If you want to read a different file, replace 'filenumber <- 1' or 'filename' with the file you want")
cat("\n")
cat("For instance, set 'filenumber <- 2' to read the second newest file")
cat("\n")
# Get filename and its date part
filename <- files[filenumber]
file_date <- substr(filename, nchar(filepattern) + 1, nchar(filepattern) + 10) # pick date part
# The date part of 'filename' (e.g., '2020-04-23')
# will be used in part 10, when we save the resulting file
dat_new1 <- readRDS(paste0("Input_data/", filename))
}
cat("\n")
# If you want to remove VALUE = NA, change FALSE to TRUE in the next line
if (FALSE){
n1 <- nrow(dat_new1)
dat_new1 <- dat_new1 %>%
filter(!is.na(VALUE))
n2 <- nrow(dat_new1)
message("\n", n1-n2, " rows with no value of VALUE were removed")
} else {
warning("There are ", sum(is.na(dat_new1$VALUE)), " rows with VALUE = NA. These have NOT been removed")
}
```
### a2. Check Fat and dry weight
Seems ok
```{r}
dat_new1 %>%
filter(NAME %in% c("Fettinnhold", "Tørrstoff %")) %>%
xtabs(~year(SAMPLE_DATE) + NAME, .)
```
### a3. Check of TBT
* TBT given as 'Tributyltinn (TBT)' (ion weight) and 'Tributyltinn (TBT)-Sn' (tin weight)
**Explanation**
TBT is given by two measurements:
- ion weight of TBT, called:
- 'TBT' in Access, 'Tributyltinn (TBT)' in Nivabasen, 'TBSN+' in ICES (current standard code)
- atom weight of tin in TBT, called:
- 'TBTIN' in Access, 'Tributyltinn (TBT)-Sn' in Nivabasen, 'TBTIN' in ICES (marked as 'legacy code')
- [ion weight] = 2.44*[atom weight]
As reference to the ICES codes, see
- https://vocab.ices.dk/?CodeID=33697
- http://vocab.ices.dk/?CodeID=78150
- See http://vocab.ices.dk/?ref=37 for vocabulary for DOME (version 3.2 Biota), record 10,parameter 'PARAM'
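The ion/tin-weight relation above can be written as a small helper (a sketch; 2.44 is the factor stated above, i.e. the molecular weight of the TBT cation divided by the atomic weight of Sn):

```r
# Convert a TBT concentration given as tin (Sn) weight to ion weight,
# using [ion weight] = 2.44*[atom weight] as stated above
tbt_sn_to_ion <- function(value_sn, factor = 2.44) {
  value_sn * factor
}

tbt_sn_to_ion(0.77)   # 0.77 (as Sn) corresponds to about 1.88 (as TBT ion)
```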
```{r}
dat_new1 %>%
filter(grepl("Tributyltinn", NAME)) %>%
group_by(NAME, UNIT, TISSUE_NAME) %>%
summarise(Mean_value = mean(VALUE), N = n(), .groups = "drop")
```
### a4. SPECIAL CASE FOR 2020 (hopefully): adjusted metabolites have been mislabeled
- We could rename them; for the 2020 data we just remove them
```{r}
#
# SPECIAL CASE FOR 2020 (hopefully): adjusted metabolites have been mislabeled
#
sel <- dat_new1$NAME %in% c("PA1OH", "PYR1OH", "BAP3OH")
if (sum(sel) == 0){
message("No mislabeled adjusted metabolites")
} else {
  stop("There seem to be some mislabeled adjusted metabolites. Consider deleting them.")
}
if (lastyear == 2020){
dat_new1 <- dat_new1[!sel,]
cat("dat_new1:", sum(sel), "records with mislabeled adjusted PAHs removed \n")
}
```
### a5. Check PAH metabolites in cod bile
- Unadjusted metabolites: "1-OH-fenantren", "1-OH-pyren", "3-OH-benzo[a]pyren" (used in 2020) are the same as PA1OH, PYR1OH, BAP3OH
- Adjusted metabolites (unadjusted divided by ABS 380): PA1O, PYR1O, BAP3O
```{r}
params <- c("PA1OH", "PYR1OH", "BAP3OH",
"PA1O", "PYR1O", "BAP3O",
"1-OH-fenantren", "1-OH-pyren", "3-OH-benzo[a]pyren")
#
# Check whether we have synonymous parameters for the same station
#
dat_check <- dat_new1 %>%
filter(NAME %in% params) %>%
mutate(NAME = factor(NAME, levels = params))
# View(dat_check)
tab <- xtabs(~NAME + STATION_CODE, dat_check)
tab1a <- tab[rownames(tab) %in% c("PYR1OH", "1-OH-pyren"),]
tab1b <- apply(tab1a > 0, 2, sum)
tab2a <- tab[rownames(tab) %in% c("PA1OH", "1-OH-fenantren"),]
tab2b <- apply(tab2a > 0, 2, sum)
tab3a <- tab[rownames(tab) %in% c("BAP3OH", "3-OH-benzo[a]pyren"),]
tab3b <- apply(tab3a > 0, 2, sum)
if (any(tab1b > 1) | any(tab2b > 1) | any(tab3b > 1)){
  stop("There are synonymous parameters for the same station! (section 2-a5) \nEach station should have either PYR1OH or 1-OH-pyren, not both. Same for the other pairs. \n\nMost likely explanation: adjusted 1-OH-pyren (PYR1O) has been mislabeled PYR1OH. \n\nCheck tab1a, tab2a, and tab3a. (Section 2-a5)")
}
dat_plot <- dat_new1 %>%
filter(NAME %in% params & !is.na(VALUE))
if (nrow(dat_plot) > 0){
ggplot(dat_plot, aes(x = VALUE)) +
geom_histogram() +
facet_wrap(vars(NAME), scales = "free_x")
# separate for "3-OH-benzo[a]pyren"
param_pick <- "1-OH-pyren"
param_pick <- "1-OH-fenantren"
param_pick <- "3-OH-benzo[a]pyren"
ggplot(dat_plot %>% filter(NAME == param_pick), # %>% View(),
aes(x = VALUE)) +
geom_histogram(bins = 30) +
facet_wrap(vars(FLAG1), nrow = 1) +
ggtitle(param_pick)
}
```
### a6. Remove adjusted PAH metabolites (PYR1O, PA1O, BAP3O)
```{r}
sel <- dat_new1$NAME %in% c("PA1O", "PYR1O", "BAP3O")
dat_new1 <- dat_new1[!sel,]
cat("dat_new1:", sum(sel), "records with adjusted PAHs removed\n")
```
### a7. Check for duplicates
```{r}
# Note NAME instead of PARAM (in contrast to the checks further down)
df_duplicates <- dat_new1 %>%
add_count(STATION_CODE, LATIN_NAME, TISSUE_NAME, SAMPLE_DATE, SAMPLE_NO, NAME) %>%
filter(n > 1)
if (nrow(df_duplicates) > 0){
xtabs(~NAME, df_duplicates) %>% print()
xtabs(~STATION_CODE, df_duplicates) %>% print()
stop("Duplicates in the data! Check 'df_duplicates'. (section 2-a7) \n")
} else {
cat("No duplicates found in the data. \n")
}
```
### a8. For eider duck, change names of CCP
* NO EIDER DUCK THIS YEAR
* Use "eksl. LOQ", which is most similar to the cod/blue mussel usage
```{r}
sel <- with(dat_new1, STATION_CODE %in% "19N" & grepl("SCCP", NAME))
# View(dat_new1[sel,])
dat_new1$NAME[sel] <- "SCCP eksl. LOQ"
sel <- with(dat_new1, STATION_CODE %in% "19N" & grepl("MCCP", NAME))
# View(dat_new1[sel,])
dat_new1$NAME[sel] <- "MCCP eksl. LOQ"
```
### a9. Change SCCP and MCCP values
- Have already added missing rows in Milkys2_pc script 802
- Change 'SCCP eksl. LOQ' by setting VALUE = 0 and FLAG = NA for data < LOQ
- This is best for getting medians
- Add new parameter 'SCCP' which is just the same as the old 'SCCP eksl. LOQ'
- I.e.: For less-than data, LOQ in VALUE column and FLAG1 = '<'
- This is intended to be used for trends mainly
- Same for MCCP
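A toy illustration (made-up values) of why the two conventions above differ for medians:

```r
# Five made-up measurements; the last three are below LOQ = 2
vals  <- c(5, 8, 2, 2, 2)
flags <- c(NA, NA, "<", "<", "<")

# 'eksl. LOQ' convention: <LOQ values set to 0 (best for medians)
median(ifelse(flags %in% "<", 0, vals))

# 'SCCP'/'MCCP' convention: LOQ kept in VALUE with FLAG1 = '<' (used for trends)
median(vals)
```

With a majority of values below LOQ, the first convention gives a median of 0, while the second gives the LOQ itself.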
```{r}
#
# SCCPs
#
cat("SCCP: \n\n")
# Check
#### ELU: I don't understand this check...
check <- dat_new1 %>%
filter(NAME %in% c("SCCP eksl. LOQ", "SCCP inkl. LOQ"),
STATION_CODE != "19N") %>% # View("SCCP") # 19N doesn't have 'inkl LOQ'
xtabs(~NAME + STATION_CODE, .)
check #ELU
#if (!identical(check[1,], check[2,])){ #ELU
# stop("Row 1 and row 2 should be identical!") #ELU
#} #ELU
# Add rows with new parameter 'SCCP'
dat_to_add_SCCP <- dat_new1 %>%
filter(NAME %in% "SCCP eksl. LOQ")
dat_to_add_SCCP$NAME <- "SCCP"
if (sum(dat_new1$NAME %in% "SCCP") == 0)
dat_new1 <- bind_rows(dat_new1, dat_to_add_SCCP)
cat(nrow(dat_to_add_SCCP), "rows added to the data, NAME = SCCP")
# Change existing 'SCCP eksl. LOQ'
sel <- with(dat_new1, NAME %in% "SCCP eksl. LOQ" & FLAG1 %in% "<")
dat_new1$VALUE[sel] <- 0
dat_new1$FLAG1[sel] <- as.character(NA)
cat(sum(sel), "'SCCP eksl. LOQ' rows: VALUE changed to zero \n")
#
# MCCPs
#
cat("\n\n\nMCCP: \n\n")
#
# MCCPs
# ELU:
check <- dat_new1 %>%
filter(NAME %in% c("MCCP eksl. LOQ", "MCCP inkl. LOQ"),
STATION_CODE != "19N") %>% # 19N doesn't have 'inkl LOQ'
xtabs(~NAME + STATION_CODE, .)
check
#if (!identical(check[1,], check[2,])){
# stop("Row 1 and row 2 should be identical!")
#}
# Add rows with new parameter 'MCCP'
dat_to_add_MCCP <- dat_new1 %>%
filter(NAME %in% "MCCP eksl. LOQ")
dat_to_add_MCCP$NAME <- "MCCP"
if (sum(dat_new1$NAME %in% "MCCP") == 0)
dat_new1 <- bind_rows(dat_new1, dat_to_add_MCCP)
cat(nrow(dat_to_add_MCCP), "rows added to the data, NAME = MCCP")
# Change existing 'MCCP eksl. LOQ'
sel <- with(dat_new1, NAME %in% "MCCP eksl. LOQ" & FLAG1 %in% "<")
dat_new1$VALUE[sel] <- 0
dat_new1$FLAG1[sel] <- as.character(NA)
cat(sum(sel), "'MCCP eksl. LOQ' rows: VALUE changed to zero \n")
```
### b1. Read legacy data
The data go up to 2017 and combines data from the Access database (up to 2015) and NIVAbasen (2016-17).
```{r}
# Files
files <- list_files("Data", pattern = "101_data_updated")
files
# HARD_CODED: pick most recent data from last year
data_legacy <- readRDS("Data/101_data_updated_2022-09-23.rds") %>%
mutate(PARAM = case_when(
PARAM %in% "TTBTIN" ~ "TTBT",
TRUE ~ PARAM)
)
cat("\n")
cat("Legacy data covers the years", min(data_legacy$MYEAR), "-", max(data_legacy$MYEAR), "\n")
check <- lastyear - max(data_legacy$MYEAR)
if (check <= 0){
stop("Some of the legacy data are from 'lastyear' and thus overlap the new data! Pick an older file?")
} else if (check >= 2){
stop("There are missing years between legacy data and 'lastyear'! Pick a newer file?")
} else {
message("Legacy data years checked and found ok")
}
```
### b2. Check MCCP/SCCP in legacy data
#### SCCP
* Through 2014: Only 'SCCP' given
* In 2015-2017, equal amount of 'SCCP' and 'SCCP eksl. LOQ', only lacking 'eksl. LOQ' for eider
- See script 870 on Milkys2_pc: there was actually no lacking/zero data for 'SCCP eksl. LOQ' those years
* In 2018-2020, zero data for 'SCCP eksl. LOQ' probably have not entered the database
```{r}
#
# SCCP
#
df_SCCP_tall <- data_legacy %>%
filter(grepl("SCCP", PARAM)) %>%
select(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2, PARAM, VALUE_WW, FLAG1)
xtabs(~PARAM + MYEAR , df_SCCP_tall)
df_SCCP <- data_legacy %>%
filter(grepl("SCCP", PARAM)) %>%
select(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2, PARAM, VALUE_WW) %>%
tidyr::pivot_wider(names_from = PARAM, values_from = VALUE_WW)
# df_SCCP %>%
# filter(is.na(`SCCP eksl. LOQ`)) %>% View()
df_SCCP %>%
filter(is.na(`SCCP eksl. LOQ`)) %>%
xtabs(~STATION_CODE + MYEAR, .)
```
#### SCCP plot
- Add zeros for missing 'eksl. LOQ-rows', and plot for one station (2015-2020 only)
```{r}
if (FALSE){
xtabs(~is.na(SCCP) + is.na(`SCCP eksl. LOQ`) + MYEAR, df_SCCP %>% filter(MYEAR >= 2016))
xtabs(~MYEAR + is.na(`SCCP eksl. LOQ`), df_SCCP)
xtabs(~MYEAR + is.na(`SCCP eksl. LOQ`), df_SCCP)
}
# Add zeros and plot
station <- "30B"
df_SCCP %>%
mutate(`SCCP eksl. LOQ` = ifelse(is.na(`SCCP eksl. LOQ`), 0, `SCCP eksl. LOQ`)) %>%
filter(STATION_CODE %in% "30B" & MYEAR %in% 2015:2020) %>%
tidyr::pivot_longer(cols = c(SCCP, `SCCP eksl. LOQ`)) %>% # str()
ggplot(aes(value)) +
geom_histogram() +
facet_grid(rows = vars(MYEAR), cols = vars(name)) +
labs(title = station)
```
#### MCCP
* Same conclusions as for SCCP, see above
```{r}
#
# MCCP
#
df_MCCP_tall <- data_legacy %>%
filter(grepl("MCCP", PARAM)) %>%
select(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2, PARAM, VALUE_WW, FLAG1)
xtabs(~PARAM + MYEAR , df_MCCP_tall)
df_MCCP <- data_legacy %>%
filter(grepl("MCCP", PARAM)) %>%
select(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2, PARAM, VALUE_WW) %>%
tidyr::pivot_wider(names_from = PARAM, values_from = VALUE_WW)
# df_MCCP %>%
# filter(is.na(`MCCP eksl. LOQ`)) %>% View()
df_MCCP %>%
filter(is.na(`MCCP eksl. LOQ`)) %>%
xtabs(~STATION_CODE + MYEAR, .)
```
#### MCCP plot
- Add zeros for missing 'eksl. LOQ-rows', and plot for one station (2015-2020 only)
```{r}
if (FALSE){
xtabs(~is.na(MCCP) + is.na(`MCCP eksl. LOQ`) + MYEAR, df_MCCP %>% filter(MYEAR >= 2016))
xtabs(~MYEAR + is.na(`MCCP eksl. LOQ`), df_MCCP)
xtabs(~MYEAR + is.na(`MCCP eksl. LOQ`), df_MCCP)
}
# Add zeros and plot for one station (2015-2020 only)
station <- "53B"
df_MCCP %>%
mutate(`MCCP eksl. LOQ` = ifelse(is.na(`MCCP eksl. LOQ`), 0, `MCCP eksl. LOQ`)) %>%
filter(STATION_CODE %in% station & MYEAR %in% 2015:2020) %>%
tidyr::pivot_longer(cols = c(MCCP, `MCCP eksl. LOQ`)) %>% # str()
ggplot(aes(value)) +
geom_histogram() +
facet_grid(rows = vars(MYEAR), cols = vars(name)) +
labs(title = station)
```
#### Fix part 1, as done in Milkys2_pc script 802
* 2015-2017 - just change name of 'SCCP' and 'MCCP'
* 2018-2020 - add extra rows
```{r}
# data_legacy_back <- data_legacy
# data_legacy <- data_legacy_back
########
# SCCP
########
#
# For all data 2015:2020
#
sel <- with(data_legacy, PARAM %in% "SCCP" & MYEAR %in% 2015:2020); sum(sel)
data_legacy$PARAM[sel] <- "SCCP inkl. LOQ"
#
# 2015-2020: need to add data for the empty rows (almost only 2018-2020)
#
df_lacking_SCCP <- df_SCCP %>%
filter(MYEAR %in% 2015:2020 & is.na(`SCCP eksl. LOQ`)) %>%
select(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2) %>%
mutate(addSCCP = TRUE)
df_to_add_SCCP <- data_legacy %>%
filter(PARAM %in% "SCCP inkl. LOQ") %>%
left_join(df_lacking_SCCP, by = c("MYEAR", "STATION_CODE", "LATIN_NAME", "TISSUE_NAME", "SAMPLE_NO2")) %>%
filter(addSCCP)
df_to_add_SCCP$PARAM <- "SCCP eksl. LOQ"
df_to_add_SCCP$FLAG1 <- "<"
nrow(data_legacy)
data_legacy <- bind_rows(data_legacy, df_to_add_SCCP)
nrow(data_legacy)
########
# MCCP
########
#
# For all data 2015:2020 (almost only 2018-2020)
#
sel <- with(data_legacy, PARAM %in% "MCCP" & MYEAR %in% 2015:2020); sum(sel)
data_legacy$PARAM[sel] <- "MCCP inkl. LOQ"
#
# 2015-2020: need to add data for the empty rows
#
df_lacking_MCCP <- df_MCCP %>%
filter(MYEAR %in% 2015:2020 & is.na(`MCCP eksl. LOQ`)) %>%
select(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2) %>%
mutate(addMCCP = TRUE)
df_to_add_MCCP <- data_legacy %>%
filter(PARAM %in% "MCCP inkl. LOQ") %>%
left_join(df_lacking_MCCP, by = c("MYEAR", "STATION_CODE", "LATIN_NAME", "TISSUE_NAME", "SAMPLE_NO2")) %>%
filter(addMCCP)
df_to_add_MCCP$PARAM <- "MCCP eksl. LOQ"
df_to_add_MCCP$FLAG1 <- "<"
nrow(data_legacy)
data_legacy <- bind_rows(data_legacy, df_to_add_MCCP)
nrow(data_legacy)
```
#### Fix part 2, as above
- But only for 2015-2020 data
- Change 'SCCP eksl. LOQ' by setting VALUE = 0 and FLAG = NA for data < LOQ
- This is best for getting medians
- Add new parameter 'SCCP' which is just the same as the old 'SCCP eksl. LOQ'
- This is intended to be used for trends mainly
```{r}
#
# SCCPs
#
cat("SCCP: \n\n")
# Check
check <- data_legacy %>%
filter(PARAM %in% c("SCCP eksl. LOQ", "SCCP inkl. LOQ"),
STATION_CODE != "19N",
MYEAR %in% 2015:2020) %>% # View("SCCP") # 19N doesn't have 'inkl LOQ'
xtabs(~PARAM + STATION_CODE, .)
check2 <- check[1,] - check[2,]
if (sum(check2) != 0){
warning("Row 1 and row 2 should be identical!")
# In this case they are different for I023 in 2015, nothing to do about that
}
# Add rows with new parameter 'SCCP'
dat_to_add_SCCP <- data_legacy %>%
filter(PARAM %in% "SCCP eksl. LOQ")
dat_to_add_SCCP$PARAM <- "SCCP"
if (sum(data_legacy$PARAM %in% "SCCP") == 0)
data_legacy <- bind_rows(data_legacy, dat_to_add_SCCP)
cat(nrow(dat_to_add_SCCP), "rows added to the data, PARAM = SCCP")
# Change existing 'SCCP eksl. LOQ'
sel <- with(data_legacy, PARAM %in% "SCCP eksl. LOQ" & FLAG1 %in% "<")
data_legacy$VALUE[sel] <- 0
data_legacy$FLAG1[sel] <- as.character(NA)
cat(sum(sel), "'SCCP eksl. LOQ' rows: VALUE changed to zero \n")
#
# MCCPs
#
cat("\n\n\nMCCP: \n\n")
#
# MCCPs
check <- data_legacy %>%
filter(PARAM %in% c("MCCP eksl. LOQ", "MCCP inkl. LOQ"),
STATION_CODE != "19N",
MYEAR %in% 2015:2020) %>% # 19N doesn't have 'inkl LOQ'
xtabs(~PARAM + STATION_CODE, .)
if (!identical(check[1,], check[2,])){
stop("Row 1 and row 2 should be identical!")
}
# Add rows with new parameter 'MCCP'
dat_to_add_MCCP <- data_legacy %>%
filter(PARAM %in% "MCCP eksl. LOQ")
dat_to_add_MCCP$PARAM <- "MCCP"
if (sum(data_legacy$PARAM %in% "MCCP") == 0)
data_legacy <- bind_rows(data_legacy, dat_to_add_MCCP)
cat(nrow(dat_to_add_MCCP), "rows added to the data, PARAM = MCCP")
# Change existing 'MCCP eksl. LOQ'
sel <- with(data_legacy, PARAM %in% "MCCP eksl. LOQ" & FLAG1 %in% "<")
data_legacy$VALUE[sel] <- 0
data_legacy$FLAG1[sel] <- as.character(NA)
cat(sum(sel), "'MCCP eksl. LOQ' rows: VALUE changed to zero \n")
```
### b3. Check legacy data for duplicates
(11-12 seconds)
```{r}
df_duplicates <- data_legacy %>%
add_count(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2, PARAM) %>%
filter(n > 1)
if (nrow(df_duplicates) > 0){
xtabs(~PARAM, df_duplicates) %>% print()
xtabs(~MYEAR, df_duplicates) %>% print()
xtabs(~MYEAR + PARAM, df_duplicates) %>% print()
stop("Duplicates in the data! Check 'df_duplicates'. (Section 2.b3)\n")
} else {
cat("No duplicates found in the data. \n")
}
```
## 3. Reformat recent data to conform with legacy data
### a1. Create 'dat_new2'
```{r}
dat_new2 <- dat_new1 %>%
mutate(MYEAR = lastyear,
SAMPLE_NO2 = SAMPLE_NO,
BASIS = "W") # hard-coded
```
### b1. Fix parameter names
PAH metabolites in bile:
- "1-OH-fenantren", "1-OH-pyren", "3-OH-benzo[a]pyren" (used in 2020) are the same as PA1OH, PYR1OH, BAP3OH
```{r}
# df_nivabase: Set standard parameter names (PARAM) based on NAME
cat("dat_new2: Set standard parameter names (PARAM) \n")
dat_new2$PARAM <- get_standard_parametername(
dat_new2$NAME,
"Input_data/Lookup table - standard parameter names.csv"
)
# Fix PBDE and PCB substances
dat_new2$PARAM <- sub("BDE-", "BDE", dat_new2$PARAM, fixed = TRUE)
dat_new2$PARAM <- sub("PCB-", "CB", dat_new2$PARAM, fixed = TRUE)
# Extra changes
dat_new2 <- dat_new2 %>%
mutate(
PARAM = case_when(
PARAM %in% c("Ag", "As", "Cd", "Co", "Cr", "Cu", "Hg", "Ni", "Pb", "Sn", "Zn") ~ toupper(PARAM),
substr(PARAM,1,3) %in% "PCB" ~ sub("PCB ", "CB", PARAM, fixed = TRUE),
PARAM %in% "Sølv" ~ "AG",
PARAM %in% "Kvikksølv" ~ "HG",
PARAM %in% "Selen" ~ "SE",
PARAM %in% "Pentaklorbenzen (QCB)" ~ "QCB",
      # the following names are not logical - but need to be like this in order to be consistent with old data.
# Will be changed in 10.
PARAM %in% "Dibutyltinn (DBT)" ~ "DBTIN",
PARAM %in% "Monobutyltinn (MBT)" ~ "MBTIN",
PARAM %in% "Tetrabutyltinn (TetraBT)" ~ "TTBT",
PARAM %in% "1-OH-pyren" ~ "PYR1OH",
PARAM %in% "1-OH-fenantren" ~ "PA1OH",
PARAM %in% "3-OH-benzo[a]pyren" ~ "BAP3OH",
grepl("oktametylsyklotetrasiloksan", PARAM) ~ "D4",
grepl("dekametylsyklopentasiloksan", PARAM) ~ "D5",
grepl("dodekametylsykloheksasiloksan", PARAM) ~ "D6",
grepl("Kortkjedede (SCCP)", PARAM, fixed = TRUE) ~ "SCCP inkl. LOQ",
grepl("Mellomkjedede (MCCP)", PARAM, fixed = TRUE) ~ "MCCP inkl. LOQ",
TRUE ~ PARAM)
)
sel <- dat_new2$NAME %in% "Tørrstoff %"
dat_new2$PARAM[sel] <- "DRYWT%"
cat("dat_new2: PARAM = DRYWT% set for", sum(sel), "records \n")
sel <- dat_new2$NAME %in% "Fettinnhold"
dat_new2$PARAM[sel] <- "Fett"
cat("dat_new2: PARAM = Fett set for", sum(sel), "records \n")
```
### b3. Check for duplicates
```{r}
df_duplicates <- dat_new2 %>%
add_count(STATION_CODE, LATIN_NAME, TISSUE_NAME, MYEAR, SAMPLE_NO2, PARAM) %>%
filter(n > 1)
if (nrow(df_duplicates) > 0){
xtabs(~PARAM, df_duplicates) %>% print()
xtabs(~STATION_CODE, df_duplicates) %>% print()
  stop("Duplicates in the data! Check 'df_duplicates'. (section 3-b3) \n")
} else {
cat("No duplicates found in the data. \n")
}
```
### b4. Save data so far
```{r}
```
### c. Remove sums
```{r}
# 1. Some specific names
sel1 <- dat_new2$PARAM %in% c("Sum PCB(7) inkl. LOQ",
"Total 6 Ikke dioksinlike PCB inkl. LOQ",
"Sum PCB(7) eksl. LOQ",
"Total 6 Ikke dioksinlike PCB eksl. LOQ")
# 2. All starting with "Sum " or "sum ":
sel2 <- tolower(substr(dat_new2$PARAM, 1, 4)) == "sum "
sel <- sel1 | sel2
dat_new2 <- dat_new2[!sel,]
cat("dat_new2:", sum(sel), "records with sums deleted (will be recalculated) \n")
```
### d1. Check tins
**NOTE: see part 10 below**
By comparing with original report, we find that all tins in AqM, except TBT, are given as **tin (Sn) weight**. TBT is given as ion weight.
* See "K:\Prosjekter\Sjøvann\JAMP\2019\analyser\Analyserapporter\snegler\Analyserapport 925-7518 snegler.PDF"
- Exception: for the industry stations I965 and I969, it seems that tins are given as **ion weight**
Previous years - overview of names used
Substance | ION WEIGHT | TIN WEIGHT
---------------------|-----------------------------------|----------------------------------
BUTYLTINS | |
monobutyltin | MBTIN, "monobutyltin (MBT)" | Monobutyltinn (MBT)-Sn
dibutyltin | DBTIN | Dibutyltinn-Sn (DBT-Sn)
tributyltin | TBT | Tributyltinn (TBT)-Sn
tetrabutyltin | TTBT, "Tetrabutyltinn (TetraBT)" | Tetrabutyltinn (TTBT)-Sn
OCTYLTINS | |
monooctyltin | MOT | Monooktyltinn (MOT)-Sn
dioctyltin | DOT | Dioktyltinn-Sn (DOT-Sn)
CYCLOHEXYLTINS | |
tricyclohexyltin | TCHT |
PHENYLTINS | |
Triphenyltin (TPhT) | TPTIN - changed to TPhT in 2021 * | Trifenyltinn (TPhT)-Sn
* Not for legacy data in Nivadatabase, but in this procedure
Some of these (e.g. DBTIN and TPTIN for ion weight) are illogical but need to be like this in
order to conform with legacy data (data_legacy)
Examples from 11G in 2019:
* Tributyltinn
- Report says Tributyltinn (TBT) = <1.9, Tributyltinn (TBT)-Sn = <0.77
- Aquamonitor says TBT = <0.77
- Nivabase says Tributyltinn (TBT) = <1.9, Tributyltinn (TBT)-Sn = <0.77
* Triphenyltin C18-H15-Sn, ion weight 350.0, Sn weight 118.71 = [ion weight]*0.339
- Report says Trifenyltinn (TPhT) = <1.9, Trifenyltinn (TPhT)-Sn = <0.64
- Aquamonitor says TPhT = <0.64
- Nivabase says Trifenyltinn (TPhT) = <1.9, Trifenyltinn (TPhT)-Sn = <0.64
- For new data, 'Trifenyltinn (TPhT)' is translated to TPTIN (but see part 10)
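The triphenyltin example above can be checked against the stated weights (a sketch; 350.0 and 118.71 are the ion and Sn weights given above):

```r
# Sn weight = ion weight * (118.71 / 350.0), i.e. ion weight * 0.339 as stated above
tpht_ion_to_sn <- function(value_ion) value_ion * 118.71 / 350.0

round(tpht_ion_to_sn(1.9), 2)   # <1.9 (ion) corresponds to ~<0.64 (Sn), as in the report
```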
```{r}
# Check first station in report - AqM:
dat_new2 %>%
filter(STATION_CODE == "11G" & year(SAMPLE_DATE) == lastyear) %>%
select(STATION_CODE, SAMPLE_DATE, NAME, PARAM, UNIT, VALUE, FLAG1) %>%
arrange(NAME)
if (TRUE){
  # Check tetrabutyltin or triphenyltin in legacy data
data_legacy %>%
# filter(PARAM %in% c("TTBT", "TTBTIN")) %>%
filter(grepl("TPhT", PARAM, ignore.case = TRUE) |
grepl("Trifenyltinn", PARAM, ignore.case = TRUE) |
grepl("TPTIN", PARAM, ignore.case = TRUE)) %>%
xtabs(~MYEAR + PARAM, .)
}
```
### d2. Fix Triphenyltin in legacy data
```{r}
# ONLY NEEDED TO DO ONCE
# sel <- data_legacy$PARAM %in% "TPTIN"
# data_legacy$PARAM[sel] <- "TPhT"
# message("data_legacy - ",sum(sel), " records: TPhT changed to TPTIN")
# saveRDS(data_legacy, "Data/101_data_updated_2020-08-05.rds")
```
### e1. Check species names
```{r}
table(addNA(dat_new2$LATIN_NAME))
```
### e2. Remove "unwanted species"
* Fucus vesiculosus = rockweed (measured on a Milkys station, but for a different project)
```{r}
n1 <- nrow(dat_new2)
dat_new2 <- dat_new2 %>%
filter(!LATIN_NAME %in% "Fucus vesiculosus")
n2 <- nrow(dat_new2)
cat("Number of rows before:", n1, "\n")
cat("Number of rows after:", n2, "\n")
cat("Difference:", n1-n2, "\n")
```
### f. Check units
- Will be fixed in section 4
```{r}
table(addNA(dat_new2$UNIT))
```
### g. Check and fix uniqueness of samples
* Creating `dat_new3`
```{r}
# Check uniqueness of fat and dry weight
check1 <- dat_new2 %>%
filter(PARAM %in% c("DRYWT%", "Fett")) %>% # View()
group_by(MYEAR, STATION_CODE, LATIN_NAME, TISSUE_NAME, SAMPLE_NO2, PARAM) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
arrange(SAMPLE_NO2, PARAM) %>%
select(LATIN_NAME, SAMPLE_NO2, SAMPLE_ID, PARAM, VALUE, FLAG1, QUANTIFICATION_LIMIT)
cat("Duplicated data, fat and dry weight")
table(check1$STATION_CODE)
# Used below
tab <- xtabs(~SAMPLE_NO2 + SAMPLE_ID + STATION_CODE, check1)
# Check uniqueness of all variables
check2 <- dat_new2 %>%
group_by(MYEAR, STATION_CODE, LATIN_NAME, TISSUE_NAME, SAMPLE_NO2, PARAM) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
arrange(SAMPLE_NO2, PARAM)
cat("Duplicated data, all parameters")
table(check2$STATION_CODE)
# Make Parameter x SAMPLE_ID table
check3 <- dat_new2 %>%
filter(STATION_CODE %in% "26A2") %>%
select(SAMPLE_NO2, PARAM, SAMPLE_ID, VALUE) %>%
pivot_wider(names_from = SAMPLE_ID, values_from = VALUE) %>%
arrange(SAMPLE_NO2, PARAM)
# View(check3)
if (FALSE){
  # For checking which SAMPLE_ID values are the right ones
check4 <- dat_new2 %>%
filter(PARAM %in% c("HBCDA", "Fett")) %>% # View()
group_by(MYEAR, STATION_CODE, LATIN_NAME, TISSUE_NAME, SAMPLE_NO2, PARAM) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
arrange(SAMPLE_NO2, PARAM) %>%
select(LATIN_NAME, SAMPLE_NO2, SAMPLE_ID, PARAM, VALUE, FLAG1, QUANTIFICATION_LIMIT)
View(check4)
# After comparing values with those found in PDF file 'Analyserapport 1055-9485 26A2.PDF' in
# K:\Prosjekter\Sjøvann\JAMP\2020\analyser\analyserapporter\blåskjell
# Sample 1 = SAMPLE_ID 245999
# Sample 2 = SAMPLE_ID 246000
# Sample 3 = SAMPLE_ID 246001
}
# 2021: Nothing is deleted
ids <- colnames(tab)
ids_to_delete <- ids[!ids %in% c("245999", "246000", "246001")]
cat("ids_to_delete: \n")
ids_to_delete