forked from jbloomlab/SARS-CoV-2-RBD_DMS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
single_mut_effects.Rmd
843 lines (638 loc) · 88.4 KB
/
single_mut_effects.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
---
title: "Derive final single mutation functional scores from global epistasis fits"
author: "Tyler Starr"
date: "5/11/2020"
output:
github_document:
toc: true
html_preview: false
editor_options:
chunk_output_type: inline
---
This notebook reads in the coefficients from the global epistasis fits for binding and expression. We assess correlations between models and directly measured single mutants to decide on which models to use moving forward. We then consolidate our final mutant function scores data table. Finally, we validate our DMS measurements via comparison to isogenic and literature reported functional measurements.
```{r setup, message=FALSE, warning=FALSE, error=FALSE}
require("knitr")
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(dev.args = list(png = list(type = "cairo")))
#list of packages to install/load
packages = c("yaml","data.table","tidyverse","gridExtra")
#install any packages not already installed
installed_packages <- packages %in% rownames(installed.packages())
if(any(installed_packages == F)){
install.packages(packages[!installed_packages])
}
#load packages
invisible(lapply(packages, library, character.only=T))
#read in config file
config <- read_yaml("config.yaml")
#read in file giving concordance between RBD numbering and SARS-CoV-2 Spike numbering
RBD_sites <- read.csv(file="data/RBD_sites.csv",stringsAsFactors=F)
#make output directory
if(!file.exists(config$single_mut_effects_dir)){
dir.create(file.path(config$single_mut_effects_dir))
}
```
Session info for reproducing environment:
```{r print_sessionInfo}
sessionInfo()
```
## Setup
Read in tables of per-barcode measured and predicted phenotypes. With these, we can reproduce the plots from the global epistasis notebooks.
```{r bc_tables, fig.width=6,fig.height=9,fig.align="center", dpi=300,dev="png", echo=F}
bc_bind <- data.table(read.csv(file=config$global_epistasis_binding_file,stringsAsFactors = F))
bc_expr <- data.table(read.csv(file=config$global_epistasis_expr_file,stringsAsFactors = F))
#reproduce figures from global epistasis notebooks for illustrating shape (and bc-level scatter) from curve fits:
par(mfrow=c(3,2))
plot(bc_bind[library=="lib1",latent_phenotype_Cauchy_1],bc_bind[library=="lib1",log10Ka],main="bind, lib1",pch=16,col="#00000002",xlab="global epistasis latent phenotype",ylab="DMS measured log10(Ka)")
points(bc_bind[library=="lib1",latent_phenotype_Cauchy_1][order(bc_bind[library=="lib1",latent_phenotype_Cauchy_1])],bc_bind[library=="lib1",predicted_phenotype_Cauchy_1][order(bc_bind[library=="lib1",latent_phenotype_Cauchy_1])],type="l",col="red",lwd=1.5)
plot(bc_bind[library=="lib2",latent_phenotype_Cauchy_2],bc_bind[library=="lib2",log10Ka],main="bind,lib2",pch=16,col="#00000002",xlab="global epistasis latent phenotype",ylab="DMS measured log10(Ka)")
points(bc_bind[library=="lib2",latent_phenotype_Cauchy_2][order(bc_bind[library=="lib2",latent_phenotype_Cauchy_2])],bc_bind[library=="lib2",predicted_phenotype_Cauchy_2][order(bc_bind[library=="lib2",latent_phenotype_Cauchy_2])],type="l",col="red",lwd=1.5)
plot(bc_expr[library=="lib1",latent_phenotype_Cauchy_1],bc_expr[library=="lib1",delta_ML_meanF],main="expr, lib1",pch=16,col="#00000002",xlab="global epistasis latent phenotype",ylab="DMS measured log(MFI)")
points(bc_expr[library=="lib1",latent_phenotype_Cauchy_1][order(bc_expr[library=="lib1",latent_phenotype_Cauchy_1])],bc_expr[library=="lib1",predicted_phenotype_Cauchy_1][order(bc_expr[library=="lib1",latent_phenotype_Cauchy_1])],type="l",col="red",lwd=1.5)
plot(bc_expr[library=="lib2",latent_phenotype_Cauchy_2],bc_expr[library=="lib2",delta_ML_meanF],main="expr,lib2",pch=16,col="#00000002",xlab="global epistasis latent phenotype",ylab="DMS measured log(MFI)")
points(bc_expr[library=="lib2",latent_phenotype_Cauchy_2][order(bc_expr[library=="lib2",latent_phenotype_Cauchy_2])],bc_expr[library=="lib2",predicted_phenotype_Cauchy_2][order(bc_expr[library=="lib2",latent_phenotype_Cauchy_2])],type="l",col="red",lwd=1.5)
#zoom in on expression scale to see what's happening in the main bulk, not stretching the scale for e.g. multiple stop mutants that drive the weird patterning
plot(bc_expr[library=="lib1" & latent_phenotype_Cauchy_1 > -5,latent_phenotype_Cauchy_1],bc_expr[library=="lib1" & latent_phenotype_Cauchy_1 > -5,delta_ML_meanF],main="expr, lib1",pch=16,col="#00000002",xlab="global epistasis latent phenotype",ylab="DMS measured log(MFI)",xlim=c(-5,0.25))
points(bc_expr[library=="lib1" & latent_phenotype_Cauchy_1 > -5,latent_phenotype_Cauchy_1][order(bc_expr[library=="lib1" & latent_phenotype_Cauchy_1 > -5,latent_phenotype_Cauchy_1])],bc_expr[library=="lib1" & latent_phenotype_Cauchy_1 > -5,predicted_phenotype_Cauchy_1][order(bc_expr[library=="lib1" & latent_phenotype_Cauchy_1 > -5,latent_phenotype_Cauchy_1])],type="l",col="red",lwd=1.5)
plot(bc_expr[library=="lib2" & latent_phenotype_Cauchy_2 > -5,latent_phenotype_Cauchy_2],bc_expr[library=="lib2" & latent_phenotype_Cauchy_2 > -5,delta_ML_meanF],main="expr,lib2",pch=16,col="#00000002",xlab="global epistasis latent phenotype",ylab="DMS measured log(MFI)",xlim=c(-5,0.25))
points(bc_expr[library=="lib2" & latent_phenotype_Cauchy_2 > -5,latent_phenotype_Cauchy_2][order(bc_expr[library=="lib2"& latent_phenotype_Cauchy_2 > -5,latent_phenotype_Cauchy_2])],bc_expr[library=="lib2"& latent_phenotype_Cauchy_2 > -5,predicted_phenotype_Cauchy_2][order(bc_expr[library=="lib2"& latent_phenotype_Cauchy_2 > -5,latent_phenotype_Cauchy_2])],type="l",col="red",lwd=1.5)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/global-epistasis-shapes.pdf",sep="")))
```
We also read in all parameters from global epistasis models, and collapse down to individual tables reporting the lib1, lib2, and joint model inferred effects.
```{r beta_tables}
#open and merge together Cauchy likelihood, binding models
betas_bind_observed_Cauchy <- merge(read.csv('results/global_epistasis_binding/Cauchy-predicted-effects_binding_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_binding/Cauchy-predicted-effects_binding_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_bind_observed_Cauchy <- merge(betas_bind_observed_Cauchy,
read.csv('results/global_epistasis_binding/Cauchy-predicted-effects_binding_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_bind_observed_Cauchy)[which(names(betas_bind_observed_Cauchy)=="effect")] <- "effect_joint"
betas_bind_latent_Cauchy <- merge(read.csv('results/global_epistasis_binding/Cauchy-latent-effects_binding_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_binding/Cauchy-latent-effects_binding_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_bind_latent_Cauchy <- merge(betas_bind_latent_Cauchy,
read.csv('results/global_epistasis_binding/Cauchy-latent-effects_binding_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_bind_latent_Cauchy)[which(names(betas_bind_latent_Cauchy)=="effect")] <- "effect_joint"
betas_bind_nonepistatic_Cauchy <- merge(read.csv('results/global_epistasis_binding/nonepistatic-Cauchy-predicted-effects_binding_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_binding/nonepistatic-Cauchy-predicted-effects_binding_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_bind_nonepistatic_Cauchy <- merge(betas_bind_nonepistatic_Cauchy,
read.csv('results/global_epistasis_binding/nonepistatic-Cauchy-predicted-effects_binding_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_bind_nonepistatic_Cauchy)[which(names(betas_bind_nonepistatic_Cauchy)=="effect")] <- "effect_joint"
#open and merge together Gaussian likelihood, binding models
betas_bind_observed_Gaussian <- merge(read.csv('results/global_epistasis_binding/Gaussian-predicted-effects_binding_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_binding/Gaussian-predicted-effects_binding_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_bind_observed_Gaussian <- merge(betas_bind_observed_Gaussian,
read.csv('results/global_epistasis_binding/Gaussian-predicted-effects_binding_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_bind_observed_Gaussian)[which(names(betas_bind_observed_Gaussian)=="effect")] <- "effect_joint"
betas_bind_latent_Gaussian <- merge(read.csv('results/global_epistasis_binding/Gaussian-latent-effects_binding_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_binding/Gaussian-latent-effects_binding_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_bind_latent_Gaussian <- merge(betas_bind_latent_Gaussian,
read.csv('results/global_epistasis_binding/Gaussian-latent-effects_binding_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_bind_latent_Gaussian)[which(names(betas_bind_latent_Gaussian)=="effect")] <- "effect_joint"
betas_bind_nonepistatic_Gaussian <- merge(read.csv('results/global_epistasis_binding/nonepistatic-Gaussian-predicted-effects_binding_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_binding/nonepistatic-Gaussian-predicted-effects_binding_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_bind_nonepistatic_Gaussian <- merge(betas_bind_nonepistatic_Gaussian,
read.csv('results/global_epistasis_binding/nonepistatic-Gaussian-predicted-effects_binding_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_bind_nonepistatic_Gaussian)[which(names(betas_bind_nonepistatic_Gaussian)=="effect")] <- "effect_joint"
#open and merge together Cauchy likelihood, expression models
betas_expr_observed_Cauchy <- merge(read.csv('results/global_epistasis_expression/Cauchy-predicted-effects_expression_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_expression/Cauchy-predicted-effects_expression_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_expr_observed_Cauchy <- merge(betas_expr_observed_Cauchy,
read.csv('results/global_epistasis_expression/Cauchy-predicted-effects_expression_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_expr_observed_Cauchy)[which(names(betas_expr_observed_Cauchy)=="effect")] <- "effect_joint"
betas_expr_latent_Cauchy <- merge(read.csv('results/global_epistasis_expression/Cauchy-latent-effects_expression_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_expression/Cauchy-latent-effects_expression_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_expr_latent_Cauchy <- merge(betas_expr_latent_Cauchy,
read.csv('results/global_epistasis_expression/Cauchy-latent-effects_expression_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_expr_latent_Cauchy)[which(names(betas_expr_latent_Cauchy)=="effect")] <- "effect_joint"
betas_expr_nonepistatic_Cauchy <- merge(read.csv('results/global_epistasis_expression/nonepistatic-Cauchy-predicted-effects_expression_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_expression/nonepistatic-Cauchy-predicted-effects_expression_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_expr_nonepistatic_Cauchy <- merge(betas_expr_nonepistatic_Cauchy,
read.csv('results/global_epistasis_expression/nonepistatic-Cauchy-predicted-effects_expression_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_expr_nonepistatic_Cauchy)[which(names(betas_expr_nonepistatic_Cauchy)=="effect")] <- "effect_joint"
#open and merge together Gaussian likelihood, expression models
betas_expr_observed_Gaussian <- merge(read.csv('results/global_epistasis_expression/Gaussian-predicted-effects_expression_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_expression/Gaussian-predicted-effects_expression_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_expr_observed_Gaussian <- merge(betas_expr_observed_Gaussian,
read.csv('results/global_epistasis_expression/Gaussian-predicted-effects_expression_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_expr_observed_Gaussian)[which(names(betas_expr_observed_Gaussian)=="effect")] <- "effect_joint"
betas_expr_latent_Gaussian <- merge(read.csv('results/global_epistasis_expression/Gaussian-latent-effects_expression_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_expression/Gaussian-latent-effects_expression_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_expr_latent_Gaussian <- merge(betas_expr_latent_Gaussian,
read.csv('results/global_epistasis_expression/Gaussian-latent-effects_expression_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_expr_latent_Gaussian)[which(names(betas_expr_latent_Gaussian)=="effect")] <- "effect_joint"
betas_expr_nonepistatic_Gaussian <- merge(read.csv('results/global_epistasis_expression/nonepistatic-Gaussian-predicted-effects_expression_1.csv',stringsAsFactors=F),
read.csv('results/global_epistasis_expression/nonepistatic-Gaussian-predicted-effects_expression_2.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T,sort=T,suffixes=c("_lib1","_lib2"));
betas_expr_nonepistatic_Gaussian <- merge(betas_expr_nonepistatic_Gaussian,
read.csv('results/global_epistasis_expression/nonepistatic-Gaussian-predicted-effects_expression_joint.csv',stringsAsFactors=F),
by=c("site","mutation","wildtype","mutant"),all=T);names(betas_expr_nonepistatic_Gaussian)[which(names(betas_expr_nonepistatic_Gaussian)=="effect")] <- "effect_joint"
```
## Assessing global epistasis models for binding data
Next, for the global epistasis models built on the binding measurements, let's look at the correlation between model coefficients inferred from lib1 and lib2 binding measurements. We look at the model coefficients both on the "observed" log<sub>10</sub>(*K*<sub>A,app</sub>) and underlying "latent" scales, as well as without any global epistasis nonlinearity.
```{r betas_bind_lib1_v_lib2, fig.width=8,fig.height=12.5,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(3,2))
#observed "log10Ka" scale
#Cauchy
x <- betas_bind_observed_Cauchy$effect_lib1; y <- betas_bind_observed_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (observed scale)",ylab="lib2 beta (observed scale)",main=paste("binding, Cauchy observed, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_bind_observed_Gaussian$effect_lib1; y <- betas_bind_observed_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (observed scale)",ylab="lib2 beta (observed scale)",main=paste("binding, Gaussian observed, R-squared:",round(summary(fit)$r.squared,digits=3)))
#underlying latent scale
#Cauchy
x <- betas_bind_latent_Cauchy$effect_lib1; y <- betas_bind_latent_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (latent scale)",ylab="lib2 beta (latent scale)",main=paste("binding, Cauchy latent, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_bind_latent_Gaussian$effect_lib1; y <- betas_bind_latent_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (latent scale)",ylab="lib2 beta (latent scale)",main=paste("binding, Gaussian latent, R-squared:",round(summary(fit)$r.squared,digits=3)))
#no nonlinear transform
#Cauchy
x <- betas_bind_nonepistatic_Cauchy$effect_lib1; y <- betas_bind_nonepistatic_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (no nonlinearity)",ylab="lib2 beta (no nonlinearity)",main=paste("binding, Cauchy nonepistatic, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_bind_nonepistatic_Gaussian$effect_lib1; y <- betas_bind_nonepistatic_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (no nonlinearity)",ylab="lib2 beta (no nonlinearity)",main=paste("binding, Gaussian nonepistatic, R-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_global-epistasis-coefs_bind.pdf",sep="")))
```
We can see, reassuringly, that our single mutation effect estimates are better when we incorporate the nonlinearity -- this shows that the global epistasis transform improves this single mutation deconvolution from the multi-mutants in the library. The Cauchy likelihood models show better correlation between replicates. This may be in part because the "shape" of global epistasis differed between libraries in the two Gaussian likelihood fits.
One interesting observation is that, in the latent space (which at least from my Ab work, I suspected was also picking up on the *stability* effects of mutations), there do seem to be a substantial number of mutations with positive values -- but when transformed to the observed log<sub>10</sub>(*K*<sub>A,app</sub>) scale, these are squashed to zero from the global epistasis fits. Let's take a look at what positions these positive-latent-effect mutations are observed in. The code below outputs the sites with the largest observed latent-effect mutational effects in the two library replicates.
These positions are visualized on the ACE2-bound RBD structure [here](https://dms-view.github.io/?pdb-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2F6m0j.pdb&markdown-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2FBloomLab_rbd.md&data-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2Fresults%2FBloomLab2020_rbd.csv&condition=natural+frequencies&site_metric=site_entropy&mutation_metric=mut_frequency&selected_sites=337%2C358%2C363%2C365%2C367%2C452%2C460%2C493%2C498%2C501%2C518%2C519%2C527). These mutations fall into two sorts of clusters -- some are ACE2-proximal and may be genuine affinity-mediated positive effects, and others are ACE2-distal and are more likely stability/expression-mediated. My hypothesis is that these positive latent-scale mutations are frequently stabilizing mutations that have a positive latent effect because of positive effects on binding in destabilized background, but do not cause further affinity gains in stabilized background like WT. Let's hold onto this for a bit and come back to it when we look at our directly measured single-mutant barcode mutational effects.
```{r positive_latent_effects}
for(i in 1:nrow(betas_bind_latent_Cauchy)){
betas_bind_latent_Cauchy$site_S[i] <- RBD_sites[RBD_sites$site_RBD==betas_bind_latent_Cauchy$site[i],"site_SARS2"]
}
betas_bind_latent_Cauchy[!is.na(betas_bind_latent_Cauchy$effect_lib1) & !is.na(betas_bind_latent_Cauchy$effect_lib2) & betas_bind_latent_Cauchy$effect_lib1>0.25 & betas_bind_latent_Cauchy$effect_lib2>0.25,c("mutation","site_S")]
unique(betas_bind_latent_Cauchy[!is.na(betas_bind_latent_Cauchy$effect_lib1) & !is.na(betas_bind_latent_Cauchy$effect_lib2) & betas_bind_latent_Cauchy$effect_lib1>0.25 & betas_bind_latent_Cauchy$effect_lib2>0.25,"site_S"])
```
Next, let's look at mutational effects on binding as inferred directly in the Tite-seq assay from barcodes carrying only single mutations. First, let's look at what fraction of mutations were sampled as sole mutations on at least one barcode background in each library. (This differs from what's reported in the `build_variants.ipynb` notebook, as we are now quantifying coverage among barcodes *for which we determined a QC-filtered phenotype*.) We also calculate the total number of barcodes on which a variant is sampled, and make ecdf plots illustrating barcode counts among determined phenotypes. The top two plots show, for pooled libraries, the total number of bcs and the number of single-mutant barcodes on which each mutation is found. The bottom plots show something slightly different, related to the mutational effects we will be taking forward later down in the analysis. This plot shows the number of barcodes on which each type of measurement that we take forth is made on -- for single-lib mutational measurements in which there is no direct 1mut bc sampled, we show the cdf for number of (multi-mutant) bcs on which a mutation was sampled; and for single-lib mutational measurements in which there is at least one 1mut direct bc sampled, the cdf of the number of these 1mut barcodes on which the mutation was sampled.
```{r bind_effects_direct_singles, fig.width=9,fig.height=9,fig.align="center", dpi=300,dev="png"}
#build master "betas" data table that lists *all* mutations, including those that are undetermined in the model outputs
AAs <- c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y","*")
betas <- data.frame()
for(i in 1:nrow(RBD_sites)){
to.add <- data.frame(site_RBD=rep(RBD_sites$site_RBD[i],21), site_SARS2=rep(RBD_sites$site_SARS2[i],21), wildtype=rep(RBD_sites$amino_acid_SARS2[i],21), mutant=AAs, mutation=NA, mutation_RBD=NA)
for(j in 1:nrow(to.add)){
to.add$mutation[j] <- paste(to.add$wildtype[j],to.add$site_SARS2[j],to.add$mutant[j],sep="")
to.add$mutation_RBD[j] <- paste(to.add$wildtype[j],to.add$site_RBD[j],to.add$mutant[j],sep="")
}
betas <- rbind(betas,to.add)
}
betas$wildtype <- as.character(betas$wildtype); betas$mutant <- as.character(betas$mutant)
betas <- data.table(betas)
bc_bind[,aa_subs_list := list(strsplit(aa_substitutions,split=" ")),by=.(library,barcode)]
#gives total number of barcodes with a determined binding phenotype in each library on which a genotype was sampled (takes a while to compute)
betas[,n_bc_bind_lib1 := sum(unlist(lapply(bc_bind[library=="lib1" & !is.na(log10Ka),aa_subs_list], function(x) mutation_RBD %in% x))),by=mutation]
betas[,n_bc_bind_lib2 := sum(unlist(lapply(bc_bind[library=="lib2" & !is.na(log10Ka),aa_subs_list], function(x) mutation_RBD %in% x))),by=mutation]
for(i in 1:nrow(betas)){
delta_log10Ka_1 <- bc_bind[aa_substitutions==betas[i,"mutation_RBD"] & library=="lib1",delta_log10Ka]
delta_log10Ka_2 <- bc_bind[aa_substitutions==betas[i,"mutation_RBD"] & library=="lib2",delta_log10Ka]
betas$n_bc_1mut_bind_lib1[i] <- sum(!is.na(delta_log10Ka_1)) #gives number of single-mutant barcodes with a phenotype on which a genotype was sampled
betas$n_bc_1mut_bind_lib2[i] <- sum(!is.na(delta_log10Ka_2))
betas$bind_lib1_direct[i] <- mean(delta_log10Ka_1,na.rm=T)
betas$bind_lib2_direct[i] <- mean(delta_log10Ka_2,na.rm=T)
}
par(mfrow=c(2,2))
#output ecdf plots: first, for all mutants, counts across all barcodes and across 1mut barcodes
plot(ecdf(betas[mutant!="*" & wildtype!=mutant,n_bc_bind_lib1+n_bc_bind_lib2]),xlim=c(0,200),verticals=T,pch=NA,col.01line=NA,main=paste("pooled binding scores, median =",median(betas[mutant!="*" & wildtype!=mutant,n_bc_bind_lib1+n_bc_bind_lib2])),xlab="number of barcodes",ylab="fraction mutations found < X times")
abline(v=median(betas[mutant!="*" & wildtype!=mutant,n_bc_bind_lib1+n_bc_bind_lib2]),lty=2);abline(h=0.5,lty=2)
plot(ecdf(betas[mutant!="*" & wildtype!=mutant,n_bc_1mut_bind_lib1+n_bc_1mut_bind_lib2]),verticals=T,pch=NA,col.01line=NA,main=paste("pooled binding scores, 1mut barcodes,\nmedian =",median(betas[mutant!="*" & wildtype!=mutant,n_bc_1mut_bind_lib1+n_bc_1mut_bind_lib2])),xlab="number of single-mutant barcodes",ylab="fraction mutations found < X times",xlim=c(0,25))
abline(v=median(betas[mutant!="*" & wildtype!=mutant,n_bc_1mut_bind_lib1+n_bc_1mut_bind_lib2]),lty=2);abline(h=0.5,lty=2)
#next, output ecdf for all-barcode counts *only for single-lib obs that do not have single-mut barcodes*
plot(ecdf(c(betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib1 == 0, n_bc_bind_lib1],betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib2 == 0, n_bc_bind_lib2])), xlim=c(0,50), verticals=T,pch=NA, col.01line=NA, main=paste("single-lib, no direct 1mut bc\nmedian =",median(c(betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib1 == 0, n_bc_bind_lib1],betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib2 == 0, n_bc_bind_lib2]))),xlab="number of barcodes",ylab="fraction mutations found < X times")
abline(v=median(c(betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib1 == 0, n_bc_bind_lib1],betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib2 == 0, n_bc_bind_lib2])),lty=2);abline(h=0.5,lty=2)
#next, output ecdf for 1mut counts *only for single-lib obs that have at least 1 1mut barcode*
plot(ecdf(c(betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib1 > 0, n_bc_1mut_bind_lib1],betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib2 > 0, n_bc_1mut_bind_lib2])), xlim=c(0,25), verticals=T,pch=NA, col.01line=NA, main=paste("single-lib, 1mut bc for directly sampled muts\nmedian =",median(c(betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib1 > 0, n_bc_1mut_bind_lib1],betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib2 > 0, n_bc_1mut_bind_lib2]))),xlab="number of barcodes",ylab="fraction mutations found < X times")
abline(v=median(c(betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib1 > 0, n_bc_1mut_bind_lib1],betas[mutant!="*" & wildtype!=mutant & n_bc_1mut_bind_lib2 > 0, n_bc_1mut_bind_lib2])),lty=2);abline(h=0.5,lty=2)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/coverage-ecdf_bind.pdf",sep="")))
```
In lib1, we directly measured the effects of `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*","bind_lib1_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations *as sole mutations* on at least one barcode background (a higher percentage of mutations are sampled more extensively on multiple mutant backgrounds). In lib2, we directly measured `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*","bind_lib2_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations as singles. Taken together, we directly measured `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*", "bind_lib1_direct"]) | !is.na(betas[wildtype!=mutant & mutant!="*", "bind_lib2_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations as singles in at least one of the two libraries, and we directly measured `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*","bind_lib1_direct"]) & !is.na(betas[wildtype!=mutant & mutant!="*","bind_lib2_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations as single mutations in both libraries.
First, let's look at the correlation between these directly measured single-mutant effects in the two libraries.
```{r bind_effects_direct_singles_plot, fig.width=5,fig.height=5,fig.align="center", dpi=300,dev="png"}
x <- betas$bind_lib1_direct; y <- betas$bind_lib2_direct; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 mutational effect from single mut bcs",ylab="lib2 mutational effect from single mut bcs",main=paste("binding, single mut direct measurements\nR-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_direct-measured-single-mut_bind.pdf",sep="")))
```
Let's look at the sites with directly measured beneficial mutations (output below). As you can see [here](https://dms-view.github.io/?pdb-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2F6m0j.pdb&markdown-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2FBloomLab_rbd.md&data-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2Fresults%2FBloomLab2020_rbd.csv&condition=natural+frequencies&site_metric=site_entropy&mutation_metric=mut_frequency&selected_sites=367%2C453%2C484%2C498%2C501%2C505%2C528), five of the seven sites with the strongest directly-measured beneficial effect are at the ACE2 interface, suggesting these might be valid affinity-enhancing mutations. This is in contrast to the beneficial latent-effect measurements we looked at above, which were at a combination of ACE2-proximal and ACE2-distal sites, seemingly including both stabiltiy- and affinity-mediated effects. This observation is consistent with the model that stability-enhancing mutations, by and large, don't further improve affinity in the stable wildtype background, though they may improve affinity in destabilized secondary mutant backgrounds and therefore acquire positive latent-scale coefficients.
```{r direct_singles_beneficial_binding}
betas[bind_lib1_direct>0.1 & bind_lib2_direct>0.1,c("mutation","site_SARS2")]
unique(betas[bind_lib1_direct>0.1 & bind_lib2_direct>0.1,site_SARS2])
```
How do the mutation effect coefficients estimated in global epistasis/regression models correlate with these direct measurements of single mutant effects?
The plots below show the average directly measured phenotype for single mutant barcodes in the two libraries (only showing mutations that were sampled as singles in *both* libraries), versus the average of the beta coefficients for the mutation from the lib1 and lib2 model fits (once again, only showing mutations that were determined in *both* libraries). Note, these correlations are therefore biased toward the best-coverage mutants we have, since they are only for mutants that were sampled as singles in both libraries, which are going to probably be at higher frequency in general, and therefore are better determined with direct experiment as well as model decompositions. We will investigate lower-coverage mutants further below.
```{r bind_avg_coefs_v_direct_singles, fig.width=8,fig.height=8,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(2,2))
#Cauchy, observed
x <- rowMeans(betas_bind_observed_Cauchy[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_bind_observed_Cauchy)){y <- c(y, betas[mutation_RBD==betas_bind_observed_Cauchy$mutation[i],mean(c(bind_lib1_direct,bind_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta, Cauchy observed",ylab="direct single mut measurement",main=paste("Directly measured vs Cauchy, observed scale.\n R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian, observed
x <- rowMeans(betas_bind_observed_Gaussian[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_bind_observed_Gaussian)){y <- c(y, betas[mutation_RBD==betas_bind_observed_Gaussian$mutation[i],mean(c(bind_lib1_direct,bind_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta",ylab="direct single mut measurement",main=paste("Directly measured vs Gaussian, observed scale.\n R-squared:",round(summary(fit)$r.squared,digits=3)))
#Cauchy, latent
x <- rowMeans(betas_bind_latent_Cauchy[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_bind_latent_Cauchy)){y <- c(y, betas[mutation_RBD==betas_bind_latent_Cauchy$mutation[i],mean(c(bind_lib1_direct,bind_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta, Cauchy latent",ylab="direct single mut measurement",main=paste("Directly measured vs Cauchy, latent scale.\n R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian, latent
x <- rowMeans(betas_bind_latent_Gaussian[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_bind_latent_Gaussian)){y <- c(y, betas[mutation_RBD==betas_bind_latent_Gaussian$mutation[i],mean(c(bind_lib1_direct,bind_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta",ylab="direct single mut measurement",main=paste("Directly measured vs Gaussian, latent scale.\n R-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_global-epistasis_v_direct-observed-singles_bind.pdf",sep="")))
```
In comparing model coefficients to the directly sampled single mutants, it *does* seem like the latent-scale beneficial mutations are not necessarily observed to enhance affinity when measured directly as single mutations on the wildtype background (see the "bend" near [0,0] in the lower plots). This lends further support to the idea that the *observed* scale effects are our model coefficients of interest, and could explain why the global epistasis models fit a threshold at the wildtype in the nonlinearity, disallowing beneficial latent scale mutations (which, I hypothesize, are largely stabilizing mutations) from exhibiting beneficial affinity effects on the stable WT background, as by and large, the direct measurements of these mutations were ~neutral on the wildtype background.
(We also inferred *joint* models, where we concatenated all of the barcode phenotype measurements from the two libraries together before model fitting -- instead of averaging beta coefficients from two fits, we are therefore just fitting one beta to all of the data simultaneously. I made plots like above for these joint coefficients, compared to the average of the two replicate coefficients -- in every case these joint models perform slightly worse than the average coefficient, so I have removed this code for now.)
The plots above are sort of the "best case" correlation between model and directly measured variants, as they are constructed for mutations that are prone to being high-coverage by virtue of being directly sampled in both libraries. The Cauchy-likelihood observed-scale model coefficients perform the best above -- how do they perform in a more difficult circumstance (and one that is potentially indicative of their eventual use), in which it's "filling in" mutations that were not directly sampled, which are therefore inherently lower-coverage?
To look at this scenario, we take mutations that are *not* directly sampled as single mutations in lib1, but are in lib2, and see how well the lib1 model coefficient correlates with the lib2 direct measurements. We also do the inverse (plot lib2 coefficients versus lib1-only direct samples), and plot the two versions concatenated together for better N.
```{r bind_lib1_coef_v_lib2_direct, fig.width=12,fig.height=4.25,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(1,3))
lib1_coef_lib2_direct <- betas[is.na(bind_lib1_direct) & !is.na(bind_lib2_direct) & mutant!=wildtype & mutant!="*",]
lib1_coef_lib2_direct$bind_lib1_beta <- as.numeric(NA)
for(i in 1:nrow(lib1_coef_lib2_direct)){
lib1_coef_lib2_direct[i,"bind_lib1_beta"] <- betas_bind_observed_Cauchy[betas_bind_observed_Cauchy$mutation==lib1_coef_lib2_direct[i,mutation_RBD],"effect_lib1"]
}
x<-lib1_coef_lib2_direct$bind_lib1_beta; y<-lib1_coef_lib2_direct$bind_lib2_direct; fit<-lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta coefficient, Cauchy observed",ylab="lib2 direct single mut measurement",main=paste("Directly measured vs Cauchy observed, single-lib-NA\n R-squared:",round(summary(fit)$r.squared,digits=3)))
lib2_coef_lib1_direct <- betas[!is.na(bind_lib1_direct) & is.na(bind_lib2_direct) & mutant!=wildtype & mutant!="*",]
lib2_coef_lib1_direct$bind_lib2_beta <- as.numeric(NA)
for(i in 1:nrow(lib2_coef_lib1_direct)){
lib2_coef_lib1_direct[i,"bind_lib2_beta"] <- betas_bind_observed_Cauchy[betas_bind_observed_Cauchy$mutation==lib2_coef_lib1_direct[i,mutation_RBD],"effect_lib2"]
}
x<-lib2_coef_lib1_direct$bind_lib2_beta; y<-lib2_coef_lib1_direct$bind_lib1_direct; fit<-lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib2 beta coefficient, Cauchy observed",ylab="lib1 direct single mut measurement",main=paste("Directly measured vs Cauchy observed, single-lib-NA\n R-squared:",round(summary(fit)$r.squared,digits=3)))
x <- c(lib1_coef_lib2_direct$bind_lib1_beta,lib2_coef_lib1_direct$bind_lib2_beta);y<-c(lib1_coef_lib2_direct$bind_lib2_direct, lib2_coef_lib1_direct$bind_lib1_direct);fit<-lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="libX beta coefficient, Cauchy observed",ylab="libY direct single mut measurement",main=paste("Directly measured vs Cauchy observed, single-lib-NA\n R-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_global-epistasis_v_direct-observed-singles_bind_one-lib-only.pdf",sep="")))
```
For mutations that are not sampled as single mutants in either library, we cannot of course compare to any direct measurements, but we can simply see how well correlated the two replicate model beta coefficients are. This is shown below, once again with the Cauchy observed scale coefficients. These are likely to be the worst beta coefficients, and yet they still correlate quite well.
```{r bind_coef_no_direct, fig.width=4.75,fig.height=5,fig.align="center", dpi=300,dev="png"}
lib1_coef_lib2_coef <- betas[is.na(bind_lib1_direct) & is.na(bind_lib2_direct) & mutant!=wildtype & mutant!="*",]
lib1_coef_lib2_coef$bind_lib1_beta <- as.numeric(NA)
lib1_coef_lib2_coef$bind_lib2_beta <- as.numeric(NA)
for(i in 1:nrow(lib1_coef_lib2_coef)){
if(lib1_coef_lib2_coef$mutation_RBD[i] %in% betas_bind_observed_Cauchy$mutation){
lib1_coef_lib2_coef[i,"bind_lib1_beta"] <- betas_bind_observed_Cauchy[betas_bind_observed_Cauchy$mutation==lib1_coef_lib2_coef[i,mutation_RBD],"effect_lib1"]
lib1_coef_lib2_coef[i,"bind_lib2_beta"] <- betas_bind_observed_Cauchy[betas_bind_observed_Cauchy$mutation==lib1_coef_lib2_coef[i,mutation_RBD],"effect_lib2"]
}else{
lib1_coef_lib2_coef[i,"bind_lib1_beta"] <- NA
lib1_coef_lib2_coef[i,"bind_lib2_beta"] <- NA
}
}
x<-lib1_coef_lib2_coef$bind_lib1_beta; y<-lib1_coef_lib2_coef$bind_lib2_beta; fit<-lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta coefficient, Cauchy observed",ylab="lib2 beta coefficient, Cauchy observed",main=paste("Model coefs, muts with no direct obs\n R-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_global-epistasis-coefs_no-direct-measure-values_bind.pdf",sep="")))
```
Overall, my intuition from looking at the mutations with positive latent-scale versus directly measured beneficial effects, along with the shapes of correlations between latent-scale and direct measurements, leads me to believe that stability is a large contributor to the latent-scale effects of mutations. Because of this, and the fact that stabilizing mutations don't enhance affinity in the stabilized WT background, the observed scale nonlinearity censors the beneficial latent-effect mutations to have no effect (beta = 0) on the WT background observed-scale. However, this complete censoring of positive observed-scale coefficients contrasts with our observation from directly sampled measurements, that there do appear to be *some* true affinity-enhancing single mutants in the wildtype background.
Taken together, then, my proposal is as follows:
* For any single mutation that was directly sampled on at least one barcode in a replicate, use the directly measured single mutant effect from those direct samples for that replicate.
* For any single mutation that was not directly sampled on at least one barcode, use the Cauchy observed-scale global epistasis coefficient for that replicate.
* Then, for each mutation's 'final' effect, we take the average effect of its value in the two replicate libraries. Based on the numbers output above:
* For ~78% of single mutants, this average will be computed from two directly-measured single mutant effects. This class of replicate measurements are described by the correlation above with R-squared of ~0.974
* For about 16% of single mutants, this average will be computed from one directly-measured single mutant effect, and one model coefficient from a global epistasis model. This class of replicate measurements are described by the correlation above with R-squared of ~0.876
* For about 6% of single mutants, this average will be computed from two global epistasis model coefficients. This class of replicate measurements are described by the correlation above with R-squared of ~0.766
The main thing to note again at this point, is that the global epistasis model coefficients are disallowed from values >0. Taken together, I think this behavior is acceptable -- positing a beneficial mutational effect on binding from these modeled coefficients alone, which are less precisely determined, would be less convincing. So, I think taking a more conservative approach per the global epistasis model, disallowing non-direct measurements from being assigned affinity-enhancing effects, is appropriate.
Here, we add the Cauchy observed-scale model-predicted beta coefficients to our growing "betas" master data table. We then derive a final lib1 and lib2 mutational effect, which is the directly sampled single mutant effect if present, or the model coefficient if not. Finallly, we calculate the final binding effect for a mutation, as the average of the lib1 and lib2 scores.
```{r bind_add_model_coefs}
betas[,bind_lib1_coef := betas_bind_observed_Cauchy[betas_bind_observed_Cauchy$mutation==mutation_RBD,"effect_lib1"],by=mutation_RBD]
betas[,bind_lib2_coef := betas_bind_observed_Cauchy[betas_bind_observed_Cauchy$mutation==mutation_RBD,"effect_lib2"],by=mutation_RBD]
betas[,bind_lib1 := bind_lib1_direct]
betas[is.na(bind_lib1),bind_lib1 := bind_lib1_coef]
betas[,bind_lib2 := bind_lib2_direct]
betas[is.na(bind_lib2),bind_lib2 := bind_lib2_coef]
betas[,bind_avg := mean(c(bind_lib1,bind_lib2),na.rm=T),by=mutation]
```
Here is the final plot showing correlation in mutational effects on binding between the two libraries. Overall, we derived single mutant binding scores for `r nrow(betas[mutant!=wildtype & mutant!="*" & !is.na(bind_avg),])` of `r nrow(betas[mutant!=wildtype & mutant!="*",])` RBD mutations.
```{r bind_final_correlation, fig.width=4,fig.height=4.25,fig.align="center", dpi=300,dev="png"}
x <- betas$bind_lib1; y <- betas$bind_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 mutational effect",ylab="lib2 mutational effect",main=paste("binding, final muts\nR-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_final-single-mut_bind.pdf",sep="")))
```
## Assessing global epistasis models for expression data
Next, we will assess the global epistasis models built on the expression measurements. Let's first look at the correlation between model coefficients inferred from lib1 and lib2 expression measurements, both on the "observed" mean fluorescece and underlying "latent" scales, along with outputs from models with no global epistasis nonlinear correction.
```{r betas_expr_lib1_v_lib2, fig.width=8,fig.height=13,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(3,2))
#observed "mean fluor" scale
#Cauchy
x <- betas_expr_observed_Cauchy$effect_lib1; y <- betas_expr_observed_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (observed scale)",ylab="lib2 beta (observed scale)",main=paste("expression, Cauchy observed, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_expr_observed_Gaussian$effect_lib1; y <- betas_expr_observed_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (observed scale)",ylab="lib2 beta (observed scale)",main=paste("expression, Gaussian observed, R-squared:",round(summary(fit)$r.squared,digits=3)))
#underlying latent scale
#Cauchy
x <- betas_expr_latent_Cauchy$effect_lib1; y <- betas_expr_latent_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (latent scale)",ylab="lib2 beta (latent scale)",main=paste("expression, Cauchy latent, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_expr_latent_Gaussian$effect_lib1; y <- betas_expr_latent_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (latent scale)",ylab="lib2 beta (latent scale)",main=paste("expression, Gaussian latent, R-squared:",round(summary(fit)$r.squared,digits=3)))
#no nonlinear transform
#Cauchy
x <- betas_expr_nonepistatic_Cauchy$effect_lib1; y <- betas_expr_nonepistatic_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (no nonlinearity)",ylab="lib2 beta (no nonlinearity)",main=paste("expression, Cauchy nonepistatic, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_expr_nonepistatic_Gaussian$effect_lib1; y <- betas_expr_nonepistatic_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (no nonlinearity)",ylab="lib2 beta (no nonlinearity)",main=paste("expression, Gaussian nonepistatic, R-squared:",round(summary(fit)$r.squared,digits=3)))
```
The observed-scale plots show some obvious errant fit points with large positive predicted effects. Let's filter these points to be NA in the library in which they were assigned these large values, and look at these plots with the filtered data. Also, for better comparison to the latent-scale phenotypes, we zoom in on the range of the latent scale excluding the highly deleterious cluster of STOP variants near [-15, -15]. We are *not* removing these points from the correlation computation, however, but simply want to investigate the correlation by eye in the region closer to wildtype.
```{r betas_expr_lib1_v_lib2_filtered, fig.width=8,fig.height=13,fig.align="center", dpi=300,dev="png"}
for(i in 1:nrow(betas_expr_observed_Cauchy)){
if(!is.na(betas_expr_observed_Cauchy$effect_lib1[i]) & betas_expr_observed_Cauchy$effect_lib1[i] > 1){betas_expr_observed_Cauchy$effect_lib1[i] <- NA}
if(!is.na(betas_expr_observed_Cauchy$effect_lib2[i]) & betas_expr_observed_Cauchy$effect_lib2[i] > 1){betas_expr_observed_Cauchy$effect_lib2[i] <- NA}
}
for(i in 1:nrow(betas_expr_observed_Gaussian)){
if(!is.na(betas_expr_observed_Gaussian$effect_lib1[i]) & betas_expr_observed_Gaussian$effect_lib1[i] > 1){betas_expr_observed_Gaussian$effect_lib1[i] <- NA}
if(!is.na(betas_expr_observed_Gaussian$effect_lib2[i]) & betas_expr_observed_Gaussian$effect_lib2[i] > 1){betas_expr_observed_Gaussian$effect_lib2[i] <- NA}
}
par(mfrow=c(3,2))
#observed "mean fluor" scale
#Cauchy
x <- betas_expr_observed_Cauchy$effect_lib1; y <- betas_expr_observed_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (observed scale)",ylab="lib2 beta (observed scale)",main=paste("expression, Cauchy observed, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_expr_observed_Gaussian$effect_lib1; y <- betas_expr_observed_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (observed scale)",ylab="lib2 beta (observed scale)",main=paste("expression, Gaussian observed, R-squared:",round(summary(fit)$r.squared,digits=3)))
#underlying latent scale
#Cauchy
x <- betas_expr_latent_Cauchy$effect_lib1; y <- betas_expr_latent_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlim=c(-2,1),ylim=c(-2,1),xlab="lib1 beta (latent scale)",ylab="lib2 beta (latent scale)",main=paste("expression, Cauchy latent, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_expr_latent_Gaussian$effect_lib1; y <- betas_expr_latent_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlim=c(-2,1),ylim=c(-2,1),xlab="lib1 beta (latent scale)",ylab="lib2 beta (latent scale)",main=paste("expression, Gaussian latent, R-squared:",round(summary(fit)$r.squared,digits=3)))
#no nonlinear transform
#Cauchy
x <- betas_expr_nonepistatic_Cauchy$effect_lib1; y <- betas_expr_nonepistatic_Cauchy$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (no nonlinearity)",ylab="lib2 beta (no nonlinearity)",main=paste("expression, Cauchy nonepistatic, R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian
x <- betas_expr_nonepistatic_Gaussian$effect_lib1; y <- betas_expr_nonepistatic_Gaussian$effect_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 beta (no nonlinearity)",ylab="lib2 beta (no nonlinearity)",main=paste("expression, Gaussian nonepistatic, R-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_global-epiistasis-coefs_expr.pdf",sep="")))
```
We can see once again that the correlation in estimates of single mutation effects is better when accounting for nonlinearity in the mean fluorescence expression metric -- this shows that the global epistasis transform improves this single mutation deconvolution from the multi-mutants in the library. The steep slope in the nonlinearity fit between latent and observed scale in the global_epistasis_expression notebook creates a "shrinking" of the scale of latent effects for the majority of mutations -- this probably comes from the fact that the global epistasis model is set so that the average latent-effect coefficient is 1, but when there are some mutations (primarily nonsense mutants) with extremely negative latent-scale effects, it shrinks the range for the rest of mutations and creates this sharp slope.
In any case, the observed-scale measurements look very well-correlated between replicates. Finally, on each of these scales, it looks like there's quite a large number of mutations with positive expression effects, which will be interesting to look at later on.
Next, let's look at directly sampled single-mutant effects on expression from barcodes carrying just single mutations. We calculate the number of barcodes each mutation is found on, including single-mutant barcodes and all barcodes for which expression measurements were fit and kept post-filtering. We also make cdf plots of the number of barcodes on which a mutation is sampled.
```{r expr_effects_direct_singles, fig.width=9,fig.height=4.5,fig.align="center", dpi=300,dev="png"}
#add to master "betas" data table that lists *all* mutations, including those that are undetermined in the model outputs
bc_expr[,aa_subs_list := list(strsplit(aa_substitutions,split=" ")),by=.(library,barcode)]
#gives total number of barcodes with a determined expring phenotype in each library on which a genotype was sampled (takes a while to compute)
betas[,n_bc_expr_lib1 := sum(unlist(lapply(bc_expr[library=="lib1" & !is.na(ML_meanF),aa_subs_list], function(x) mutation_RBD %in% x))),by=mutation]
betas[,n_bc_expr_lib2 := sum(unlist(lapply(bc_expr[library=="lib2" & !is.na(ML_meanF),aa_subs_list], function(x) mutation_RBD %in% x))),by=mutation]
for(i in 1:nrow(betas)){
delta_meanF_1 <- bc_expr[aa_substitutions==betas[i,"mutation_RBD"] & library=="lib1",delta_ML_meanF]
delta_meanF_2 <- bc_expr[aa_substitutions==betas[i,"mutation_RBD"] & library=="lib2",delta_ML_meanF]
betas$n_bc_1mut_expr_lib1[i] <- sum(!is.na(delta_meanF_1)) #gives number of single-mutant barcodes with a phenotype on which a genotype was sampled
betas$n_bc_1mut_expr_lib2[i] <- sum(!is.na(delta_meanF_2))
betas$expr_lib1_direct[i] <- mean(delta_meanF_1,na.rm=T)
betas$expr_lib2_direct[i] <- mean(delta_meanF_2,na.rm=T)
}
par(mfrow=c(1,2))
#output ecdf plots: for all mutants, counts across all barcodes and across 1mut barcodes between the pooled libraries, and separately for each library measurement which is used to average
plot(ecdf(betas[mutant!="*" & wildtype!=mutant,n_bc_expr_lib1+n_bc_expr_lib2]),xlim=c(0,200),verticals=T,pch=NA,col.01line=NA,main=paste("pooled expression scores, median =",median(betas[mutant!="*" & wildtype!=mutant,n_bc_expr_lib1+n_bc_expr_lib2])),xlab="number of barcodes",ylab="fraction mutations found < X times")
abline(v=median(betas[mutant!="*" & wildtype!=mutant,n_bc_expr_lib1+n_bc_expr_lib2]),lty=2);abline(h=0.5,lty=2)
plot(ecdf(betas[mutant!="*" & wildtype!=mutant,n_bc_1mut_expr_lib1+n_bc_1mut_expr_lib2]),verticals=T,pch=NA,col.01line=NA,main=paste("pooled expression scores, 1mut barcodes,\nmedian =",median(betas[mutant!="*" & wildtype!=mutant,n_bc_1mut_expr_lib1+n_bc_1mut_expr_lib2])),xlab="number of single-mutant barcodes",ylab="fraction mutations found < X times",xlim=c(0,25))
abline(v=median(betas[mutant!="*" & wildtype!=mutant,n_bc_1mut_expr_lib1+n_bc_1mut_expr_lib2]),lty=2);abline(h=0.5,lty=2)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/coverage-ecdf_expr.pdf",sep="")))
```
In lib1, we directly measured the effects of `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*","expr_lib1_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations *as sole mutations* on at least one barcode background (a higher percentage of mutations are sampled more extensively on multiple mutant backgrounds). In lib2, we directly measured `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*","expr_lib2_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations as singles. Taken together, we directly measured `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*", "expr_lib1_direct"]) | !is.na(betas[wildtype!=mutant & mutant!="*", "expr_lib2_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations as singles in at least one of the two libraries, and we directly measured `r round(sum(!is.na(betas[wildtype!=mutant & mutant!="*","expr_lib1_direct"]) & !is.na(betas[wildtype!=mutant & mutant!="*","expr_lib2_direct"]))/nrow(betas[wildtype!=mutant & mutant!="*",])*100,digits=2)`% of mutations as single mutations in both libraries.
First, let's look at the correlation between these directly measured single-mutant effects in the two libraries.
```{r expr_effects_direct_singles_plot, fig.width=5,fig.height=5.25,fig.align="center", dpi=300,dev="png"}
x <- betas$expr_lib1_direct; y <- betas$expr_lib2_direct; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 mutational effect from single mut bcs",ylab="lib2 mutational effect from single mut bcs",main=paste("expression, single mut direct measurements\nR-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_direct-measured-single-mut_expr.pdf",sep="")))
```
In this case, in contrast to the binding measurements, we do *not* see better correlation between direct single measurements compared to the correlation in the model-predicted observed-scale effects. This isn't surprising -- the expression measurements are conducted without pre-selecting for properly expressing variants, and we know there are a small number of barcodes that we ascribe as representing wildtype/synonymous variants with very low expression (which are evidently weeded out by the RBD+ pre-sort prior to the binding measurements). These are presumably plasmids with deleterious mutations outside of the PacBio sequencing window, or contain some other artifact -- we are able to identify and discard these in the case of wildtype observations, but for any mutant variant, we cannot distinguish when this is happening at face value. Therefore, direct single measurements of expression effects of mutation are likely conflated with these types of additional factors that we cannot identify, whereas the global epistasis decomposition averages across all backgrounds containing a mutation, generating improved estimates of mutational effects. We also believe our binding measurements are inherently higher quality than the expression measurements for multiple reasons. Taken together, it seems that for the expression measurements, there may be more justification to use all model coefficients, instead of the combination of direct and estimated effects as done with binding.
Next, let's look at how the global epistasis model coefficients correlate with these directly measured single-mutant effects. The plots below show the average directly measured phenotype for single mutant barcodes in the two libraries (only showing mutations that were sampled as singles in *both* libraries), versus the average of the beta coefficients for the mutation from the lib1 and lib2 model fits (once again, only showing mutations that were determined in *both* libraries). For the latent scale, we also zoom in the plot view to get better insight into the relationship between observed and latent effects within the non-censored window.
```{r expr_avg_coefs_v_direct_singles, fig.width=8,fig.height=12,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(3,2))
#Cauchy, observed
x <- rowMeans(betas_expr_observed_Cauchy[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_expr_observed_Cauchy)){y <- c(y, betas[mutation_RBD==betas_expr_observed_Cauchy$mutation[i],mean(c(expr_lib1_direct,expr_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta, Cauchy observed",ylab="direct single mut measurement",main=paste("Directly measured vs Cauchy, observed scale.\n expression R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian, observed
x <- rowMeans(betas_expr_observed_Gaussian[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_expr_observed_Gaussian)){y <- c(y, betas[mutation_RBD==betas_expr_observed_Gaussian$mutation[i],mean(c(expr_lib1_direct,expr_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta",ylab="direct single mut measurement",main=paste("Directly measured vs Gaussian, observed scale.\n expression R-squared:",round(summary(fit)$r.squared,digits=3)))
#Cauchy, latent
x <- rowMeans(betas_expr_latent_Cauchy[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_expr_latent_Cauchy)){y <- c(y, betas[mutation_RBD==betas_expr_latent_Cauchy$mutation[i],mean(c(expr_lib1_direct,expr_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta, Cauchy latent",ylab="direct single mut measurement",main=paste("Directly measured vs Cauchy, latent scale.\n expression R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian, latent
x <- rowMeans(betas_expr_latent_Gaussian[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_expr_latent_Gaussian)){y <- c(y, betas[mutation_RBD==betas_expr_latent_Gaussian$mutation[i],mean(c(expr_lib1_direct,expr_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="average lib1&lib2 beta",ylab="direct single mut measurement",main=paste("Directly measured vs Gaussian, latent scale.\n expression R-squared:",round(summary(fit)$r.squared,digits=3)))
#Cauchy, latent, zoomed in
x <- rowMeans(betas_expr_latent_Cauchy[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_expr_latent_Cauchy)){y <- c(y, betas[mutation_RBD==betas_expr_latent_Cauchy$mutation[i],mean(c(expr_lib1_direct,expr_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlim=c(-1,0.5),ylim=c(-4,1),xlab="average lib1&lib2 beta, Cauchy latent",ylab="direct single mut measurement",main=paste("Directly measured vs Cauchy, zoom latent scale.\n expression R-squared:",round(summary(fit)$r.squared,digits=3)))
#Gaussian, latent, zommed in
x <- rowMeans(betas_expr_latent_Gaussian[,c("effect_lib1","effect_lib2")]);y <- NULL; for(i in 1:nrow(betas_expr_latent_Gaussian)){y <- c(y, betas[mutation_RBD==betas_expr_latent_Gaussian$mutation[i],mean(c(expr_lib1_direct,expr_lib2_direct))])};fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlim=c(-1,0.5),ylim=c(-4,1),xlab="average lib1&lib2 beta",ylab="direct single mut measurement",main=paste("Directly measured vs Gaussian, zoom latent scale.\n expression R-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_global-epistasis_v_direct-observed-singles_expr.pdf",sep="")))
```
Taken together, the observed-scale global epistasis coefficients appear to deviate from directly sampled measurements, such that the observed-scale predictions generate more positive expression effects than are actually observed in the direct sampling. This could point to an incorrect/incomplete global nonlinearity correction? I think the errant negative points could also be contributing (that is, mutations which the model predicts should have ~neutral effects, but experimentally were determined to have large deleteirous effects on expression -- we filtered out wildtype/synonymous barcodes of this type, but we do not have a basis on which to filter out mutant variants since we can not identify these at face value). It does appear that perhaps the Cauchy likelihood model is more tolerant of these outliers (it seems to have more points in this lower-right quadrant, and mathematically, it should be more tolerant of this type of outlier behavior because of it's fatter tails). The Cauchy likelihood does have marginally smaller R-squared than Gaussian likelihood (perhaps because it *is* more tolerant of these outliers), but it appears to exhibit less of this "bending" in the curve versus directly sampled singles, which is behavior that seems preferred, so we will go ahead with the Cauchy-likelihood observed-scale measurements as our expression effects of mutations -- and because these correlate better between replicates than do the directly sampled single-mutant barcodes, in contrast to the binding phenotypes, we will use *all* coefficients, not a mix of direct and modeled effects.
```{r expr_add_model_coefs}
betas[,expr_lib1_coef := betas_expr_observed_Cauchy[betas_expr_observed_Cauchy$mutation==mutation_RBD,"effect_lib1"],by=mutation_RBD]
betas[,expr_lib2_coef := betas_expr_observed_Cauchy[betas_expr_observed_Cauchy$mutation==mutation_RBD,"effect_lib2"],by=mutation_RBD]
betas[,expr_lib1 := expr_lib1_coef]
betas[,expr_lib2 := expr_lib2_coef]
betas[,expr_avg := mean(c(expr_lib1,expr_lib2),na.rm=T),by=mutation]
```
Here is the final plot showing correlation in mutational effects on expression between the two libraries. Overall, we derived single mutant expression scores for `r nrow(betas[mutant!=wildtype & mutant!="*" & !is.na(expr_avg),])` of `r nrow(betas[mutant!=wildtype & mutant!="*",])` RBD mutations.
```{r expr_final_correlation, fig.width=4,fig.height=4.25,fig.align="center", dpi=300,dev="png"}
x <- betas$expr_lib1; y <- betas$expr_lib2; fit <- lm(y~x)
plot(x,y,pch=16,col="#00000067",xlab="lib1 mutational effect",ylab="lib2 mutational effect",main=paste("expression, final muts\nR-squared:",round(summary(fit)$r.squared,digits=3)))
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_final-single-mut_expr.pdf",sep="")))
```
Last, save our estimated single-mutant effects as a csv file. We output the estimated effects on binding and expression in each library and the average of the libraries.
```{r save_output}
betas %>%
mutate_if(is.numeric, round, digits=2) %>%
dplyr::select(site_RBD, site_SARS2, wildtype, mutant, mutation, mutation_RBD,
bind_lib1, bind_lib2, bind_avg, expr_lib1, expr_lib2, expr_avg) %>%
write.csv(file=config$single_mut_effects_file, row.names=F)
```
## Output summary of homolog phenotypes
Let's summarize the binding and expression phenotypes of the homologous RBD sequences that were spiked into our libraries. We compute the average and standard error of binding and expression phenotypes relative to WT SARS-CoV-2 separately in each library, and then calculate the average binding and expression from the two library measurements. For binding, our average is the mean delta-log<sub>10</sub>(*K*<sub>A,app</sub>) relative to SARS-CoV-2 wildtype. For expression, our average is the *median* delta-mean-fluorescence relative to SARS-CoV-2 wildtype. Because some homologs have a tail of lowly expressing barcodes (which are presumably selected out in the RBD+ sort prior to the binding measureements), we use the median as our average because it is more robust to outliers. The standard error of a median is higher than the standard error of the mean by a factor of 1.2533 if the measurement is normally distributed (which we are only moderately in violation of for some homologs), so we multiply our per-library standard errors by this factor.
```{r read_homologs_input}
bc_homologs_bind <- data.table(read.csv(file=config$Titeseq_Kds_homologs_file))
bc_homologs_bind[target=="SARS-CoV",target:="SARS-CoV-1"]
bc_homologs_expr <- data.table(read.csv(file=config$expression_sortseq_homologs_file))
bc_homologs_expr[target=="SARS-CoV",target:="SARS-CoV-1"]
homologs <- data.table(homolog=factor(c("SARS-CoV-2","GD-Pangolin","RaTG13","SARS-CoV-1","WIV16","LYRa11","ZC45","ZXC21","HKU3-1","Rf1","Rp3","BM48-31"),
levels=c("SARS-CoV-2","GD-Pangolin","RaTG13","SARS-CoV-1","WIV16","LYRa11","ZC45","ZXC21","HKU3-1","Rf1","Rp3","BM48-31")),
clade=c("SARS-CoV-2","SARS-CoV-2","SARS-CoV-2","Clade 1","Clade 1","Clade 1","Clade 2", "Clade 2","Clade 2","Clade 2","Clade 2","Clade 3"))
#add color column to homologs, by clade
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
homologs$clade_color <- as.character(NA); homologs[clade=="Clade 1",clade_color := cbPalette[4]]; homologs[clade=="Clade 2",clade_color := cbPalette[2]]; homologs[clade=="Clade 3",clade_color := cbPalette[8]]; homologs[clade=="SARS-CoV-2",clade_color := cbPalette[6]]
#add plotting character to homologs, by clade
homologs$clade_pch <- as.numeric(NA); homologs[clade=="Clade 1",clade_pch := 15]; homologs[clade=="Clade 2",clade_pch := 17]; homologs[clade=="Clade 3",clade_pch := 18]; homologs[clade=="SARS-CoV-2",clade_pch := 16]
for(i in 1:nrow(homologs)){
bind_lib1 <- bc_homologs_bind[library=="lib1" & target==as.character(homologs$homolog[i]),delta_log10Ka]
homologs$bind_lib1[i] <- mean(bind_lib1,na.rm=T)
homologs$bind_lib1_SE[i] <- sd(bind_lib1,na.rm=T)/sqrt(sum(!is.na(bind_lib1)))
bind_lib2 <- bc_homologs_bind[library=="lib2" & target==as.character(homologs$homolog[i]),delta_log10Ka]
homologs$bind_lib2[i] <- mean(bind_lib2,na.rm=T)
homologs$bind_lib2_SE[i] <- sd(bind_lib2,na.rm=T)/sqrt(sum(!is.na(bind_lib2)))
homologs$bind_avg[i] <- mean(c(homologs$bind_lib1[i], homologs$bind_lib2[i]))
expr_lib1 <- bc_homologs_expr[library=="lib1" & target==as.character(homologs$homolog[i]),delta_ML_meanF]
homologs$expr_lib1[i] <- median(expr_lib1,na.rm=T)
homologs$expr_lib1_SE[i] <- 1.2533*sd(expr_lib1,na.rm=T)/sqrt(sum(!is.na(expr_lib1))) #assumes normal distribution which is not quite correct, so revisit this if using SE for any hard-core statistics
expr_lib2 <- bc_homologs_expr[library=="lib2" & target==as.character(homologs$homolog[i]),delta_ML_meanF]
homologs$expr_lib2[i] <- median(expr_lib2,na.rm=T)
homologs$expr_lib2_SE[i] <- 1.2533*sd(expr_lib2,na.rm=T)/sqrt(sum(!is.na(expr_lib2)))
homologs$expr_avg[i] <- mean(c(homologs$expr_lib1[i], homologs$expr_lib2[i]))
}
```
Next, we plot the correlation in homolog phenotypes between the two replicates. (We should add error bars reflecting 95% CI when get a chance.) We can see that the binding phenotypes, which span the spectrum from slightly higher affinity than SARS-CoV-2, to complete loss of binding, correlate extremely well. The expression phenotypes correlate quite well, though not as perfectly -- this is expected, because the expression scores are a noisier phenotype, and also these homologs sample a very narrow near-neutral range, which is inherently more difficult to get perfect correlation between replicates than if some RBDs were completely non-expressing.
```{r homologs_lib1_lib2_correlation, fig.width=8,fig.height=4.25,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(1,2))
x <- homologs$bind_lib1; y <- homologs$bind_lib2; fit <- lm(y~x)
plot(x,y,pch=homologs$clade_pch,col=homologs$clade_color,xlab="delta_log10Ka, lib1",ylab="delta_log10Ka, lib2",main=paste("homolog binding, R-squared:",round(summary(fit)$r.squared,digits=4)),cex=1.4)
x <- homologs$expr_lib1; y <- homologs$expr_lib2; fit <- lm(y~x)
plot(x,y,,pch=homologs$clade_pch,col=homologs$clade_color,xlab="delta_ML_meanF, lib1",ylab="delta_ML_meanF, lib2",main=paste("homolog expression, R-squared:",round(summary(fit)$r.squared,digits=3)),cex=1.4)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_homolog-replicate-libraries_bind-expr.pdf",sep="")))
```
Output the summarized homolog phenotype scores into a table for downstream analysis and sharing.
```{r homologs_output}
homologs[,-c("clade_color","clade_pch")] %>%
mutate_if(is.numeric, round, digits=4) %>%
write.csv(file=config$homolog_effects_file, row.names=F)
```
## Validation of bulk phenotypes with additional experiments
We have conducted two series of isogenic titration assays which can validate our bulk phenotypes for binding and expression effects. The first panel of genotypes that we validated in isogenic yeast display assays are a subset of the homologs panel. The second is a series of single mutants to SARS-CoV-2. We fit titration curves to these isogenic experiments, and here we correlate these isogenic binding and expression phenotypes with those determined in the DMS assays.
First, let's read in the summary data from homolog titrations reported in `data/isogenic_titrations`. We convert the *K*<sub>D,app</sub> measurements into our log<sub>10</sub>(*K*<sub>A,app</sub>) scale, censoring the nonbinders to the same value censored in the bulk experiments (and removing the nonsensical SE estimates on these values since they were the bounded *K*<sub>D,app</sub> fit, anyway), and propagating the titration curve fit estimate of standard error in the *K*<sub>D,app</sub> measurement. We then plot isogenic versus bulk log<sub>10</sub>(*K*<sub>A,app</sub>) measurements. We can see on the left, below, that our isogenic titrations of these homologs correlate extremely well with the measurements we made in the bulk DMS assay.
We repeat this plotting for the bulk DMS expression data, which does not validate as well in the isogenic measurements -- this is more or less expected. The expression measurements are not as quality to begin with, and they're even worse in the isogenic context (as evidenced by the large standard error bars). One major complication, is that in the bulk Sort-seq measurements, we are not interefered in our measurement by the FITC- mode that is always present in a population due to presumable plasmid loss (reported in other papers, too) -- because only FITC- cells that are there because of low expression per se (rather than plasmid loss) grow out in the post-sort growth prior to PCR and sequencing, we effectively "see" the true negatives and therefore the entire FITC distribution. In the isogenic data, we filter out the FITC- observations, because we are simply measuring fluorescence without sorting and therefore cannot deconvolve the large negative mode due to plasmid loss. Furthermore, because the homologs all have such a tight distribution of expression relative to the effect of single mutants, this is a very small window of phenotypic variability in which we are trying to assess correlation using a measurement with high experimental variability. This may improve when we add some single mutant validations with larger measured expression defecits.
```{r homolog_isogenic_validations, fig.width=8, fig.height=4.75, fig.align="center",dpi=300,dev="png",echo=F}
homologs_validation <- data.table(read.csv(file="data/isogenic_titrations/results/isogenic_titrations_summary.csv"));homologs_validation <- homologs_validation[expt==200508,]; homologs_validation[genotype=="Pangolin-GD",genotype:="GD-Pangolin"]
homologs_validation[,log10Ka := log10(1/Kd),by=titration]
homologs_validation[,log10SE := 0.434*Kd_SE/Kd,by=titration]
homologs_validation[genotype %in% c("HKU3-1","BM48-31"),log10Ka:=6]
homologs_validation[genotype %in% c("HKU3-1","BM48-31"),log10SE:=NA]
homologs_validation[,delta_log10Ka:=log10Ka - homologs_validation[genotype=="SARS-CoV-2",log10Ka],by=titration]
homologs_validation[,delta_logMFI_FITC := mean_logMFI_FITC - homologs_validation[genotype=="SARS-CoV-2",mean_logMFI_FITC],by=titration]
homologs_validation[,SE_delta_logMFI_FITC := sqrt(stderr_logMFI_FITC^2 + homologs_validation[genotype=="SARS-CoV-2",stderr_logMFI_FITC]^2),by=titration]
for(i in 1:nrow(homologs)){
if(homologs$homolog[i] %in% homologs_validation$genotype){
homologs$isogenic_delta_log10Ka[i] <- homologs_validation[genotype==homologs$homolog[i],delta_log10Ka]
homologs$isogenic_SE_log10Ka[i] <- homologs_validation[genotype==homologs$homolog[i],log10SE]
homologs$isogenic_log_MFI_FITC[i] <- homologs_validation[genotype==homologs$homolog[i],delta_logMFI_FITC]
homologs$isogenic_SE_log_MFI_FITC[i] <- homologs_validation[genotype==homologs$homolog[i],SE_delta_logMFI_FITC]
homologs$isogenic_percentFITCpos[i] <- homologs_validation[genotype==homologs$homolog[i],mean_FITCpos]
homologs$isogenic_SE_percentFITCpos[i] <- homologs_validation[genotype==homologs$homolog[i],stderr_FITCpos]
homologs$bind_SEM[i] <- sd(c(homologs$bind_lib1[i],homologs$bind_lib2[i]))/sqrt(2)
homologs$expr_SEM[i] <- sd(c(homologs$expr_lib1[i],homologs$expr_lib2[i]))/sqrt(2)
}else{
homologs$isogenic_delta_log10Ka[i] <- as.numeric(NA)
homologs$isogenic_SE_log10Ka[i] <- as.numeric(NA)
homologs$isogenic_log_MFI_FITC[i] <- as.numeric(NA)
homologs$isogenic_SE_log_MFI_FITC[i] <- as.numeric(NA)
homologs$isogenic_percentFITCpos[i] <- as.numeric(NA)
homologs$isogenic_SE_percentFITCpos[i] <- as.numeric(NA)
homologs$bind_SEM[i] <- sd(c(homologs$bind_lib1[i],homologs$bind_lib2[i]))/sqrt(2)
homologs$expr_SEM[i] <- sd(c(homologs$expr_lib1[i],homologs$expr_lib2[i]))/sqrt(2)
}
}
par(mfrow=c(1,2))
plot(homologs$isogenic_delta_log10Ka,homologs$bind_avg,pch=homologs$clade_pch,col=homologs$clade_color,cex=1.4,xlab="delta log10Ka,app (isogenic validation)",ylab="delta log10Ka,app (bulk DMS)",main=paste("R-squared:",format(summary(lm(homologs$bind_avg~homologs$isogenic_delta_log10Ka))$r.squared,digits=3)))
#SE error bars on isogenic
arrows(x0=homologs$isogenic_delta_log10Ka-homologs$isogenic_SE_log10Ka, y0=homologs$bind_avg,x1=homologs$isogenic_delta_log10Ka+homologs$isogenic_SE_log10Ka, y1=homologs$bind_avg,code=3,col=homologs$clade_color,angle=90,length=0.02)
#SE error bars on bulk
arrows(x0=homologs$isogenic_delta_log10Ka, y0=homologs$bind_avg-homologs$bind_SEM, x1=homologs$isogenic_delta_log10Ka, y1=homologs$bind_avg+homologs$bind_SEM,code=3,col=homologs$clade_color,angle=90,length=0.02)
plot(homologs$isogenic_log_MFI_FITC,homologs$expr_avg,pch=homologs$clade_pch,col=homologs$clade_color,cex=1.4,xlab="delta logMFI (isogenic validation)",ylab="delta logMFI (bulk DMS)",main=paste("R-squared:",format(summary(lm(homologs$expr_avg~homologs$isogenic_log_MFI_FITC))$r.squared,digits=3)))
#SE error bars on isogenic
arrows(x0=homologs$isogenic_log_MFI_FITC-homologs$isogenic_SE_log_MFI_FITC, y0=homologs$expr_avg, x1=homologs$isogenic_log_MFI_FITC+homologs$isogenic_SE_log_MFI_FITC, y1=homologs$expr_avg, code=3,col=homologs$clade_color,angle=90,length=0.02)
#SE bars on bulk
arrows(x0=homologs$isogenic_log_MFI_FITC, y0=homologs$expr_avg-homologs$expr_SEM, x1=homologs$isogenic_log_MFI_FITC, y1=homologs$expr_avg+homologs$expr_SEM, code=3,col=homologs$clade_color,angle=90,length=0.02)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_homolog-bulk-v-isogeniic_bind-expr.pdf",sep="")))
```
Second, single mutants of SARS-CoV-2. We can see expression and binding effects validate extremely well in this panel!
```{r single_mut_isogenic_validations, fig.width=8, fig.height=4.25, fig.align="center",dpi=300,dev="png",echo=F}
single_mut_validation <- data.table(read.csv(file="data/isogenic_titrations/results/point_mutants_summary.csv"))
single_mut_validation[,log10Ka := log10(1/Kd),by=titration]
single_mut_validation[,log10SE := 0.434*Kd_SE/Kd,by=titration]
single_mut_validation[log10Ka<6,log10Ka:=6]
single_mut_validation[,delta_log10Ka:=log10Ka - single_mut_validation[genotype=="wildtype",log10Ka],by=titration]
single_mut_validation[,delta_logMFI_FITC := mean_logMFI_FITC - single_mut_validation[genotype=="wildtype",mean_logMFI_FITC],by=titration]
single_mut_validation[,SE_delta_logMFI_FITC := sqrt(stderr_logMFI_FITC^2 + single_mut_validation[genotype=="wildtype",stderr_logMFI_FITC]^2),by=titration]
#add bulk values
for(i in 1:nrow(single_mut_validation)){
if(single_mut_validation$genotype[i] == "wildtype"){
single_mut_validation$bulk_delta_log10Ka[i] <- 0
single_mut_validation$bulk_delta_logMFI[i] <- 0
}else{
single_mut_validation$bulk_delta_log10Ka[i] <- betas[mutation==single_mut_validation$genotype[i], bind_avg]
single_mut_validation$bulk_delta_logMFI[i] <- betas[mutation==single_mut_validation$genotype[i], expr_avg]
}
}
par(mfrow=c(1,2))
x <- single_mut_validation$delta_log10Ka; y <- single_mut_validation$bulk_delta_log10Ka; fit <- lm(y~x)
plot(x,y,pch=16,cex=1.2,xlab="mutation effect on binding (isogenic validation)",ylab="mutation effect on binding (bulk DMS)",main=paste("R-squared:",format(summary(fit)$r.squared,digits=3)));abline(h=0,lty=2,col="gray50");abline(v=0,lty=2,col="gray50")
x <- single_mut_validation$delta_logMFI_FITC; y <- single_mut_validation$bulk_delta_logMFI; fit <- lm(y~x)
plot(x,y,pch=16,cex=1.42,xlab="mutation effect on expression (isogenic validation)",ylab="mutation effect on expression (bulk DMS)",main=paste("R-squared:",format(summary(fit)$r.squared,digits=3)));abline(h=0,lty=2,col="gray50");abline(v=0,lty=2,col="gray50")
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_single-mut-bulk-v-isogenic_bind-expr.pdf",sep=""),useDingbats=F))
```
Furthermore, we can turn to other assays to validate the context of our yeast-based DMS measurements. First, we look at the functional measurements from [Letko et al. 2020](https://www.nature.com/articles/s41564-020-0688-y), who determined the ability of all RBD variants known at that time (so, no RaTG13 or Pangolin RBD) to enter cells as chimeras with SARS-CoV-1 Spike in VSV-G pseudotyped viruses and a luciferase reporter of cellular entry. We integrated measurements across several of their experimental panels, including Figs. 1e, 2a and 3b, and 3d. For 1e, 2a, and 3d, entry was measured in ACE2-expressing BHK cells, so the experiments should be more directly comparable -- in 3b, they measured entry into Vero cells. For each experiment, we calculate the mean entry from triplicate measurements, and calculate the variant fold-entry relative to the mean SARS-CoV-1 entry (which is present in each panel). We then calculate the mean and SEM across each of the experiments for which a genotype was measured. We present comparisons of our measurements to the average fold-entry relative to SARS-CoV-1 across all four experiments on the left, and just the BHK experiments on the right. Error bars are SEM from between 2 and 4 measurements; the SARS-CoV-1 point does not have error bars as it is defined as 1 across each experiment, and SARS-CoV-2 does not have error bars because it was only included in one of the experiments. As we can see below, the variants for which we measure no binding also do not support cellular entry. Among the genotypes for which we measure quantitative differences in binding affinity, we can see these differences correlate with the extent of cellular entry, though the functional entry measurements show some variability depending whether or not we include the additional Vero cells measurements. (Should we "break" the y-axis to more clearly present these data?)
```{r homologs_DMS_v_functional_pvirus_entry, fig.width=9, fig.height=4.75, fig.align="center",dpi=300,dev="png",echo=F}
Letko <- read.csv(file="data/lit-measurements/Letko_2020_data.csv",stringsAsFactors=F); Letko <- Letko[Letko$Figure %in% c("combined_BHK_Vero", "combined_BHK"),c("RBD","Figure","avg","SEM","n_measure")]
for(i in 1:nrow(homologs)){
if(homologs$homolog[i] %in% Letko$RBD){
homologs$Letko_mean_entry_all[i] <- Letko[Letko$RBD==homologs$homolog[i] & Letko$Figure=="combined_BHK_Vero","avg"]
homologs$Letko_SEM_entry_all[i] <- as.numeric(Letko[Letko$RBD==homologs$homolog[i] & Letko$Figure=="combined_BHK_Vero","SEM"])
homologs$Letko_mean_entry_BHK[i] <- Letko[Letko$RBD==homologs$homolog[i] & Letko$Figure=="combined_BHK","avg"]
homologs$Letko_SEM_entry_BHK[i] <- as.numeric(Letko[Letko$RBD==homologs$homolog[i] & Letko$Figure=="combined_BHK","SEM"])
}else{
homologs$Letko_mean_entry_all[i] <- as.numeric(NA)
homologs$Letko_SEM_entry_all[i] <- as.numeric(NA)
homologs$Letko_mean_entry_BHK[i] <- as.numeric(NA)
homologs$Letko_SEM_entry_BHK[i] <- as.numeric(NA)
}
}
par(mfrow=c(1,2))
plot(homologs$Letko_mean_entry_all,homologs$bind_avg,pch=homologs$clade_pch,col=homologs$clade_col,xlab="Cellular entry (a.u.), Letko et al. (all panels)",ylab="delta log10Ka,app (bulk DMS)",cex=1.4)
#SE error bars on Letko values
arrows(x0=homologs$Letko_mean_entry_all-homologs$Letko_SEM_entry_all, y0=homologs$bind_avg,x1=homologs$Letko_mean_entry_all+homologs$Letko_SEM_entry_all, y1=homologs$bind_avg,code=3,col=homologs$clade_color,angle=90,length=0.02)
#SE error bars on bulk avg measurement
arrows(x0=homologs$Letko_mean_entry_all, y0=homologs$bind_avg-homologs$bind_SEM, x1=homologs$Letko_mean_entry_all, y1=homologs$bind_avg+homologs$bind_SEM,code=3,col=homologs$clade_color,angle=90,length=0.02)
plot(homologs$Letko_mean_entry_BHK,homologs$bind_avg,pch=homologs$clade_pch,col=homologs$clade_col,xlab="Cellular entry (a.u.), Letko et al. (BHK panels only)",ylab="delta log10Ka,app (bulk DMS)",cex=1.4)
#SE error bars on Letko values
arrows(x0=homologs$Letko_mean_entry_BHK-homologs$Letko_SEM_entry_BHK, y0=homologs$bind_avg,x1=homologs$Letko_mean_entry_BHK+homologs$Letko_SEM_entry_BHK, y1=homologs$bind_avg,code=3,col=homologs$clade_color,angle=90,length=0.02)
#SE error bars on bulk avg measurement
arrows(x0=homologs$Letko_mean_entry_BHK, y0=homologs$bind_avg-homologs$bind_SEM, x1=homologs$Letko_mean_entry_BHK, y1=homologs$bind_avg+homologs$bind_SEM,code=3,col=homologs$clade_color,angle=90,length=0.02)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_homolog-bulk-v-Letko-entry.pdf",sep="")))
```
We'll also output these comparisons in tabular form
```{r homologs_bulk_v_functional_pvirus_entry_table}
kable(homologs[!is.na(Letko_mean_entry_all) & !is.na(bind_avg),.(homolog, clade, bind_avg, bind_SEM,
Letko_mean_entry_all, Letko_SEM_entry_all, Letko_mean_entry_BHK,
Letko_SEM_entry_BHK)],
col.names=c("Homolog","Clade","DMS delta-log10Ka","SE","Letko entry, all (a.u.)","SE","Letko entry, BHK (a.u.)","SE"), digits=3)
```
We can also validate our *isogenic* measurements against the Letko et al. phenotypes.
```{r homologs_isogenic_v_functional_pvirus_entry, fig.width=9, fig.height=4.75, fig.align="center",dpi=300,dev="png",echo=F}
par(mfrow=c(1,2))
plot(homologs$Letko_mean_entry_all,homologs$isogenic_delta_log10Ka,pch=homologs$clade_pch,col=homologs$clade_col,xlab="Cellular entry (a.u.), Letko et al. (all panels)",ylab="delta log10Ka,app (isogenic titration)",cex=1.4)
#SE error bars on Letko values
arrows(x0=homologs$Letko_mean_entry_all-homologs$Letko_SEM_entry_all, y0=homologs$isogenic_delta_log10Ka,x1=homologs$Letko_mean_entry_all+homologs$Letko_SEM_entry_all, y1=homologs$isogenic_delta_log10Ka,code=3,col=homologs$clade_color,angle=90,length=0.02)
#SE error bars on isogenic Kd measurement
arrows(x0=homologs$Letko_mean_entry_all, y0=homologs$isogenic_delta_log10Ka-homologs$isogenic_SE_log10Ka, x1=homologs$Letko_mean_entry_all, y1=homologs$isogenic_delta_log10Ka+homologs$isogenic_SE_log10Ka,code=3,col=homologs$clade_color,angle=90,length=0.02)
plot(homologs$Letko_mean_entry_BHK,homologs$isogenic_delta_log10Ka,pch=homologs$clade_pch,col=homologs$clade_col,xlab="Cellular entry (a.u.), Letko et al. (BHK panels only)",ylab="delta log10Ka,app (isogenic titration)",cex=1.4)
#SE error bars on Letko values
arrows(x0=homologs$Letko_mean_entry_BHK-homologs$Letko_SEM_entry_BHK, y0=homologs$isogenic_delta_log10Ka,x1=homologs$Letko_mean_entry_BHK+homologs$Letko_SEM_entry_BHK, y1=homologs$isogenic_delta_log10Ka,code=3,col=homologs$clade_color,angle=90,length=0.02)
#SE error bars on bulk avg measurement
arrows(x0=homologs$Letko_mean_entry_BHK, y0=homologs$isogenic_delta_log10Ka-homologs$isogenic_SE_log10Ka, x1=homologs$Letko_mean_entry_BHK, y1=homologs$isogenic_delta_log10Ka+homologs$isogenic_SE_log10Ka,code=3,col=homologs$clade_color,angle=90,length=0.02)
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_homolog-isogenic-v-Letko-entry.pdf",sep="")))
```
Table of isogenic measurements vs Letko phenotype:
```{r homologs_isogenic_v_functional_pvirus_entry_table}
kable(homologs[!is.na(Letko_mean_entry_all) & !is.na(isogenic_delta_log10Ka),.(homolog, clade, isogenic_delta_log10Ka, isogenic_SE_log10Ka,
Letko_mean_entry_all, Letko_SEM_entry_all, Letko_mean_entry_BHK,
Letko_SEM_entry_BHK)],
col.names=c("Homolog","Clade","Isogenic delta-log10Ka","SE","Letko entry, all (a.u.)","SE","Letko entry, BHK (a.u.)","SE"), digits=3)
```
## Relationship between expression and binding fits at a per-barcode level
In theory, titration assays should normalize out expression effects of mutations, and therefore be less susceptible to expression-induced artifacts in binding scores than traditional single-concentration yeast display assays. We will dig into this a bit more in the `structure_function.Rmd` notebook for our single-mutant effect parameters, but the general premise is that in classic single-concentration yeast display assays, simple enhancement of surface expression leads to higher ligand labeling, even if the underlying *K*<sub>D</sub> is unchanged in the mutant variant. This leads to *uninteresting* simple correlation between expression and binding in DMS type data. In contrast, our binding phenotype is determined from a self-contained titration series for each barcode, for which the response plateau and baseline can vary as fit parameters. This assay should therefore be less susceptible to expression artifacts, because global expression changes can be accounted for with variation in the response parameter -- therefore, any remaining correlation between expression and binding should be due to actual biological correlation (i.e. mutations that destabilize the protein will also intrinsically decrease the thermodynamic affinity). One premise of this argument is that the response parameter of the titration curve fits correlates with expression effects.
Below, we generate plots for all barcodes showing the relationship between the barcode's response fit value and the expression phenotype for that barcode. We exclude barcodes with titration fit Kds > 1e-7, as these response variables are just randomly guessed since there was no actual response to fit across our concentration range.
```{r response_v_expression, fig.width=4, fig.height=4.25, fig.align="center",dpi=300,dev="png",echo=F}
bc_dt <- merge(bc_bind[,.(library, target, barcode, variant_call_support, avgcount, log10Ka, delta_log10Ka, log10SE, response, baseline, nMSR, variant_class, aa_substitutions, n_aa_substitutions)], bc_expr[,.(library, target, barcode, variant_call_support, total_count, ML_meanF, delta_ML_meanF, var_ML_meanF, variant_class, aa_substitutions, n_aa_substitutions)], sort=F)
plot(bc_dt[!is.na(log10Ka) & log10Ka > 8,ML_meanF],bc_dt[!is.na(log10Ka) & log10Ka > 8,response],pch=16,col="#00000002",xlab="barcode expression",ylab="barcode titration response parameter")
invisible(dev.print(pdf, paste(config$single_mut_effects_dir,"/correlation_bc-response-v-expression.pdf",sep="")))
```