Feedback
----------
Add plotting routines for MV models:
model.plot_screeplot(...), etc.
More natural API:
model.hotellings_t2
model.squared_prediction_error -> model.spe
Plots should accept a pc_depth argument; when given, a 3D plot is made instead.
PLS has no SPE limit
R2* attributes from MV models are not filled in.
Printing model object gives: 'PCA_missing_values' object has no attribute 'copy'
Implement "fit_transform" method from PCA in sklearn
Add a post-filter:
* set all values which are within +/- eps (absolute) of zero; force them to zero
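A minimal sketch of such a post-filter (the function name and eps default are placeholders, not existing API):

import numpy as np

def zero_small_values(arr, eps=1e-12):
    # Hypothetical post-filter: snap entries within +/- eps of zero to exactly zero.
    out = np.asarray(arr, dtype=float).copy()
    out[np.abs(out) <= eps] = 0.0
    return out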
-----------
Alignment in general:
Get a percentage scale for the x-axis, at a specified resolution.
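A minimal sketch, assuming simple linear interpolation onto a 0-100 % axis; the function name and default resolution are placeholders:

import numpy as np

def to_percentage_scale(y, resolution=1.0):
    # Resample one aligned trajectory onto a 0..100 % axis with the
    # given resolution (in percentage points).
    y = np.asarray(y, dtype=float)
    x_old = np.linspace(0.0, 100.0, num=y.size)
    x_new = np.arange(0.0, 100.0 + resolution, resolution)
    return x_new, np.interp(x_new, x_old, y)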
====
Missing data in batch
# Fill in missing values
def fill_na_values(df):
    """Very very crude method for now"""
    return df.fillna(method='bfill').fillna(method='ffill')

missing_filled = {}
for batch_id, batch in df_dict.items():
    missing_filled[batch_id] = fill_na_values(batch)
-----
Add tests for f_cross and f_elbow
------
Elbow method:
https://github.com/Mathemilda/Numeric_ElbowMethod_For_K-means/blob/master/EstimatedClusterNumberWithWCSS.py
--
Add a prediction interval function for linear regression: accepts any x input and returns the PI.
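A minimal sketch for simple (single-x) least squares using the standard PI formula; the function name and signature are placeholders:

import numpy as np
from scipy import stats

def prediction_interval(x, y, x_new, alpha=0.05):
    # Returns (lower, upper) of the 100*(1-alpha) % prediction interval
    # at x_new, for a straight-line fit y = b0 + b1*x.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    b1, b0 = np.polyfit(x, y, deg=1)
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))      # residual std. error
    Sxx = np.sum((x - x.mean()) ** 2)
    se = s * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / Sxx)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    y_new = b0 + b1 * x_new
    return y_new - t_crit * se, y_new + t_crit * se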
---
Batch data: missing values and smoothing tool:
* LOWESS and Savitzky-Golay (SG) filters for batch data
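A minimal sketch of the SG half, via scipy.signal.savgol_filter (window and polynomial order defaults are placeholders):

import numpy as np
from scipy.signal import savgol_filter

def smooth_batch_trajectory(y, window=15, polyorder=3):
    # window must be odd and larger than polyorder; fill missing values
    # first (e.g. with fill_na_values above), since NaNs propagate.
    return savgol_filter(np.asarray(y, dtype=float), window, polyorder)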
-----
Add documentation
-----
Product development notebook:
Output:
3 DOFs available in general
1 DOF
2 DOF
PCA space for the outputs, on 7 properties A to G.
Shows the correlations and quickly reveals gaps.
PCA on processing settings and blend properties (RX), related to the scores. There is often a 1-to-1 relationship of the outputs to the scores, so this is valid.
Use the 5 properties A to E from the foods data set. Show that constraints are also possible. Indicates multiple solutions. Also mention the null space of multiple solutions. T -> X.
Then go to the cinac case study. Show that the LVs are interesting and have meaning. We can design new products here. Y -> T.
Now we can also go directly from Y to X, with an NLP solution. But it goes via the scores T, and it can handle additional constraints.
Would be cool to demo this.
Lastly, how do you start out?
* DB is partially available.
* R selected from a regular mixture DoE.
* Calculate XR. Do a DoE in this space to select a subset of experimental conditions.
* Run those experiments.
* Start up your mixture model.
* See where there are (literally) holes in the spaces.
* Use LV model inversion to find which process settings, ratios, and materials are needed to fill those gaps.
* Targeted DoE.
How raw material properties affect the final product properties is made clear in the PLS loadings plot. Again, something that is not seen in an NN model.
Build a model explorer for the PLS model of Muteki for the rubbers, showing how the blend properties correlate with the 8 outputs.
Show the approach on p. 23 of the Muteki paper for how to start out: you can use what you have and add to it via a D-optimal DoE.
Show figure 4 and emphasize that the full 111 experiments were not needed; only 17.
-----
---
Raincloud plot
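A minimal matplotlib sketch (half-violin cloud, jittered rain, narrow box); all layout and styling choices are placeholders:

import numpy as np
import matplotlib.pyplot as plt

def raincloud(values, ax=None):
    ax = ax or plt.gca()
    # Cloud: clip the violin body to its right half
    vp = ax.violinplot([values], positions=[1.0], showextrema=False)
    for body in vp["bodies"]:
        verts = body.get_paths()[0].vertices
        verts[:, 0] = np.clip(verts[:, 0], 1.0, None)
    # Box in the middle, rain (jittered points) to the left
    ax.boxplot([values], positions=[1.0], widths=0.05, showfliers=False)
    rng = np.random.default_rng(0)
    ax.scatter(0.8 + 0.1 * rng.random(len(values)), values, s=8, alpha=0.5)
    return ax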
---
Don't force dict keys to be strings!!
SOMEHOW, batch_id gets added as the first column, causing crashes later, after alignment is done,
in "align_with_path(md_path, batch, initial_row)", line: synced.iloc[row, :] = np.nanmean(temp, axis=0)
-----
Univariate:
* does it come from a given distribution?
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
Outlier detection: https://pyod.readthedocs.io/en/latest/
P7: Grubbs' test / Tietjen-Moore (see the sketch after these links):
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm
Multiple outliers: *** https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
https://www.itl.nist.gov/div898//software/dataplot/refman1/auxillar/grubtest.htm
https://www.itl.nist.gov/div898//software/dataplot/refman1/auxillar/tietjen.htm
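A minimal sketch of the two-sided, single-outlier Grubbs test, following the NIST formulas (the function name is a placeholder):

import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    # G = max |x_i - mean| / s; reject "no outliers" if G > G_crit.
    x = np.asarray(x, dtype=float)
    n = x.size
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t ** 2 / (n - 2 + t ** 2))
    return G, G_crit, G > G_crit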
-----
Comparisons
* Implement Holm test: https://stats.libretexts.org/Bookshelves/Applied_Statistics/Book%3A_Learning_Statistics_with_R_-_A_tutorial_for_Psychology_Students_and_other_Beginners_(Navarro)/14%3A_Comparing_Several_Means_(One-way_ANOVA)/14.06%3A_Multiple_Comparisons_and_Post_Hoc_Tests
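A minimal sketch of the Holm (step-down Bonferroni) procedure; names are placeholders:

import numpy as np

def holm_reject(pvalues, alpha=0.05):
    # Compare the j-th smallest p-value against alpha / (m - j); stop at
    # the first failure (step-down), rejecting everything before it.
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject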
------
Multivariate
VIP for PCA and PLS (see the sketch after this list)
Can you calculate prediction intervals for PLS?
Confidence intervals for jackknifed coefficients?
Contribution plots
Use the column in X with the greatest variance for the starting score, after iteration 1
Contribution plot calculation
PLS with missing values: code below
* PLS with TSR methods:
https://riunet.upv.es/bitstream/id/303213/PCA%20model%20building%20with%20missing%20data%20new%20proposals%20and%20a%20comparative%20study%20-%20Folch-Fortuny.pdf
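A minimal VIP sketch for a single-y PLS model; the inputs (W: K x A weights, T: N x A scores, q: A y-loadings) are assumptions about how the fitted model is stored:

import numpy as np

def vip(W, T, q):
    # VIP_j = sqrt( K * sum_a ssy_a * (w_ja / ||w_a||)^2 / sum_a ssy_a )
    W = np.asarray(W, dtype=float)
    T = np.asarray(T, dtype=float)
    q = np.asarray(q, dtype=float).ravel()
    K = W.shape[0]
    ssy = (q ** 2) * np.sum(T ** 2, axis=0)   # y-variance explained per component
    Wn = W / np.linalg.norm(W, axis=0)        # normalize each weight vector
    return np.sqrt(K * (Wn ** 2) @ ssy / ssy.sum())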
------------
function v = robust_scale(a)
    % Using the formula from Mosteller and Tukey, Data Analysis and Regression,
    % p 207-208, 1977.
    n = numel(a);
    location = median(a);
    spread_MAD = median(abs(a - location));
    ui = (a - location) / (6 * spread_MAD);
    % Valid u_i values used in the summation:
    vu = ui.^2 <= 1;
    num = (a(vu) - location).^2 .* (1 - ui(vu).^2).^4;
    den = (1 - ui(vu).^2) .* (1 - 5*ui(vu).^2);
    v = n * sum(num) / (sum(den))^2;
end
#---------
Robust PLS investigated?
S. Serneels, C. Croux, P. Filzmoser, P.J. Van Espen, Partial Robust M-regression, Chemometrics and Intelligent Laboratory Systems, 79 (2005), 55-64
#-----------
# Assumes: import numpy as np; from scipy import stats
def calc_limits(self):
    """
    Calculate the limits for the latent variable model.

    References
    ----------
    [1] SPE limits: Nomikos and MacGregor, Multivariate SPC Charts for
        Monitoring Batch Processes. Technometrics, 37, 41-59, 1995.

    [2] T2 limits: Johnstone and Wischern?

    [3] Score limits: two methods

        A: Assume that scores are from a two-sided t-distribution with N-1
           degrees of freedom. Based on the central limit theorem.

        B: (t_a/s_a)^2 ~ F_alpha(1, N-1) distribution if scores are
           assumed to be normally distributed, and s_a is a chi-squared
           variable with N-1 DOF.

               critical F = scipy.stats.f.ppf(0.95, 1, N-1)

           which happens to be equal to (scipy.stats.t.ppf(0.975, N-1))^2,
           as expected. Therefore the alpha limit for t_a is equal to

               np.sqrt(scipy.stats.f.ppf(0.95, 1, N-1)) * S[:, a]

        Both methods give the same limits. In fact, some previous code was:

            t_ppf_95 = scipy.stats.t.ppf(0.975, N-1)
            S[:, a] = np.std(this_lv, ddof=0, axis=0)
            lim.t['95.0'][a, :] = t_ppf_95 * S[:, a]

        which assumes the scores were t-distributed. In fact, the scores
        are not t-distributed; only (score_a/s_a) is t-distributed, and
        the scores are NORMALLY distributed:

            S[:, a] = np.std(this_lv, ddof=0, axis=0)
            lim.t['95.0'][a, :] = n_ppf_95 * S[:, a]
            lim.t['99.0'][a, :] = n_ppf_99 * S[:, a]

        From the CLT: we divide by N, not N-1, but the stddev is
        calculated with the N-1 divisor.
    """
    for block in self.blocks:
        N = block.N
        # SPE limits using Nomikos and MacGregor approximation
        for a in range(block.A):
            SPE_values = block.stats.SPE[:, a]
            var_SPE = np.var(SPE_values, ddof=1)
            avg_SPE = np.mean(SPE_values)
            chi2_mult = var_SPE / (2.0 * avg_SPE)
            chi2_DOF = (2.0 * avg_SPE ** 2) / var_SPE
            for siglevel_str in block.lim.SPE.keys():
                siglevel = float(siglevel_str) / 100
                block.lim.SPE[siglevel_str][:, a] = chi2_mult * stats.chi2.ppf(
                    siglevel, chi2_DOF
                )

            # For batch blocks: calculate instantaneous SPE using a window
            # of width = 2w+1 (default value for w=2).
            # This allows for (2w+1)*N observations to be used to calculate
            # the SPE limit, instead of just the usual N observations.
            #
            # Also for batch systems:
            # low values of chi2_DOF: large variability of only a few variables
            # high values: more stable periods: all k's contribute

            for siglevel_str in block.lim.T2.keys():
                siglevel = float(siglevel_str) / 100
                mult = (a + 1) * (N - 1) * (N + 1) / (N * (N - (a + 1)))
                limit = stats.f.ppf(siglevel, a + 1, N - (a + 1))
                block.lim.T2[siglevel_str][:, a] = mult * limit

            for siglevel_str in block.lim.t.keys():
                alpha = (1 - float(siglevel_str) / 100.0) / 2.0
                n_ppf = stats.norm.ppf(1 - alpha)
                block.lim.t[siglevel_str][:, a] = n_ppf * block.S[a]