Feedback
----------
Add plotting routines for MV models:
model.plot_screeplot(...), etc.
More natural API:
model.hotellings_t2
model.squared_prediction_error -> model.spe
Plots should accept a pc_depth argument; when given, a 3D plot is made instead.
PLS has no SPE limit
R2* attributes from MV models are not filled in.
Printing model object gives: 'PCA_missing_values' object has no attribute 'copy'
Implement "fit_transform" method from PCA in sklearn
Add a post-filter:
* set all values which are within +/- eps (absolute) of zero; force them to zero
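A minimal sketch of such a post-filter (the function name and eps default are placeholders, not existing API):

import numpy as np

def zero_small_values(arr, eps=1e-12):
    # Hypothetical post-filter: snap entries within +/- eps of zero to exactly zero.
    out = np.asarray(arr, dtype=float).copy()
    out[np.abs(out) <= eps] = 0.0
    return out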
-----------
Alignment in general:
Get a percentage scale for the x-axis, at a specified resolution.
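A minimal sketch, assuming simple linear interpolation onto a 0-100 % axis; the function name and default resolution are placeholders:

import numpy as np

def to_percentage_scale(y, resolution=1.0):
    # Resample one aligned trajectory onto a 0..100 % axis with the
    # given resolution (in percentage points).
    y = np.asarray(y, dtype=float)
    x_old = np.linspace(0.0, 100.0, num=y.size)
    x_new = np.arange(0.0, 100.0 + resolution, resolution)
    return x_new, np.interp(x_new, x_old, y)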
====
Missing data in batch
# Fill in missing values
def fill_na_values(df):
    """Very very crude method for now"""
    return df.fillna(method='bfill').fillna(method='ffill')

missing_filled = {}
for batch_id, batch in df_dict.items():
    missing_filled[batch_id] = fill_na_values(batch)
-----
Add tests for f_cross and f_elbow
------
Elbow method:
https://github.com/Mathemilda/Numeric_ElbowMethod_For_K-means/blob/master/EstimatedClusterNumberWithWCSS.py
--
Add a prediction interval function for linear regression: accepts any x input and returns the PI.
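A minimal sketch for simple (single-x) least squares using the standard PI formula; the function name and signature are placeholders:

import numpy as np
from scipy import stats

def prediction_interval(x, y, x_new, alpha=0.05):
    # Returns (lower, upper) of the 100*(1-alpha) % prediction interval
    # at x_new, for a straight-line fit y = b0 + b1*x.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    b1, b0 = np.polyfit(x, y, deg=1)
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))      # residual std. error
    Sxx = np.sum((x - x.mean()) ** 2)
    se = s * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / Sxx)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    y_new = b0 + b1 * x_new
    return y_new - t_crit * se, y_new + t_crit * se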
---
Batch data: missing values and smoothing tool:
* LOWESS and Savitzky-Golay (SG) filters for batch data
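A minimal sketch of the SG half, via scipy.signal.savgol_filter (window and polynomial order defaults are placeholders):

import numpy as np
from scipy.signal import savgol_filter

def smooth_batch_trajectory(y, window=15, polyorder=3):
    # window must be odd and larger than polyorder; fill missing values
    # first (e.g. with fill_na_values above), since NaNs propagate.
    return savgol_filter(np.asarray(y, dtype=float), window, polyorder)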
-----
Add documentation
-----
Product development notebook:
Output:
3 DOFs available in general
1 DOF
2 DOF
PCA space for the outputs, on 7 properties A to G.
Shows the correlations and quickly reveals gaps.
PCA on processing settings and blend properties (RX), related to the scores. There is often a 1-to-1 relationship of the outputs to the scores, so this is valid.
Use the 5 properties A to E from the foods data set. Show that constraints are also possible. Indicates multiple solutions. Also mention the null space of multiple solutions. T -> X.
Then go to the cinac case study. Show that the LVs are interesting and have meaning. We can design new products here. Y -> T.
Now we can also go directly from Y to X, with an NLP solution. But it goes via the scores T, and it can handle additional constraints.
Would be cool to demo this.
Lastly, how do you start out?
* DB is partially available.
* R selected from a regular mixture DoE.
* Calculate XR. Do a DoE in this space to select a subset of experimental conditions.
* Run those experiments.
* Start up your mixture model.
* See where there are (literally) holes in the spaces.
* Use LV model inversion to find which process settings, ratios, and materials are needed to fill those gaps.
* Targeted DoE.
How raw material properties affect the final product properties is made clear in the PLS loadings plot. Again, something that is not seen in an NN model.
Build a model explorer for the PLS model of Muteki for the rubbers, showing how the blend properties correlate with the 8 outputs.
Show the approach on p. 23 of the Muteki paper for how to start out: you can use what you have and add to it via a D-optimal DoE.
Show figure 4 and emphasize that the full 111 experiments were not needed; only 17.
-----
---
Raincloud plot
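A minimal matplotlib sketch (half-violin cloud, jittered rain, narrow box); all layout and styling choices are placeholders:

import numpy as np
import matplotlib.pyplot as plt

def raincloud(values, ax=None):
    ax = ax or plt.gca()
    # Cloud: clip the violin body to its right half
    vp = ax.violinplot([values], positions=[1.0], showextrema=False)
    for body in vp["bodies"]:
        verts = body.get_paths()[0].vertices
        verts[:, 0] = np.clip(verts[:, 0], 1.0, None)
    # Box in the middle, rain (jittered points) to the left
    ax.boxplot([values], positions=[1.0], widths=0.05, showfliers=False)
    rng = np.random.default_rng(0)
    ax.scatter(0.8 + 0.1 * rng.random(len(values)), values, s=8, alpha=0.5)
    return ax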
---
Don't force dict keys to be strings!!
SOMEHOW, batch_id gets added as the first column, causing crashes later, after alignment is done,
in "align_with_path(md_path, batch, initial_row)", line: synced.iloc[row, :] = np.nanmean(temp, axis=0)
-----
Univariate:
* does it come from a given distribution?
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
Outlier detection: https://pyod.readthedocs.io/en/latest/
P7: Grubbs' test / Tietjen-Moore (see the sketch after these links):
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm
Multiple outliers: *** https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
https://www.itl.nist.gov/div898//software/dataplot/refman1/auxillar/grubtest.htm
https://www.itl.nist.gov/div898//software/dataplot/refman1/auxillar/tietjen.htm
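A minimal sketch of the two-sided, single-outlier Grubbs test, following the NIST formulas (the function name is a placeholder):

import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    # G = max |x_i - mean| / s; reject "no outliers" if G > G_crit.
    x = np.asarray(x, dtype=float)
    n = x.size
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t ** 2 / (n - 2 + t ** 2))
    return G, G_crit, G > G_crit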
-----
Comparisons
* Implement Holm test: https://stats.libretexts.org/Bookshelves/Applied_Statistics/Book%3A_Learning_Statistics_with_R_-_A_tutorial_for_Psychology_Students_and_other_Beginners_(Navarro)/14%3A_Comparing_Several_Means_(One-way_ANOVA)/14.06%3A_Multiple_Comparisons_and_Post_Hoc_Tests
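A minimal sketch of the Holm (step-down Bonferroni) procedure; names are placeholders:

import numpy as np

def holm_reject(pvalues, alpha=0.05):
    # Compare the j-th smallest p-value against alpha / (m - j); stop at
    # the first failure (step-down), rejecting everything before it.
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject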
------
Multivariate
VIP for PCA and PLS (see the sketch after this list)
Can you calculate prediction intervals for PLS?
Confidence intervals for jackknifed coefficients?
Contribution plots
Use the column in X with the greatest variance for the starting score, after iteration 1
Contribution plot calculation
PLS with missing values: code below
* PLS with TSR methods:
https://riunet.upv.es/bitstream/id/303213/PCA%20model%20building%20with%20missing%20data%20new%20proposals%20and%20a%20comparative%20study%20-%20Folch-Fortuny.pdf
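A minimal VIP sketch for a single-y PLS model; the inputs (W: K x A weights, T: N x A scores, q: A y-loadings) are assumptions about how the fitted model is stored:

import numpy as np

def vip(W, T, q):
    # VIP_j = sqrt( K * sum_a ssy_a * (w_ja / ||w_a||)^2 / sum_a ssy_a )
    W = np.asarray(W, dtype=float)
    T = np.asarray(T, dtype=float)
    q = np.asarray(q, dtype=float).ravel()
    K = W.shape[0]
    ssy = (q ** 2) * np.sum(T ** 2, axis=0)   # y-variance explained per component
    Wn = W / np.linalg.norm(W, axis=0)        # normalize each weight vector
    return np.sqrt(K * (Wn ** 2) @ ssy / ssy.sum())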
------------
function v = robust_scale(a)
    % Using the formula from Mosteller and Tukey, Data Analysis and Regression,
    % p 207-208, 1977.
    n = numel(a);
    location = median(a);
    spread_MAD = median(abs(a - location));
    ui = (a - location) / (6 * spread_MAD);
    % Valid u_i values used in the summation:
    vu = ui.^2 <= 1;
    num = (a(vu) - location).^2 .* (1 - ui(vu).^2).^4;
    den = (1 - ui(vu).^2) .* (1 - 5*ui(vu).^2);
    v = n * sum(num) / (sum(den))^2;
end
#---------
Robust PLS investigated?
S. Serneels, C. Croux, P. Filzmoser, P.J. Van Espen, Partial Robust M-regression, Chemometrics and Intelligent Laboratory Systems, 79 (2005), 55-64
#-----------
# Assumes: import numpy as np; from scipy import stats
def calc_limits(self):
    """
    Calculate the limits for the latent variable model.

    References
    ----------
    [1] SPE limits: Nomikos and MacGregor, Multivariate SPC Charts for
        Monitoring Batch Processes. Technometrics, 37, 41-59, 1995.

    [2] T2 limits: Johnstone and Wischern?

    [3] Score limits: two methods

        A: Assume that scores are from a two-sided t-distribution with N-1
           degrees of freedom. Based on the central limit theorem.

        B: (t_a/s_a)^2 ~ F_alpha(1, N-1) distribution if scores are
           assumed to be normally distributed, and s_a is a chi-squared
           variable with N-1 DOF.

               critical F = scipy.stats.f.ppf(0.95, 1, N-1)

           which happens to be equal to (scipy.stats.t.ppf(0.975, N-1))^2,
           as expected. Therefore the alpha limit for t_a is equal to

               np.sqrt(scipy.stats.f.ppf(0.95, 1, N-1)) * S[:, a]

        Both methods give the same limits. In fact, some previous code was:

            t_ppf_95 = scipy.stats.t.ppf(0.975, N-1)
            S[:, a] = np.std(this_lv, ddof=0, axis=0)
            lim.t['95.0'][a, :] = t_ppf_95 * S[:, a]

        which assumes the scores were t-distributed. In fact, the scores
        are not t-distributed; only (score_a/s_a) is t-distributed, and
        the scores are NORMALLY distributed:

            S[:, a] = np.std(this_lv, ddof=0, axis=0)
            lim.t['95.0'][a, :] = n_ppf_95 * S[:, a]
            lim.t['99.0'][a, :] = n_ppf_99 * S[:, a]

        From the CLT: we divide by N, not N-1, but the stddev is
        calculated with the N-1 divisor.
    """
    for block in self.blocks:
        N = block.N
        # SPE limits using Nomikos and MacGregor approximation
        for a in range(block.A):
            SPE_values = block.stats.SPE[:, a]
            var_SPE = np.var(SPE_values, ddof=1)
            avg_SPE = np.mean(SPE_values)
            chi2_mult = var_SPE / (2.0 * avg_SPE)
            chi2_DOF = (2.0 * avg_SPE ** 2) / var_SPE
            for siglevel_str in block.lim.SPE.keys():
                siglevel = float(siglevel_str) / 100
                block.lim.SPE[siglevel_str][:, a] = chi2_mult * stats.chi2.ppf(
                    siglevel, chi2_DOF
                )

            # For batch blocks: calculate instantaneous SPE using a window
            # of width = 2w+1 (default value for w=2).
            # This allows for (2w+1)*N observations to be used to calculate
            # the SPE limit, instead of just the usual N observations.
            #
            # Also for batch systems:
            # low values of chi2_DOF: large variability of only a few variables
            # high values: more stable periods: all k's contribute

            for siglevel_str in block.lim.T2.keys():
                siglevel = float(siglevel_str) / 100
                mult = (a + 1) * (N - 1) * (N + 1) / (N * (N - (a + 1)))
                limit = stats.f.ppf(siglevel, a + 1, N - (a + 1))
                block.lim.T2[siglevel_str][:, a] = mult * limit

            for siglevel_str in block.lim.t.keys():
                alpha = (1 - float(siglevel_str) / 100.0) / 2.0
                n_ppf = stats.norm.ppf(1 - alpha)
                block.lim.t[siglevel_str][:, a] = n_ppf * block.S[a]