[DAPHNE-573] Corrections in PCA script.

- Fixed a few little bugs in the Principle Component Analysis (PCA) DaphneDSL script. - The bugs were most likely due to an incomplete translation from the PCA script of Apache SystemDS. - We could trigger these bugs now, since we can run this script only since recently (since we have the eigen()-kernel). - Now the script runs well and non-vectorized and vectorized execution yield the same results (in a quick manual test). - Updated the README.md overview of the available scripts. - Closes #573.
daphne-eu · Jul 11, 2023 · d0f4e6d · d0f4e6d
1 parent 5f73973
commit d0f4e6d
Show file tree

Hide file tree

Showing 2 changed files with 20 additions and 20 deletions.
diff --git a/scripts/algorithms/README.md b/scripts/algorithms/README.md
@@ -50,13 +50,7 @@ loc(fused["scripts/algorithms/components.daph":40:17, "scripts/algorithms/compon
 ```bash
 bin/daphne scripts/algorithms/gnmf.daph rank=2 n=100 e=500 W=\"outW.csv\" H=\"outH.csv\"
 ```
-*Does not work with vectorized execution (`--vec`) at the moment.*
-<!-- error with --vec:
-terminate called after throwing an instance of 'std::runtime_error'
-terminate called recursively
-  what():  colIdx is out of bounds
-Aborted (core dumped)
--->
+*Does not work with vectorized execution (`--vec`) at the moment (executes, but with different results).*
 
 ### Linear Regression using the Direct Solve method
 
@@ -77,14 +71,20 @@ bin/daphne scripts/algorithms/lmDS.daph XY=\"data/wine.csv\" icpt=0 reg=0.000000
 ```bash
 bin/daphne scripts/algorithms/multiLogReg.daph XY=\"data/wine.csv\" B=\"output.csv\"
 ```
-<!-- successful with --vec -->
+*Does not work with vectorized execution (`--vec`) at the moment*
+<!-- error with --vec:
+daphne: /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:62: DenseMatrix<ValueType>::DenseMatrix(const DenseMatrix<ValueType>*, size_t, size_t, size_t, size_t) [with ValueType = double; size_t = long unsigned int]: Assertion `((rowLowerIncl < src->numRows) || rowLowerIncl == 0) && "rowLowerIncl is out of bounds"' failed.
+daphne: /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:63: DenseMatrix<ValueType>::DenseMatrix(const DenseMatrix<ValueType>*, size_t, size_t, size_t, size_t) [with ValueType = double; size_t = long unsigned int]: Assertion `(rowUpperExcl <= src->numRows) && "rowUpperExcl is out of bounds"' failed.
+double free or corruption (out)
+Segmentation fault (core dumped)
+-->
 
-### ~Principal Component Analysis (PCA)~
+### Principal Component Analysis (PCA)
 
 ```bash
-bin/daphne scripts/algorithms/pca.daph X=\"data/wine.csv\" K=2 center=true scale=false
+bin/daphne scripts/algorithms/pca.daph X=\"data/wine.csv\" K=2 center=true scale=false Xout=\"outX.csv\" Mout=\"outM.csv\"
 ```
-*Does not work yet, because we still lack a kernel for `eigen()`.*
+<!-- successful with --vec -->
 
 <!--
 bin/daphne test/api/cli/algorithms/kmeans.daphne r=1000 f=10 c=5 i=3

diff --git a/scripts/algorithms/pca.daph b/scripts/algorithms/pca.daph
@@ -29,8 +29,8 @@
 # ---------------------------------------------------------------------------------------------
 # X      Matrix  ---      Input feature matrix
 # K      Int     2        Number of reduced dimensions (i.e., columns)
-# Center Boolean true     Indicates whether or not to center the feature matrix
-# Scale  Boolean true     Indicates whether or not to scale the feature matrix
+# center Boolean true     Indicates whether or not to center the feature matrix
+# scale  Boolean true     Indicates whether or not to scale the feature matrix
 
 # RETURN VALUES
 # ---------------------------------------------------------------------------------------------
@@ -55,6 +55,7 @@ if( $center ) {
     X = X - mean(X, 1);
 }
 if( $scale ) {
+    # TODO This does not use the original X before centering anymore. Is that okay?
     scaleFactor = sqrt(sum(X^2.0, 1)/Nm1);
     # Replace entries in the scale factor that are 0 and NaN with 1.
     # To avoid division by 0 or NaN, introducing NaN to the ouput.
@@ -68,23 +69,22 @@ mu = mean(X, 1);
 C = (t(X) @ X) / Nm1 - (t(mu) @ mu) * (as.f64(N)/Nm1);
 
 # compute eigen vectors and values
-# TODO full integration of eigen
 evalues, evectors = eigen(C);
 
 # extract dominant eigenvalues and eigenvectors 
-# TODO sorting of matrices
 # Note: can be emulated by:  cbind(evectors, t(evalues)) + cast, sort frame, + extract
-decreasing_Idx = order(evalues, 1, true, true);
-diagmat = ctable(seq(1.0, N, 1.0), decreasing_Idx);
+decreasing_Idx = order(evalues, 0, false, true);
+# TODO This cast should not be necessary (type promotion in ctable).
+diagmat = ctable(seq(0.0, M - 1, 1.0), as.f64(decreasing_Idx));
 evalues = diagmat @ evalues;
 evectors = evectors @ diagmat;
-eval_dominant = evalues[1:K, 1];
-evec_dominant = evectors[,1:K];
+eval_dominant = evalues[0:K, 0];
+evec_dominant = evectors[,0:K];
 
 # Construct new data set by treating computed dominant eigenvectors as the basis vectors
 Xout = X @ evec_dominant;
 Mout = evec_dominant;
 
 # Write out results
 writeMatrix(Xout, $Xout);
-writeMatrix(Xout, $Mout);
+writeMatrix(Mout, $Mout);