Skip to content

Commit

Permalink
[DAPHNE-573] Corrections in PCA script.
Browse files Browse the repository at this point in the history
- Fixed a few little bugs in the Principle Component Analysis (PCA) DaphneDSL script.
  - The bugs were most likely due to an incomplete translation from the PCA script of Apache SystemDS.
  - We could trigger these bugs now, since we can run this script only since recently (since we have the eigen()-kernel).
- Now the script runs well and non-vectorized and vectorized execution yield the same results (in a quick manual test).
- Updated the README.md overview of the available scripts.
- Closes #573.
  • Loading branch information
pdamme committed Jul 11, 2023
1 parent 5f73973 commit d0f4e6d
Show file tree
Hide file tree
Showing 2 changed files with 20 additions and 20 deletions.
22 changes: 11 additions & 11 deletions scripts/algorithms/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,13 +50,7 @@ loc(fused["scripts/algorithms/components.daph":40:17, "scripts/algorithms/compon
```bash
bin/daphne scripts/algorithms/gnmf.daph rank=2 n=100 e=500 W=\"outW.csv\" H=\"outH.csv\"
```
*Does not work with vectorized execution (`--vec`) at the moment.*
<!-- error with --vec:
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
what(): colIdx is out of bounds
Aborted (core dumped)
-->
*Does not work with vectorized execution (`--vec`) at the moment (executes, but with different results).*

### Linear Regression using the Direct Solve method

Expand All @@ -77,14 +71,20 @@ bin/daphne scripts/algorithms/lmDS.daph XY=\"data/wine.csv\" icpt=0 reg=0.000000
```bash
bin/daphne scripts/algorithms/multiLogReg.daph XY=\"data/wine.csv\" B=\"output.csv\"
```
<!-- successful with --vec -->
*Does not work with vectorized execution (`--vec`) at the moment*
<!-- error with --vec:
daphne: /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:62: DenseMatrix<ValueType>::DenseMatrix(const DenseMatrix<ValueType>*, size_t, size_t, size_t, size_t) [with ValueType = double; size_t = long unsigned int]: Assertion `((rowLowerIncl < src->numRows) || rowLowerIncl == 0) && "rowLowerIncl is out of bounds"' failed.
daphne: /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:63: DenseMatrix<ValueType>::DenseMatrix(const DenseMatrix<ValueType>*, size_t, size_t, size_t, size_t) [with ValueType = double; size_t = long unsigned int]: Assertion `(rowUpperExcl <= src->numRows) && "rowUpperExcl is out of bounds"' failed.
double free or corruption (out)
Segmentation fault (core dumped)
-->

### ~Principal Component Analysis (PCA)~
### Principal Component Analysis (PCA)

```bash
bin/daphne scripts/algorithms/pca.daph X=\"data/wine.csv\" K=2 center=true scale=false
bin/daphne scripts/algorithms/pca.daph X=\"data/wine.csv\" K=2 center=true scale=false Xout=\"outX.csv\" Mout=\"outM.csv\"
```
*Does not work yet, because we still lack a kernel for `eigen()`.*
<!-- successful with --vec -->

<!--
bin/daphne test/api/cli/algorithms/kmeans.daphne r=1000 f=10 c=5 i=3
Expand Down
18 changes: 9 additions & 9 deletions scripts/algorithms/pca.daph
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,8 @@
# ---------------------------------------------------------------------------------------------
# X Matrix --- Input feature matrix
# K Int 2 Number of reduced dimensions (i.e., columns)
# Center Boolean true Indicates whether or not to center the feature matrix
# Scale Boolean true Indicates whether or not to scale the feature matrix
# center Boolean true Indicates whether or not to center the feature matrix
# scale Boolean true Indicates whether or not to scale the feature matrix

# RETURN VALUES
# ---------------------------------------------------------------------------------------------
Expand All @@ -55,6 +55,7 @@ if( $center ) {
X = X - mean(X, 1);
}
if( $scale ) {
# TODO This does not use the original X before centering anymore. Is that okay?
scaleFactor = sqrt(sum(X^2.0, 1)/Nm1);
# Replace entries in the scale factor that are 0 and NaN with 1.
# To avoid division by 0 or NaN, introducing NaN to the ouput.
Expand All @@ -68,23 +69,22 @@ mu = mean(X, 1);
C = (t(X) @ X) / Nm1 - (t(mu) @ mu) * (as.f64(N)/Nm1);

# compute eigen vectors and values
# TODO full integration of eigen
evalues, evectors = eigen(C);

# extract dominant eigenvalues and eigenvectors
# TODO sorting of matrices
# Note: can be emulated by: cbind(evectors, t(evalues)) + cast, sort frame, + extract
decreasing_Idx = order(evalues, 1, true, true);
diagmat = ctable(seq(1.0, N, 1.0), decreasing_Idx);
decreasing_Idx = order(evalues, 0, false, true);
# TODO This cast should not be necessary (type promotion in ctable).
diagmat = ctable(seq(0.0, M - 1, 1.0), as.f64(decreasing_Idx));
evalues = diagmat @ evalues;
evectors = evectors @ diagmat;
eval_dominant = evalues[1:K, 1];
evec_dominant = evectors[,1:K];
eval_dominant = evalues[0:K, 0];
evec_dominant = evectors[,0:K];

# Construct new data set by treating computed dominant eigenvectors as the basis vectors
Xout = X @ evec_dominant;
Mout = evec_dominant;

# Write out results
writeMatrix(Xout, $Xout);
writeMatrix(Xout, $Mout);
writeMatrix(Mout, $Mout);

0 comments on commit d0f4e6d

Please sign in to comment.