In this repository, we provide the code, results, artifacts, and post-processing scripts used to create the results in the following paper:
B. W. Larsen, T. G. Kolda. Practical Leverage-Based Sampling for Low-Rank Tensor Decomposition. arXiv:2006.16438, 2020. http://arxiv.org/abs/2006.16438
- Requires MATLAB R2018a or later
- Requires Tensor Toolbox for MATLAB (with `cp_arls_lev.m`) added to the MATLAB path (https://gitlab.com/tensors/tensor_toolbox/-/tree/cp_arls_lev)
- Requires this repository
- Each experiment has two executable m-files of the form `<exp>_runs_<tensor>.m` and `<exp>_post_<tensor>.m`.
- First, run the file `<exp>_runs_<tensor>.m`.
  - Set `do_diary` to create a log and save the results.
  - Set `do_gitchecks` to false.
  - This will produce a file called `artifact_<exp>_runs_<tensor>_<YYYY_MM_DD_HHMM>_diary.txt` and another called `artifact_<exp>_runs_<tensor>_<YYYY_MM_DD_HHMM>-results.mat`.
- Second, run the file `<exp>_post_<tensor>.m` to analyze the output and produce the charts and diagrams used in the paper.
  - If needed, modify `<exp>_post_<tensor>.m` to read in the artifact file that was just produced.
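The first two settings above can be sketched as a configuration block near the top of a runs script (the variable names `do_diary` and `do_gitchecks` are from this repository; the exact layout of the scripts may differ):

```matlab
% Settings near the top of <exp>_runs_<tensor>.m (sketch; exact layout may differ).
do_diary     = true;   % create the artifact_* diary log and save the *-results.mat file
do_gitchecks = false;  % skip the GIT status capture (see gitstatus.m) outside the original setup
```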
The experiments were run on Sandia's Kahuna cluster using a dedicated node and MATLAB R2018a. Two types of nodes were used, depending on the size of the tensor:
- Compute Nodes - Dual Socket Intel E5-2683v3 2.00 GHz CPU with 256 GB memory; used for the smaller tensors Uber and Enron.
- Zen Nodes - Dual Socket AMD Epyc 7601 2.20 GHz CPU with 1 TB memory; used for the larger tensors Amazon and Reddit.
- `diarysetup.m` - Name and create the log file via MATLAB's diary command
- `gitstatus.m` - Function to capture the GIT hash and any changes
- `kahuna_slurm_batch.sh` - Script for submission of jobs to compute nodes on Sandia's Kahuna cluster
- `kahuna_slurm_batch_1xgpu.sh` - Script for submission of jobs to 1xgpu nodes on Sandia's Kahuna cluster
- `kahuna_slurm_batch_zen.sh` - Script for submission of jobs to zen nodes on Sandia's Kahuna cluster
- `rnginit.m` - Function that controls the random state via printable strings
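Seeding from a printable string can be sketched as follows (the position-weighted hash below is our assumption for illustration, not the actual scheme in `rnginit.m`):

```matlab
% Sketch of seeding MATLAB's random state from a printable string, in the
% spirit of rnginit.m. The hash below is an assumption, not the repository's scheme.
function rnginit_sketch(str)
    chars = double(str);                                % ASCII codes of the string
    seed  = mod(sum(chars .* (1:numel(chars))), 2^32);  % simple position-weighted hash
    rng(seed, 'twister');                               % reproducible generator state
end
```

For example, `rnginit_sketch('uber-run-01')` makes a run reproducible by name alone.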
The job call is formatted as follows:

```
sbatch kahuna_slurm_batch.sh "arls_runs_uber"
sbatch kahuna_slurm_batch_zen.sh "arls_runs_reddit"
```

The last argument is the name of the MATLAB script in double quotes, without the `.m` extension.
The experiments are based on four large-scale sparse tensors from FROSTT (http://frostt.io/), the largest having nearly 5B nonzeros:
- Uber: 4-way count tensor of New York City Taxi Cab pickups from April to August 2014, comprising 183 days × 24 hours × 1,140 latitudes × 1,717 longitudes with 3M nonzeros.
- Amazon: 3-way count tensor based on user review text of 5M users × 2M words × 2M reviews with 2B nonzeros.
- Reddit: 3-way log-count tensor of social network postings of 8M users × 200K subreddits × 8M words with 5B nonzeros.
- Enron: 4-way log-count tensor of emails from 6K senders × 6K receivers × 244K words × 1K days with 54M nonzeros.
Name | Size | Nonzeros | Density |
---|---|---|---|
Uber | 183 × 24 × 1,140 × 1,717 | 3,309,490 | 0.038% |
Enron | 6,066 × 5,699 × 244,268 × 1,176 | 54,202,099 | 5.5 × 10⁻⁷% |
Amazon | 4,821,207 × 1,774,269 × 1,805,187 | 1,741,809,018 | 1.1 × 10⁻⁸% |
Reddit | 8,211,298 × 176,962 × 8,116,559 | 4,687,474,081 | 4.0 × 10⁻⁸% |
Note that the log-count is a modification of the raw data on FROSTT in which the function log(x+1) is applied to each element of the tensor. The `.mat` tensors used in these experiments can be generated by downloading the associated `.tns` file from FROSTT and running the conversion script:
- `conversion_script_uber.m` - Still need to copy off Kahuna
- `conversion_script_enron.m`
- `conversion_script_amazon.m` - Still need to copy off Kahuna
- `conversion_script_reddit.m` - Still need to copy off Kahuna
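A conversion of this kind could look roughly as follows (a sketch assuming a headerless, whitespace-delimited `.tns` file and hypothetical filenames; the repository's scripts may differ in details such as index handling):

```matlab
% Sketch: convert a FROSTT .tns file into a log-count sptensor saved as .mat.
raw  = dlmread('enron.tns');            % each row: i1 ... iN x (subscripts, then value)
subs = raw(:, 1:end-1);                 % subscripts of the nonzeros
vals = log(raw(:, end) + 1);            % log-count transform: log(x+1) elementwise
X    = sptensor(subs, vals);            % Tensor Toolbox sparse tensor
save('enron.mat', 'X', '-v7.3');        % -v7.3 is required for variables over 2 GB
```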
The `.csv` result files are currently only in the source for the paper. They can be produced by running the post script on the associated artifact files.
These runs show that combining repeated rows can significantly decrease the time dedicated to solving the sampled system. This approach changes neither the sampling methodology nor the solution; it merely introduces a preprocessing step that removes redundancies in the linear system. The runs compare four methods on a rank-10 solution of the Amazon tensor: combining rows (C) versus not combining (N), for both random (R) and hybrid (H) sampling schemes for least squares sketching.
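The idea behind combining repeated rows can be sketched as follows: if a sampled row a appears k times, its k copies contribute k(aᵀx − b)² to the residual, which equals (√k aᵀx − √k b)², so duplicates collapse into a single √k-weighted row without changing the least-squares solution. A minimal illustration (variable names are ours, not the repository's):

```matlab
S = [1 2; 1 2; 3 4; 1 2];                    % sampled rows, one row repeated 3 times
b = [1; 1; 2; 1];                            % right-hand side (copies share b values)
[U, ~, ic] = unique(S, 'rows', 'stable');    % distinct rows and index map
k  = accumarray(ic, 1);                      % multiplicity of each distinct row
bu = accumarray(ic, b) ./ k;                 % per-group RHS (identical within a group)
x_full     = S \ b;                          % solve the redundant system
x_combined = (sqrt(k) .* U) \ (sqrt(k) .* bu);  % solve the collapsed, weighted system
assert(norm(x_full - x_combined) < 1e-10);   % same solution, smaller system
```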
- `epsilon_runs_nofit.m` - Sub-function for performing repeated solves of a sampled linear system. Hard-coded to not calculate the full residual [this is now an option in `epsilon_runs` and should be deprecated when refactoring].
- `epsilon_runs_amazon.m` - Calls `epsilon_runs_nofit` for the Amazon tensor.
- `epsilon_post_amazon.m` - Produces data file `data-amazon-combinetime.csv`
Artifacts from runs and post:
- `epsilon_runs_amazon-results.mat` - Results file
- `epsilon_runs_amazon_diary.txt` - Log file
- `data-amazon-combinetime.csv` - Post-processed data
These runs illustrate the differences between random and hybrid sampling using a rank 10 solution to the Uber tensor.
- `epsilon_runs.m` - Sub-function for performing repeated solves of a sampled linear system.
- `epsilon_runs_uber.m` - Calls `epsilon_runs` for the Uber tensor.
- `epsilon_post_uber.m` - Produces data file `data-uber-epsilon.csv`
Artifacts from runs and post:
- `epsilon_runs_uber-results.mat` - Results file
- `epsilon_runs_uber_diary.txt` - Log file
- `data-uber-epsilon.csv` - Post-processed data
These runs compute the rank 25 CP tensor decomposition using matrix sketching on the Uber tensor and compare it to the solution found by CP-ALS. On this relatively small tensor, leverage score sampling is able to perform as well as the standard method.
- `als_runs_uber.m` - Runs `cp_als_trace.m` on the Uber tensor.
- `arls_runs_uber.m` - Runs `cp_arls_lev.m` on the Uber tensor with different numbers of samples.
- `arls_post_uber.m` - Exports data from ARLS runs to `.csv` files and generates temporary MATLAB figures.
Artifacts from runs and post:
- `als_runs_uber-results.mat` - Results file from ALS runs
- `als_runs_uber_diary.txt` - Log file from ALS runs
- `arls_runs_uber-results.mat` - Results file from ARLS runs
- `arls_runs_uber_diary.txt` - Log file from ARLS runs
- `data-uber-fittrace-raw.csv` - Raw time and fit values for each individual run
- `data-uber-fittrace-interpl.csv` - Interpolated fit curve across each set of 10 runs
- `data-uber-totaltimes.csv` - Total time for each set of 10 runs
- `data-uber-finalfits.csv` - Final fit for all runs
These runs demonstrate how leverage score sampling scales favorably for massive sparse tensors, comparing CP-ALS (standard) and CP-ARLS-Lev (random and hybrid) on the Amazon tensor.
- `als_runs_amazon.m` - Runs `cp_als_trace.m` on the Amazon tensor.
- `arls_runs_amazon.m` - Runs `cp_arls_lev.m` on the Amazon tensor with different numbers of samples.
- `arls_post_amazon.m` - Exports data from ARLS runs to `.csv` files and generates temporary MATLAB figures.
Artifacts from runs and post:
- `als_runs_amazon-results.mat` - Combined results file from ALS runs
  - The experiments were split into two parts of 5 runs each. The associated logs:
    - `als_runs_amazon_2020_06_05_0950_diary.txt`
    - `als_runs_amazon_2020_06_08_1035_diary.txt`
- `arls_runs_amazon-results.mat` - Results file from ARLS runs
- `arls_runs_amazon_diary.txt` - Log file from ARLS runs
- `data-amazon-fittrace-raw.csv` - Raw time and fit values for each individual run
- `data-amazon-fittrace-interpl.csv` - Interpolated fit curve across 10 runs
These runs demonstrate how leverage score sampling scales favorably for massive sparse tensors, comparing CP-ALS (standard) and CP-ARLS-Lev (random and hybrid) on the Reddit tensor. Furthermore, the factors (produced in the next section) provide useful and interpretable summaries of trends in the data.
- `als_runs_reddit.m` - Runs `cp_als_trace.m` on the Reddit tensor.
- `arls_runs_reddit.m` - Runs `cp_arls_lev.m` on the Reddit tensor with different numbers of samples.
- `arls_post_reddit.m` - Exports data from ARLS runs to `.csv` files and generates temporary MATLAB figures.
Artifacts from runs and post:
- `als_runs_reddit-results.mat` - Combined results file from ALS runs
  - The experiments were split into ten parts of 1 run each. The associated logs:
    - `als_runs_reddit_2020_04_27_2258_diary.txt`
    - `als_runs_reddit_2020_04_27_2300_diary.txt`
    - `als_runs_reddit_2020_04_27_2301_diary.txt`
    - `als_runs_reddit_2020_04_27_2303_diary.txt`
    - `als_runs_reddit_2020_04_29_2220_diary.txt`
    - `als_runs_reddit_2020_06_01_0933_diary.txt`
    - `als_runs_reddit_2020_06_03_1213_diary.txt`
    - `als_runs_reddit_2020_06_08_1043_diary.txt`
    - `als_runs_reddit_2020_06_14_1542_diary.txt`
    - `als_runs_reddit_2020_06_15_2241_diary.txt`
- `arls_runs_reddit-results.mat` - Combined results file from ARLS runs. This can be produced by running the script `combine_reddit_als.m`
  - The experiments were split into three parts. The associated logs:
    - `arls_runs_reddit_2020_06_05_0946_diary.txt`
    - `arls_runs_reddit_2020_06_05_0951_diary.txt`
    - `arls_runs_reddit_2020_06_08_1038_diary.txt`
- `data-reddit-fittrace-raw.csv` - Raw time and fit values for each individual run
- `data-reddit-fittrace-interpl.csv` - Interpolated fit curve across 10 runs
- `arls_model_reddit.m` - Saves out the best model for the Reddit CP-ALS runs. It is hard-coded with the random seeds for finding this model.
- `save_top_factors.m` - Extracts relevant counts and labels from solution factors.
- `viz_factors_reddit_frequency.m` - Plots top factors along with word frequency.
- `viz_reddit_component.m` - Function to visualize a single component.
Three exemplary figures are shown in the main text (Figures 7.6–7.8). The remaining 22 factors are included in the appendix (Figures E.1–E.22).
These runs use the Enron tensor to illustrate how performance can be improved for some tensors via randomized range finder (RRF) initialization.
- `als_runs_enron.m` - Runs `cp_als_trace.m` on the Enron tensor.
- `arls_runs_enron.m` - Runs `cp_arls_lev.m` on the Enron tensor with different numbers of samples and initialization methods.
- `arls_post_enron.m` - Exports data from ARLS runs to `.csv` files and generates temporary MATLAB figures.
Artifacts from runs and post:
- `als_runs_enron-results.mat` - Results file from ALS runs (1 is the runs with RRF, 2 is the random initializations)
- `als_runs_enron_diary.txt` - Log file from ALS runs
- `arls_runs_enron-results.mat` - Results file from ARLS runs
- `arls_runs_enron_diary.txt` - Log file from ARLS runs
- `data-enron-totaltimes.csv` - Total time for each set of 10 runs
- `data-enron-finalfits.csv` - Final fit for all runs
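The randomized range finder idea used in these runs can be sketched for a single factor matrix as follows (a dense illustration only; the repository's `rrf.m` works with sampled fibers of the sparse tensor):

```matlab
Xk = randn(500, 200);        % stand-in for a mode-k unfolding of the tensor
r  = 25;                     % target CP rank
Y  = Xk * randn(200, r);     % random linear combinations of the columns (fibers)
[U0, ~] = qr(Y, 0);          % thin QR yields an orthonormal initialization
assert(norm(U0' * U0 - eye(r)) < 1e-10);  % columns are orthonormal
```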
These runs compare the performance of CP-ARLS-Lev on a synthetic dense tensor to both the standard CP-ALS method and CP-ARLS (with and without mixing). The synthetic tensor is designed to have a number of concentrated leverage scores.
- `arls_runs_synthetic.m` - Generates a synthetic dense tensor and runs all four methods with different numbers of samples.
- `arls_post_synthetic_dense.m` - Exports data from ARLS runs to `.csv` files and generates temporary MATLAB figures.
Artifacts from runs and post:
- `arls_runs_synthetic_dense-results.mat` - Results file
- `als_runs_synthetic_dense_diary.txt` - Log file from runs
- `data-synthetic-dense-order3.dat` - Total time for each set of 10 runs
Sub-functions for `epsilon_runs.m`:
- `full_solve.m` - Does the full solve, to be compared to the sampled solution.
- `full_relerr.m` - Calculates the full-full and full-sampled residuals, sharing some info between the two (like the MTTKRP), and then calculates the relative error (aka gamma).
- `full_resid.m` - Helper for `full_relerr.m`. Not meant to be run on its own.
Sub-functions for post scripts:
- `interpl_plus.m` - Enhanced interpolation function that extends completed runs at their final value.
- `write_vector.m` - Writes out data in an interpretable manner for pgfplots.
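The extend-at-final-value behavior can be sketched with plain `interp1` (the actual signature of `interpl_plus.m` may differ):

```matlab
t  = [0 1 2 3];                 % times at which one run reported its fit
f  = [0.10 0.30 0.35 0.36];     % fit trace; the run finished at t = 3
tq = 0:0.5:5;                   % common query grid shared across runs
fq = interp1(t, f, tq);         % plain interpolation: NaN beyond t = 3
fq(tq > t(end)) = f(end);       % extend the completed run at its final value
assert(fq(end) == f(end));      % tail is clamped rather than NaN
```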
- `rescaling_test_script.m` - Just some very simple probabilistic sum tests to try to recreate, and then correct, the issues with some strange distributions (i.e., lots of entries mostly zero) and poor performance of the method.
- `script_dampling_factors.m` - Testing for `damping_factors.m`
- `matrix_orientation_tests.m` - Various tests to determine the best orientation for the sampled matricized tensor when the tensor is sparse. Also compares the MATLAB built-in solvers to Brett's QR code, which is about 5X faster in these tests.
- `script_sampled_solve.m` - Rudimentary tests for `sampled_solve.m`
- IMPORTANT: There used to be a bug in saving out the results such that `info.finalfit` is the estimated fit and `info.tf_finalfit` is the true fit calculated at the end. This is easy to fix, but all the current results were saved out this way.
- If multiple results files were combined into one, the dates were left on the associated diaries. The result files were combined in chronological order.
- `tt_sample_norm.m` - Samples elements proportional to their norm. Used for dense experiments with concentrated leverage scores.
- `cp_als_trace.m` and `cp_arls_trace.m` - Versions of these functions that also return the time and fit trace. Should eventually be merged into `cp_als.m` and `cp_arls.m`.
There are generally two things to check if scripts do not run with the public branch of Tensor Toolbox: (1) Make sure the script calls `cp_arls_lev` rather than `cp_arls_leverage`. (2) Make sure `truefit` is set to the appropriate string and the argument `finalfit` is deleted.
These are now available in the public branch but still need to be merged into master.
- `cp_arls_lev.m` - Performs the CP-ARLS algorithm with leverage score sampling.
- `tt_leverage_scores` - Calculates the SVD of a matrix and returns the leverage scores.
- `tt_sampled_solve.m` - Main sub-function that samples by leverage scores, forms the sampled linear system, and returns the solution. It has the following sub-functions:
  - `find_deterministic_krp.m` - For a given threshold, returns all indices with probability above this threshold.
  - `sample_krp_leverage.m` - Sampler that uses damped leverage scores so that the maximum number of times that any particular multi-index is sampled is bounded. Pass in `c=[]` to sample with the original leverage scores.
  - `extract_krp_fibers.m` - Extracts fibers from the KRP matrix.
  - `damping_factors.m` - Helper function for `sample_krp_leverage`. This functionality was not used for any of the experiments in this paper.
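For reference, the leverage scores that drive this sampling are the squared row norms of an orthonormal basis for the factor matrix's column space. A minimal sketch (using a thin QR here in place of the SVD mentioned above):

```matlab
A = randn(100, 10);                   % stand-in for a CP factor matrix
[Q, ~] = qr(A, 0);                    % orthonormal basis for range(A)
lev = sum(Q.^2, 2);                   % leverage score of each row of A
assert(abs(sum(lev) - 10) < 1e-10);   % scores sum to rank(A)
% Rows (and, in cp_arls_lev, multi-indices of the Khatri-Rao product) are then
% sampled with probability proportional to these scores.
```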
- New functions for class `@sptensor`:
  - `extract_sampled_fibers.m` - Creates matrix from sampled fibers
  - `fiber_indices.m` - Computes the fiber indices in mode k for all nonzeros
  - `rrf.m` - Takes in a tensor and a set of linear indices for one of the mode unfoldings. Outputs an initialization U formed by random linear combinations of the fibers.
- New functions for class `@tensor`:
  - `extract_sampled_fibers.m` - Creates matrix from sampled fibers