
MADlib Benchmark Requirements


Intro

In order to understand and improve the performance of MADlib modules, we need a proper benchmark framework. The purpose of this document is to lay out the requirements for such a solution.

The main goals for MADlib benchmarks are:

  • Competitive comparison

    Our initial comparison targets should be R and Mahout, as it should be easy to set up corresponding tests on the same HW configuration. Initial testing could be performed on a single host.

  • Regression tests and other purposes

    We should keep a log book of current run-times and rerun the appropriate tests after any substantial modifications to existing modules. A summary should be available on the wiki.

  • Performance Improvements

    The benchmark results should eventually be used to detect bottlenecks and to fix them.

Non-Functional Requirements

In order to fulfill the above goals, the MADlib benchmark tool should possess the following characteristics:

  • Portability

    It must run on all OS and DB platforms supported by MADlib. Moreover, it needs to be possible to test other analytics utilities using the same command-line interface.

  • Scalability

    It should be easy to scale the data size for a predefined test. This applies to both the table dimensions and the run-time environment.

  • Reproducibility

    It should be possible to rerun a performance test with identical starting conditions. This applies to both the data generation and the execution phases. All test parameters need to be recorded.

  • Modularity

    It should be easy to add new performance tests for new or existing modules.

  • Automation

    Execution of the benchmark (full or per module) should be easy to automate.

Competitive Comparison

The initial comparison tests can be performed on a single machine using the following plan (a sketch of how the scaling factors could be encoded follows the algorithm list):

  • HW/OS platform: 64-bit Red Hat Enterprise Linux Server 5.5 with 16 CPU cores and 64GB RAM.
  • Test environments:
    • MADlib on Greenplum 4.1
    • MADlib on PostgreSQL 9.0
    • Alpine Miner on Greenplum 4.1
    • R
    • Hadoop/Mahout
  • Algorithms:
    • Naive-Bayes Classification: R, Mahout

      Training set size factors: 
         nr of classes
         nr of attributes
         nr of rows
      Test 1: precompute class priors and feature probabilities, then score the data.
      Test 2: score the data without pre-computation of class priors and feature probabilities.
      
    • Linear Regression: R, Mahout

      Data set size factors: 
         nr of independent variables
         nr of rows
      Test: run the linear regression function.
      
    • Logistic Regression: R, Mahout

      Data set size factors: 
         nr of variables
         nr of rows
      Test: run the logistic regression function.
      
    • Decision Trees: R, Mahout

      Training set size factors: 
         nr of classes
         nr of features
         nr of rows
      Test: run the decision tree training and score the new data.
      
    • Support Vector Machines: R, Mahout

      Training set size factors: 
         nr of classes/labels
         nr of features/dimensions
         nr of rows/points
      Test: run the SVM training and data scoring for each of the following learning tasks:
         (1) regression, (2) classification, (3) novelty detection
      
    • Association Rules: R, Mahout

      Data set size factors: 
         nr of unique items
         nr of transactions
         max number of items per transaction
      Test: run the association rules function.
      
    • k-means Clustering: R, Mahout

      Data set size factors: 
         nr of points/rows
         nr of dimensions
         density/sparsity of data
      Test: run the k-means clustering function.
      
    • SVD Matrix Factorisation: R, Mahout

      Data set size factors: 
         matrix dimensions (rows, columns)
         density/sparsity of data
      Test: run the matrix factorisation function.
      
    • Latent Dirichlet Allocation: R, Mahout

      Data set size factors: 
         nr of documents
         nr of words per document
         size of the dictionary (nr of unique words)
      Test: run the LDA function.
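
The scaling factors above are essentially configuration data. As a purely hypothetical sketch (module names, factor names, and values below are illustrative assumptions, not a settled test matrix), a benchmark driver could encode and enumerate them as follows:

    # Hypothetical encoding of per-algorithm scaling factors.
    from itertools import product

    BENCHMARKS = {
        "NaiveBayes": {"classes": [2, 10], "attributes": [10, 100], "rows": [10**5, 10**7]},
        "LinearRegression": {"independent_variables": [10, 100], "rows": [10**5, 10**7]},
        "KMeans": {"points": [10**5, 10**7], "dimensions": [2, 100]},
        # ... the remaining modules would follow the same pattern ...
    }

    def scale_points(module):
        """Yield one dict of factor values per test run for the given module."""
        factors = BENCHMARKS[module]
        names = sorted(factors)
        for values in product(*(factors[name] for name in names)):
            yield dict(zip(names, values))

    # Example: next(scale_points("LinearRegression"))
    # -> {'independent_variables': 10, 'rows': 100000}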
      

Functional Requirements

Data generation:

  • It should be possible to generate data in parallel. For instance, on a Greenplum system, the external-table loading should be supported.
  • Whenever pseudo-random data is generated, the seed needs to be a parameter so that all tests are deterministic and reproducible.
  • For optimal performance, it is desirable that data generators can be implemented easily in compiled languages like C or C++; a sketch of the seeded-generation approach follows this list.
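
A minimal sketch of such a seeded generator, written in Python for brevity (the option names mirror the madtest example in the Usage section; a production generator would more likely be C or C++ as noted above):

    #!/usr/bin/env python
    # Minimal sketch of a seeded relation generator. Assumed interface: it
    # streams delimited rows to stdout so that an external table (or a plain
    # file dump) can consume them.
    import argparse
    import random
    import sys

    def main():
        parser = argparse.ArgumentParser(description="Generate random regression data.")
        parser.add_argument("--seed", type=int, required=True,
                            help="PRNG seed; identical seeds yield identical data")
        parser.add_argument("--rows", type=int, default=1000)
        parser.add_argument("--ivariables", type=int, default=10)
        args = parser.parse_args()

        rng = random.Random(args.seed)
        for _ in range(args.rows):
            row = [rng.gauss(0, 1) for _ in range(args.ivariables)]
            sys.stdout.write("|".join("%.6f" % x for x in row) + "\n")

    if __name__ == "__main__":
        main()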

Usage:

  • Usage of the performance benchmark tool should be similar to using madpack.

      madtest <madtest options> -p <platform> <platform options> -b <benchmark> <benchmark options>
    

    For instance, for benchmarking linear regression on Greenplum, one would write

      madtest -p greenplum -c localhost:5432/testdb -b LinearRegressionRandom --ivariables 10 --rows 20 --target_base_name bla
    
  • Options should be modular. E.g., it should be possible to add options to a particular benchmark without having to make any modifications to madtest itself; a sketch of such option handling follows.
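
A minimal sketch of how madtest could keep options modular, assuming (purely as an illustration) that each benchmark is a Python module exposing add_arguments and run hooks:

    # Sketch of modular option handling for madtest. The "benchmarks" package
    # and its add_arguments/run hooks are assumptions for illustration only.
    import argparse
    import importlib

    def main(argv=None):
        parser = argparse.ArgumentParser(prog="madtest")
        parser.add_argument("-p", "--platform", required=True)
        parser.add_argument("-c", "--connection")
        parser.add_argument("-b", "--benchmark", required=True)
        known, remaining = parser.parse_known_args(argv)

        # The benchmark module contributes its own options, so adding a new
        # benchmark or a new option never requires changes to madtest itself.
        bench = importlib.import_module("benchmarks." + known.benchmark)
        bench_parser = argparse.ArgumentParser(prog=known.benchmark)
        bench.add_arguments(bench_parser)
        bench_args = bench_parser.parse_args(remaining)

        bench.run(platform=known.platform, connection=known.connection,
                  **vars(bench_args))

    if __name__ == "__main__":
        main()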

Logging

  • Performance benchmark executions should be recorded for future reference, including as many parameters as possible that make each test reproducible; a logger sketch follows.
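
A sketch of such a logger; the file name and record fields are assumptions only:

    # Appends one JSON record per benchmark run to persistent storage.
    import json
    import time

    class FileLogger:
        def __init__(self, path="benchmark_log.jsonl"):
            self.path = path

        def log(self, platform, benchmark, parameters, seconds):
            record = {
                "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
                "platform": platform,
                "benchmark": benchmark,
                "parameters": parameters,  # everything needed to rerun the test
                "runtime_seconds": seconds,
            }
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")

    # Example:
    # FileLogger().log("greenplum", "LinearRegressionRandom",
    #                  {"ivariables": 10, "rows": 20, "seed": 42}, 1.73)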

High-level Design

The following class diagram shows the high-level design. A shaded background indicates interfaces that need to be implemented for the various combinations of analytics tools and modules.

For each analytics tool:

  • A test controller is implemented that drives data generation, data preprocessing/loading, and test execution.

For each module:

  • One or more loggers are implemented to process run-time statistics (usually, this means writing to persistent storage).

For each pair of analytics tool and module:

  • A relation generator is implemented that outputs the benchmark relations in a format usable by the respective analytics tool. The code for the data generation should not be analytics-tool specific, but instead be shared among all analytics tools.

  • A test executor is implemented that feeds the output of the relation generator into the analytics tool. For instance, for Greenplum, this includes setting up external tables whose source is the stdout stream of the relation generator. A normal table would then be created with the same data, and further preparation steps would take place (like running ANALYZE and flushing caches). For other statistics software, the loading step typically consists of writing out the data in the appropriate input file format (which might be a binary format).

    After the loading, the benchmark test is executed by calling the machine-learning algorithm on the newly loaded data. The results and run-time statistics are then passed to the logger (see the interface sketch below).
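
As an illustration of these roles (class and method names below are assumptions; the authoritative definition is the class diagram), the interfaces could be sketched as:

    # Sketch of the interfaces described above; all names are illustrative
    # assumptions, not the actual MADlib benchmark API.
    from abc import ABC, abstractmethod

    class RelationGenerator(ABC):
        """Per (tool, module) pair: emits benchmark relations in a format
        usable by the tool, sharing the underlying generation code."""
        @abstractmethod
        def generate(self, parameters):
            """Stream or write the input relations, honoring the PRNG seed."""

    class TestExecutor(ABC):
        """Per (tool, module) pair: loads generated data and runs the test.
        A Greenplum executor would create external tables over the generator's
        stdout, copy them into regular tables, run ANALYZE, and flush caches."""
        @abstractmethod
        def load(self, generator, parameters):
            ...
        @abstractmethod
        def run(self):
            """Execute the test and return results plus run-time statistics."""

    class Logger(ABC):
        """Per module: persists results and run-time statistics."""
        @abstractmethod
        def log(self, results):
            ...

    class TestController:
        """Per analytics tool: drives generation, loading/preprocessing,
        and test execution."""
        def benchmark(self, generator, executor, logger, parameters):
            executor.load(generator, parameters)
            logger.log(executor.run())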

Class Diagram