
MADlib Getting Started Guide


Installation

Steps for Installing MADlib®:

1. Download the MADlib binary

Postgres:

  • on OSX download the .dmg binary from madlib.net
  • on Redhat/CentOS download the .rpm binary from madlib.net

Pivotal Greenplum Database:

  • on Redhat/CentOS download the .gppkg binary from Pivotal Network

2. Install the package at the OS level

Postgres:

  • on OSX double click the installer package

  • on Redhat / CentOS run the following as root:

    yum install <madlib_package> --nogpgcheck
    

Pivotal Greenplum Database:

  • on Redhat / CentOS run the following as gpadmin

            gppkg -i <madlib_package>
    

3. Ensure that the environment is set up for your database deployment and that the database is up and running.

  • Ensure that psql, postgres, and pg_config are in your path

      which psql
      which postgres
      which pg_config
    
  • Ensure that the database is started and running

      psql -c 'select version()'
    

The above may need user/port/password settings depending on how the database has been configured.

4. Run the MADlib deployment utility to deploy MADlib into each database in which you want to use it:

On Postgres:

/usr/local/madlib/bin/madpack -p postgres install

On Pivotal Greenplum Database:

/usr/local/madlib/bin/madpack -p greenplum install

The above may need user/port/password settings depending on how the database has been configured.

5. Run install-check to validate that MADlib was installed properly

On Postgres:

    /usr/local/madlib/bin/madpack -p postgres install-check

On Pivotal Greenplum Database:

    /usr/local/madlib/bin/madpack -p greenplum install-check

The above may need user/port/password settings depending on how the database has been configured.

Loading Data

MADlib is designed for in-database analytic modeling. There are multiple advantages of this approach including:

  • Leveraging all of the data management capabilities of the database: transaction consistency, multi-user concurrency, data access control, etc.
  • Leveraging the innate schema metadata for structured data analysis.
  • When deployed against the Pivotal Greenplum Database it can automatically make use of the native distributed computing capabilities to transparently scale the processing to a high-performance computing cluster.

The data loading process itself is the same as any other database loading operation and can leverage many existing ETL tools.

There are a couple of data structures used by MADlib that are worth mentioning:

Arrays

MADlib makes use of the database engine's native ARRAY type in many algorithms. Often the independent variable list is passed to a training function as an array, either explicitly as an array expression or implicitly as a column that happens to be an array.

Example 1:

SELECT *
FROM madlib.linregr_train('input_table','output_table',
                          'y', 'array[1, x1, x2, x3]');

Example 2:

CREATE TABLE input_table_2 as
    SELECT y, array[1, x1, x2, x3] as x from input_table;
SELECT *
FROM madlib.linregr_train('input_table_2','output_table', 'y','x');

The second form can sometimes lead to faster model training because it avoids having to convert data into an array during the training process.

Arrays also appear as the output of MADlib functions. In the case of linear regression the output table includes arrays for the coefficients, standard errors, p-values, etc. This output can be difficult to interpret directly, so it can be helpful to view it using extended display mode or to explicitly unnest the arrays.

Example 1 – Utilizing extended display

\x on
SELECT * FROM output_table;
\x off

Result:
-[ RECORD 1 ]+----------------------------------------------------------------------------
coef         | {46312.6200010276,28.0425364522841,-33109.6856122434,85.7048457939904}
r2           | 0.811542286900607
std_err      | {40885.2454316128,14.2433210730314,19630.7062616635,33.1022531471901}
t_stats      | {1.13274653269461,1.96882007422977,-1.68662732613461,2.5890940236878}
p_values     | {0.281407589038714,0.0746820465711091,0.119794838361157,0.0251788501326632}
condition_no | 12500.8979313625

Example 2 – Utilizing array unnesting

SELECT
   unnest(array['intercept', 'x1', 'x2','x3']) as feature,
   unnest(coef) as coef,
   unnest(std_err) as std_err,
   unnest(t_stats) as t_statistic,
   unnest(p_values) as p_value
FROM output_table;

Result:

  feature  |       coef        |     std_err      |    t_statistic    |      p_value
-----------+-------------------+------------------+-------------------+--------------------
 intercept |  46312.6200010276 | 40885.2454316128 |  1.13274653269461 |  0.281407589038714
 x1        |  28.0425364522841 | 14.2433210730314 |  1.96882007422977 | 0.0746820465711091
 x2        | -33109.6856122434 | 19630.7062616635 | -1.68662732613461 |  0.119794838361157
 x3        |  85.7048457939904 | 33.1022531471901 |   2.5890940236878 | 0.0251788501326632
(4 rows)
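
The coefficient array stored in output_table can also be fed back into a prediction function. The sketch below assumes the madlib.linregr_predict function is available; it takes the coefficient array and the independent-variable array and returns the predicted value, here applied to the array column x of input_table_2 from Example 2 above.

SELECT i.y,
       madlib.linregr_predict(m.coef, i.x) AS predicted_y
FROM input_table_2 i, output_table m;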

Sparse Vectors

MADlib also provides a datatype called sparse vector (svec) for storing large sparse arrays in a much more space-efficient manner.

Sparse Vectors are displayed in a run-length encoded fashion like so:

'{1,5,1}:{0,1,0}'

This can be interpreted as 1 occurrence of value 0, followed by 5 occurrences of value 1, followed by 1 occurrence of value 0.

In other words, the above sparse vector is equivalent to the following array:

Array[0,1,1,1,1,1,0]

This is particularly useful for methods that expect very large feature spaces as input. In particular, many algorithms intended for text processing expect very large vectors of term counts, which can be represented either as arrays or as sparse vectors; the space utilization of sparse vectors can be much more efficient.
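
As a quick illustration, the sketch below assumes MADlib was deployed into the madlib schema (so the svec type lives there) and casts the run-length encoded literal to an svec and back to a plain array:

SELECT '{1,5,1}:{0,1,0}'::madlib.svec;              -- sparse representation
SELECT ('{1,5,1}:{0,1,0}'::madlib.svec)::float8[];  -- expands to {0,1,1,1,1,1,0}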

See Sparse Vectors for more information.

An introduction to machine learning with MADlib

Understanding the concepts of Machine Learning

In general, machine learning is a technique for deriving insights from data. This is achieved by taking samples of the input data, analyzing their statistical properties, and building a model that summarizes those properties in a way that best approximates the underlying data.

We can roughly separate learning problems into two discrete categories:

Supervised learning:

In supervised learning we try to derive how one value, often called the dependent variable, can be predicted from other known values, referred to as independent variables or features. The model must be trained on a set of inputs where the values of both the dependent variable and the independent variables are known. The result of training is a model that can predict future values of the dependent variable based only on new values of the independent variables.

Within supervised learning there are two common subdivisions:

Classification Methods

When the desired output is categorical in nature we use classification methods to build a model that predicts which of the various categories a new result would fall into. The goal of classification is to label incoming records with the correct class.

Example: If we had demographic and other features of individuals applying for loans, along with historical data indicating which past loans had defaulted, then we could build a model describing the likelihood that a new applicant would default. In this case the categories are "will default" and "won't default", two discrete classes of output.
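
As a rough sketch of how such a classifier could be trained with MADlib, the call below uses logistic regression; the table loan_history and its columns are hypothetical names used only for illustration.

SELECT madlib.logregr_train(
    'loan_history',                    -- hypothetical source table
    'loan_default_model',              -- output table that will hold the model
    'defaulted',                       -- boolean dependent variable
    'ARRAY[1, income, loan_amount]'    -- independent variables (with intercept term)
);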

Regression Methods

When the desired output is continuous in nature we use regression methods to build a model that predicts the output value.

Example: If we had data that described properties of real estate listings then we could build a model to predict the sale value for homes based on the known characteristics of the houses. This is a regression because the output response is continuous in nature rather than categorical.

Unsupervised Learning

In unsupervised learning the training data does not include a dependent variable; rather, the goal is to find structure in the data without an a priori assumption about what that structure may be. Several different types of models fall into this category, including:

Clustering Methods

Clustering methods try to identify groups of data such that the items within one cluster are more similar to each other than they are to the items in any other cluster.

Example: In customer segmentation analysis the goal is to identify specific groups of customers that behave in a similar fashion so that various marketing campaigns can be designed to reach these markets. When the customer segments are known in advance this is a supervised classification task; when we let the data itself identify the segments it is a clustering task.
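
A minimal sketch of such a segmentation using MADlib's k-means++ function is shown below; the table customer_features and its array column features are hypothetical, and the short form of the call relies on the default distance and centroid aggregate functions.

SELECT *
FROM madlib.kmeanspp('customer_features',  -- hypothetical table of customer feature vectors
                     'features',           -- array column holding each customer's coordinates
                     5);                   -- number of clusters to find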

Topic Modeling Methods

Topic modeling is similar to clustering in that it attempts to identify groups of documents that are similar to each other, but it is specialized to the text domain, where it also tries to identify the main themes of those documents.

Association Rule Mining

Also called market basket analysis or frequent itemset mining, this method attempts to identify items that occur together more frequently than random chance would suggest, indicating an underlying relationship between the items.

Example: In an online web store association rule mining can be used to identify what products tend to be purchased together. This can then be used as input into a product recommender engine to suggest items that may be of interest to the customer and provide upsell opportunities.
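
A sketch of such an analysis with MADlib's association rules function might look like the following; the table order_lines and its columns are hypothetical, and the argument list shown (support, confidence, transaction-id column, item column, input table, output schema, verbose) is an assumption that may differ between MADlib versions.

SELECT *
FROM madlib.assoc_rules(0.10,           -- minimum support
                        0.50,           -- minimum confidence
                        'order_id',     -- transaction id column (hypothetical)
                        'product',      -- item column (hypothetical)
                        'order_lines',  -- hypothetical input table
                        'public',       -- schema in which to create the output table
                        TRUE);          -- verbose output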

Descriptive Statistics and Validation

Descriptive Statistics

Descriptive statistics don't produce a model and thus are not considered a learning method, but they help an analyst understand the underlying data and can provide valuable insights that may influence the choice of model.

Example: Calculating the distribution of the values within each variable of a dataset can help an analyst understand which variables should be treated as categorical and which as continuous, as well as what sort of distribution the values follow.
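
One option is MADlib's summary() function, which profiles each column of a table; the sketch below assumes a hypothetical table loan_history and writes the per-column statistics to loan_history_summary.

SELECT * FROM madlib.summary('loan_history',           -- hypothetical input table
                             'loan_history_summary');  -- output table of per-column statistics
SELECT * FROM loan_history_summary;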

Validation

Using a model without understanding its accuracy can lead to disastrous consequences. For that reason it is important to understand the error of a model and to evaluate its accuracy on testing data. Frequently in data analysis the data is separated into training data and testing data solely to provide a statistically valid assessment of the model and to verify that it is not overfitting the training data. N-fold cross validation is also frequently used.
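
A minimal sketch of such a train/test split done directly in SQL is shown below; the table loan_history and its id column are hypothetical, with roughly 80% of the rows sampled for training and the remainder held out for testing.

CREATE TABLE loans_train AS
    SELECT * FROM loan_history WHERE random() < 0.8;  -- ~80% of rows for training

CREATE TABLE loans_test AS
    SELECT * FROM loan_history l                      -- remaining ~20% for testing
    WHERE NOT EXISTS (SELECT 1 FROM loans_train t WHERE t.id = l.id);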
