Releases: madlib/archived_madlib

MADlib v1.8

21 Mar 22:49

Release Date: 2015-July-17

New features:

  • Improved Latent Dirichlet Allocation (LDA) Performance
    • Function lda_train() is about twice as fast.
    • Improved the scalability of the function
      (vocabulary size x number of topics can be up to 250 million).
  • New module: Matrix operations
    Added the following operations/functions for dense and sparse matrices:
    • Mathematical operations: addition, subtraction, multiplication,
      element-wise multiplication, scalar and vector multiplication.
    • Aggregation operations: apply various operations including
      max, min, sum, mean along a specified dimension.
    • Visitor methods: extract row/column from matrix.
    • Representation: convert a matrix to either dense or sparse representation.
  • Quotation and International Character Support
    • Most modules now support table and column names that are quoted and
      contain international characters, including:
      • Regression models (GLMs, linear regression, elastic net, etc.)
      • Decision trees and random forests
      • Unsupervised learning models (association rules, k-means, LDA, etc.)
      • Summary, Pearson's correlation, and PCA
  • Array Norms and Distances
    • Generic p-norm distance
    • Jaccard distance
    • Cosine similarity
  • Text Analysis:
    • Text utility for term frequency and vocabulary construction (prepares
      documents for input to LDA).
  • Miscellaneous
    • Improved organization of User and Developer guide at doc.madlib.net/latest.
    • Low-rank matrix factorization: added 32-bit integer support (MADLIB-903).
    • Cross-validation: added classification support (MADLIB-908).
    • Added a new clean-up function for removing MADlib temporary tables.
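
The array norms and distances listed above can be illustrated with a minimal Python sketch. The function names here are hypothetical and are not MADlib's SQL function names; the code only shows the usual textbook definitions of each measure.

```python
import math


def p_norm_distance(x, y, p=2.0):
    """Generic p-norm (Minkowski) distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|, treating both inputs as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen in this sketch for two empty sets
    return 1.0 - len(set_a & set_b) / len(union)


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)
```

With p=1 the p-norm distance is the Manhattan distance, and with p=2 it is the Euclidean distance.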

Note:

  • LDA models that are trained using MADlib v1.7.1 or earlier need to be
    re-trained to be used in MADlib v1.8.

Known issues:

  • Performance for decision tree with cross-validation is poor on a HAWQ
    multi-node system.

MADlib v1.7.1

21 Mar 22:52

Release Date: 2015-March-18

New features:

  • Random Forest Performance Improvement
    • Function forest_train() is 1.5X ~ 4X faster without variable importance,
      and up to 100X faster with variable importance
    • Function forest_predict() is up to 10X faster when type='response'
    • Allow user-specified sample ratio to train with a small subsample
  • Gaussian Naive Bayes: allow continuous variables
  • K-Means: Allow user-specified sample ratio for K-means++ seeding
  • Miscellaneous
    • Array functions: array_square() for element-wise square, madlib.sum()
      for array element-wise aggregation
    • Madpack does not require password when not necessary (MADLIB-357)
    • Platform support of PostgreSQL 9.4 and HAWQ 1.3
    • Allow views and materialized views for training functions
    • Support quantile computation in summary functions for HAWQ and PG 9.4
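
As a rough illustration of the two array functions mentioned above, the following Python sketch shows an element-wise square and an element-wise sum across several arrays. The helper `elementwise_sum` is a hypothetical stand-in for the aggregate behavior described for madlib.sum(); it is not the library's implementation.

```python
def array_square(values):
    """Return a new list holding the square of each element."""
    return [v * v for v in values]


def elementwise_sum(arrays):
    """Aggregate equal-length arrays by summing them position by position.

    Returns None when no arrays are supplied (an assumption of this sketch).
    """
    total = None
    for arr in arrays:
        if total is None:
            total = list(arr)
        elif len(arr) != len(total):
            raise ValueError("all arrays must have the same length")
        else:
            total = [t + v for t, v in zip(total, arr)]
    return total
```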

Bug fixes:

  • Fixed the support of multiple parameter values and NULL in general
    cross-validation (MADLIB-898, MADLIB-896)
  • Fixed infinite loop when detecting recursive view-to-view dependencies for
    upgrading (MADLIB-901)
  • Allow user-specified column names in PCA and multinom_predict()

Known issues:

  • Performance for decision tree with cross-validation is poor on a HAWQ
    multi-node system.

MADlib v1.7

22 Mar 02:23

Release Date: 2014-December-31

New features:

  • Generalized Linear Model:
    • Added a new generic module for GLM functions that allow for response
      variables that have arbitrary distributions (rather than simply
      Gaussian distributions), and for an arbitrary function of the response
      variable (the link function) to vary linearly with the predicted values
      (rather than assuming that the response itself must vary linearly).
    • Available distribution families: gaussian (link functions: identity,
      inverse and log), binomial (link functions: probit and logit),
      poisson (link functions: log, identity and square-root), gamma (link
      functions: inverse, identity and log) and inverse gaussian (link functions:
      square-inverse, inverse, identity and log).
    • Deprecated 'mlogregr_train' in favor of 'multinom' available as part of
      the new GLM functionality.
    • Added a new 'ordinal' function for ordered logit and probit regression.
  • Decision Tree: Reimplemented the decision tree module, with the following
    changes:
    • Improved usability due to a new interface.
    • Performance up to 40 times faster than the old interface.
    • Additional features like pruning methods, surrogate variables for
      NULL handling, cross validation, and various new tree tuning parameters.
    • Addition of a new display function to visualize the trained tree and new
      prediction function for scoring of new datasets.
  • Random Forest: Reimplemented the random forest module, with the following
    changes:
    • New random forest module based on the new decision tree module.
    • Better variable importance metrics and ability to explore each tree
      in the forest independently.
    • Ability to get class probabilities of all classes and not just the max
      class during prediction.
    • Improved visualization with export capabilities using Graphviz dot format.
  • PMML:
    • Upgraded compatible PMML version to 4.1.
    • Moved PMML export out of early stage development with new functionality
      available to export GLM, decision tree, and random forest models.
  • Updated Eigen from 3.1.2 to 3.2.2.
  • Updated PyXB from 1.2.3 to 1.2.4.
  • Added finer granularity control for running specific install-check tests.
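
The role of the link function in the GLM notes above can be sketched in Python: the linear predictor is computed from the coefficients, and the inverse link maps it back to the mean of the response. The dictionary keys and the `predict_mean` helper are hypothetical names chosen for this sketch and do not reflect MADlib's interface.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Each entry maps a link name to (link, inverse_link).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "sqrt": (math.sqrt, lambda eta: eta * eta),
    "square-inverse": (lambda mu: 1.0 / (mu * mu),
                       lambda eta: 1.0 / math.sqrt(eta)),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}


def predict_mean(link_name, coefficients, features):
    """Return mu = g^{-1}(x . beta) for the chosen link function g."""
    _, inverse_link = LINKS[link_name]
    eta = sum(c * x for c, x in zip(coefficients, features))
    return inverse_link(eta)
```

With the identity link this reduces to ordinary linear prediction, which is the special case the release note contrasts against.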

Bug fixes:

  • Fixed bug in K-means allowing use of user-defined metric functions
    (MADLIB-874, MADLIB-875).
  • Fixed issues related to header files included in the build system
    (MADLIB-855, MADLIB-879, MADLIB-884).

Known issues:

  • Performance for decision tree with cross-validation is poor on a HAWQ
    multi-node system.

MADlib v1.6

22 Mar 02:29

Release Date: 2014-June-30

New features:

  • Added a new unified 'margins' function that computes marginal effects for
    linear, logistic, multilogistic, and Cox proportional hazards regression. The
    new function also introduces support for interaction terms in the independent
    array.
  • Updated convergence for 'elastic_net_train' by checking the change in the
    loglikelihood instead of the l2-norm of the change in coefficients. This allows
    for faster convergence in problems with multiple optimal solutions.
    The default threshold for convergence has been reduced from 1e-4 to 1e-6.
  • Added a new helper function to convert categorical variables to indicator
    variables which can be used directly in regression methods. The function
    currently only supports dummy encoding.
  • Improved performance for Cox proportional hazards: average improvement of
    20-fold on GPDB and 2.5-fold on HAWQ.
  • Improved performance on ARIMA by 30%.
  • Added new functionality to export linear and logistic regression models as a
    PMML object. The new module relies on PyXB to create PMML elements.
  • Added a function ('array_scalar_add') to 'add' a scalar to an array.
  • Added 'numeric' type support for all functions that take 'anyarray' as
    argument.
  • Made usability and aesthetic enhancements to documentation.
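
The conversion of a categorical variable to indicator variables mentioned above can be sketched as follows. This is an illustration in Python, not the MADlib helper itself; the function name, the `is_<level>` column naming, and the `drop_first` option (whether a reference level is omitted) are all assumptions of this sketch.

```python
def dummy_encode(values, drop_first=False):
    """Turn a list of categorical values into 0/1 indicator columns.

    Returns (column_names, rows), with one row of indicators per input value.
    """
    levels = sorted(set(values))
    if drop_first:
        # Omit the first level so it acts as the reference category.
        levels = levels[1:]
    names = [f"is_{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels]
            for value in values]
    return names, rows
```

The resulting 0/1 columns can be placed directly in the independent-variable array of a regression.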

Bug Fixes:

  • Prepended python module name to sys.path before executing madlib function
    to avoid conflicts with user-defined modules.
  • Added a check in K-Means to ensure that all data points have the same
    dimensionality, and that it equals the dimensionality of any provided
    initial centroids (MADLIB-713, MADLIB-789).
  • Added a check in multinomial regression to quit early and cleanly if model
    size is greater than the maximum permissible memory (MADLIB-667).
  • Fixed a minor bug with incorrect column names in the decision trees module
    (MADLIB-763).
  • Fixed a bug in K-means that resulted in an incorrect number of centroids for
    particular datasets (MADLIB-857).
  • Fixed a bug that occurred when a grouping column has the same name as one
    of the output table columns (MADLIB-833).

Deprecated Functions:

  • Modules profile and quantile have been deprecated in favor of the 'summary'
    function.
  • Module 'svd_mf' has been deprecated in favor of the improved 'svd' function.
  • Functions 'margins_logregr' and 'margins_mlogregr' have been deprecated in
    favor of the 'margins' function.

MADlib v1.5

22 Mar 02:35

Release Date: 2014-Mar-05

New features:

  • Added a new port 'HAWQ'. MADlib can now be used with the Pivotal
    Distribution of Hadoop (PHD) through HAWQ
    (see http://www.gopivotal.com/big-data/pivotal-hd for more details).
  • Implemented performance improvements for linear and logistic predict functions.
  • Moved Conditional Random Fields (CRFs) out of early stage development, and
    updated the design and APIs to enable ease of use and better functionality.
    API changes include replacing lincrf with lincrf_train, updating the
    arguments of crf_train_fgen and crf_test_fgen, and changing the format of
    segment tables.
  • Improved linear support vector machines (SVMs) by enabling iterations, and
    removed lsvm_predict and svm_predict, which are not useful in GPDB and HAWQ.
  • Added new functions, with improved performance compared to svec_sfv, for
    document vectorization into sparse vectors.
  • Removed the bool-to-text cast and updated all functions depending on it to
    explicitly convert variable to text.
  • Added function properties for all SQL functions to allow the database optimizer
    to make better plans.

Bug Fixes:

  • Set client_min_messages to 'notice' during database installation to ensure
    that log messages don't get logged to STDERR.
  • Fixed elastic net prediction to predict using all features instead of just
    the selected features to avoid an error when no feature is selected as relevant
    in the trained model.
  • Fixed quantile values for the corner probability values p=0 and p=1 in
    bernoulli and binomial distributions: the quantiles are now 0 and
    num_of_trials (=1 in the case of bernoulli) respectively, independent of
    the probability of success.
  • Changed install script to explicitly use /bin/bash instead of /bin/sh to avoid
    problems in Ubuntu where /bin/sh is linked to 'dash'.
  • Fixed issue in Elastic Net to take any array expression as input instead of
    specifically expecting the expression 'ARRAY[...]'.
  • Fixed wrong output in percentile of count-min (CM) sketches.
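
The corner-case behavior described above for binomial and bernoulli quantiles can be illustrated with a small Python sketch. It is not MADlib's implementation; the function name and argument order are hypothetical, and the quantile is computed by simple accumulation of the probability mass function.

```python
import math


def binomial_quantile(p, num_of_trials, prob_success):
    """Smallest k with P(X <= k) >= p for X ~ Binomial(num_of_trials, prob_success).

    The corner cases p=0 and p=1 return 0 and num_of_trials, independent of
    prob_success, matching the behavior stated in the release note.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if not 0.0 <= prob_success <= 1.0:
        raise ValueError("prob_success must be in [0, 1]")
    if p == 0.0:
        return 0
    if p == 1.0:
        return num_of_trials
    cumulative = 0.0
    for k in range(num_of_trials + 1):
        cumulative += (math.comb(num_of_trials, k)
                       * prob_success ** k
                       * (1.0 - prob_success) ** (num_of_trials - k))
        if cumulative >= p:
            return k
    return num_of_trials  # guard against floating-point rounding


def bernoulli_quantile(p, prob_success):
    """A bernoulli variable is a binomial with a single trial."""
    return binomial_quantile(p, 1, prob_success)
```

Handling p=1 explicitly matters because the accumulated floating-point sum may fall just short of 1.0.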

Known issues:

  • Elastic net prediction wrapper function elastic_net_prediction is not
    available in HAWQ. Instead, prediction functionality is available for both
    families via elastic_net_gaussian_predict and elastic_net_binomial_predict.
  • Distance metrics functions in K-Means for the HAWQ port are restricted to the
    in-built functions, specifically squaredDistNorm2, distNorm2, distNorm1,
    distAngle, and distTanimoto.
  • Functions in Quantile and Profile modules of Early Stage Development are not
    available in HAWQ. Replacement of these functions is available as built-in
    functions (percentile_cont) in HAWQ and Summary module in MADlib, respectively.

MADlib v1.4.1

22 Mar 18:58

Release Date: 2013-Dec-13

Bug Fixes:

  • Fixed problem in Elastic Net for 'binomial' family if an 'integer' column was
    passed for dependent variable instead of a 'boolean' column.
  • '' support in Elastic Net lacked checks for the columns being combined. Now
    we check if the column for '' is already an array, in which case we don't
    wrap it with an 'array' modifier. If there are multiple columns we check
    that they are of the same numeric type before building an array.
  • Fixed a software regression in Robust Variance, Clustered Variance and
    Marginal Effects for multinomial regression introduced in v1.4 when
    output table name is schema-qualified.
  • We now also support schema-qualified output table prefixes for SVD and PCA.
  • Added warning message when deprecated functions are run. Also added a list of
    deprecated functions in the ReadMe.
  • Added a Markdown Readme along with the text version for better rendering on
    GitHub.

MADlib v1.4

22 Mar 19:00

Release Date: 2013-Nov-25

New Features:

  • Improved interface for Multinomial logistic regression:
    • Added a new interface that accepts an 'output_table' parameter and
      stores the model details in the output table instead of returning as a struct
      data type. The updated function also builds a summary table that includes
      all parameters and meta-parameters used during model training.
    • The output table has been reformatted to present the model coefficients
      and related metrics for each category in a separate row. This replaces the
      old output format of model stats for all categories combined in a
      single array.
  • Variance Estimators
    • Added Robust Variance estimator for Cox PH models (Lin and Wei, 1989).
      It is useful in calculating variances in a dataset with potentially
      noisy outliers. Namely, the standard errors are asymptotically normal even
      if the model is wrong due to outliers.
    • Added Clustered Variance estimator for Cox PH models. It is used
      when the data contain clustering information in addition to covariates,
      and it produces asymptotically normal estimates.
  • NULL Handling:
    • Modified behavior of regression modules to 'omit' rows containing NULL
      values for any of the dependent and independent variables. The number of
      rows skipped is provided as part of the output table.
      This release includes NULL handling for the following modules:
      • Linear, Logistic, and Multinomial logistic regression, as well as
        Cox Proportional Hazards
      • Huber-White sandwich estimators for linear, logistic, and multinomial
        logistic regression as well as Cox Proportional Hazards
      • Clustered variance estimators for linear, logistic, and multinomial
        logistic regression as well as Cox Proportional Hazards
      • Marginal effects for logistic and multinomial logistic regression

Deprecated functions:

  • Multinomial logistic regression function has been renamed to
    'mlogregr_train'. Old function ('mlogregr') has been deprecated,
    and will be removed in the next major version update.
  • For all multinomial regression estimator functions (list given below),
    changes in the argument list were made to collate all optimizer specific
    arguments in a single string. An example of the new optimizer parameter is
    'max_iter=20, optimizer=irls, precision=0.0001'.
    This is in contrast to the original argument list that contained 3 arguments:
    'max_iter', 'optimizer', and 'precision'. This change allows adding new
    optimizer-specific parameters without changing the argument list.
    Affected functions:
    - robust_variance_mlogregr
    - clustered_variance_mlogregr
    - margins_mlogregr
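
To illustrate why a single optimizer string is extensible, here is a minimal Python sketch that parses a string such as 'max_iter=20, optimizer=irls, precision=0.0001' into named parameters. The parser is hypothetical and is not how MADlib reads the argument; it only shows that new keys can be added without changing the argument list.

```python
def _convert(text):
    """Interpret a value as int, then float, falling back to the raw string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_optimizer_params(param_string):
    """Parse 'key=value, key=value' pairs into a dictionary."""
    params = {}
    for item in param_string.split(","):
        item = item.strip()
        if not item:
            continue
        name, separator, value = item.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {item!r}")
        params[name.strip()] = _convert(value.strip())
    return params
```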

Bug Fixes:

  • Fixed an overflow problem in LDA by using INT64 instead of INT32.
  • Fixed integer to boolean cast bug in clustered variance for logistic
    regression. After this fix, integer columns are accepted for binary
    dependent variable using the 'integer to bool' cast rules.
  • Fixed two bugs in SVD:
    • The 'example' option for online help has been fixed
    • Column names for sparse input tables in the 'svd_sparse' and
      'svd_sparse_native' functions are no longer restricted to 'row_id',
      'col_id' and 'value'.

MADlib v1.3

22 Mar 20:55

Release Date: 2013-October-03

New Features:

  • Cox Proportional Hazards:
    • Added stratification support for Cox PH models. Stratification is used as
      shorthand for building a Cox model that allows for more than one stratum,
      and hence, allows for more than one baseline hazard function.
      Stratification provides two pieces of key, flexible functionality for the
      end user of Cox models:
      -- Allows a categorical variable Z to be appropriately accounted for in
      the model without estimating its predictive impact on the response
      variable.
      -- Categorical variable Z is predictive/associated with the response
      variable, but Z may not satisfy the proportional hazards assumption
    • Added a new function (cox_zph) that tests the proportional hazards
      assumption of a Cox model. This allows the user to build Cox models and then
      verify the relevance of the model.
  • NULL Handling:
    • Modified behavior of linear and logistic regression to 'omit' rows
      containing NULL values for any of the dependent and independent variables.
      The number of rows skipped is provided as part of the output table.

Deprecated functions:

  • Cox Proportional Hazard function has been renamed to 'coxph_train'.
    Old function names ('cox_prop_hazards' and 'cox_prop_hazards_regr')
    have been deprecated, and will be removed in the next major version update.
  • The aggregate form of linear regression ('linregr') has been deprecated.
    The stored-procedure form ('linregr_train') should be used instead.

Bug Fixes:

  • Fixed a memory leak in the Apriori algorithm.

MADlib v1.2

22 Mar 20:55

Release Date: 2013-September-06

New Features:

  • ARIMA Timeseries modeling
    • Added auto-regressive integrated moving average (ARIMA) modeling for
      non-seasonal, univariate timeseries data.
    • Module includes a training function to compute an ARIMA model and a
      forecasting function to predict future values in the timeseries
    • Training function employs the Levenberg-Marquardt algorithm (LMA) to
      compute a numerical solution for the parameters of the model. The
      observations and innovations for time before the first timestamp
      are assumed to be zero leading to minimization of the conditional sum of
      squares. This produces estimates referred to as conditional maximum likelihood
      estimates (also referred to as 'CSS' in some statistical packages).
  • Documentation updates:
    • Introduced a new format for documentation improving usability.
    • Upgraded to Doxygen v1.8.4.
    • Updated documentation improving consistency for multiple modules including
      Regression methods, SVD, PCA, Summary function, and Linear systems.
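
The conditional sum of squares mentioned in the ARIMA notes above can be sketched in Python for an ARMA model on an already differenced, zero-mean series. Observations and innovations before the first timestamp are taken as zero, as the note states. The function name and the sign convention for the moving-average terms are assumptions of this sketch, and the Levenberg-Marquardt search over the coefficients is not shown.

```python
def conditional_sum_of_squares(series, ar_coeffs, ma_coeffs):
    """Return (css, residuals) for an ARMA model with zero pre-sample values.

    Innovation at time t:
        e_t = y_t - sum_i phi_i * y_{t-i} - sum_j theta_j * e_{t-j}
    where any term reaching before the first observation contributes zero.
    """
    residuals = []
    for t, value in enumerate(series):
        prediction = 0.0
        for i, phi in enumerate(ar_coeffs, start=1):
            if t - i >= 0:
                prediction += phi * series[t - i]
        for j, theta in enumerate(ma_coeffs, start=1):
            if t - j >= 0:
                prediction += theta * residuals[t - j]
        residuals.append(value - prediction)
    return sum(e * e for e in residuals), residuals
```

A training routine would minimize the first return value over the coefficients.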

Bug fixes:

  • Checking out-of-bounds access of a 'svec' even if the size of svec is zero.
  • Fixed a minor bug allowing use of GCC 4.7 and higher to build from source.

MADlib v1.1

22 Mar 21:02

Release Date: 2013-August-09

New Features:

  • Singular Value Decomposition:
    • Added Singular Value Decomposition using the Lanczos bidiagonalization
      iterative method to decompose the original matrix into PBQ^T, where B is
      a bidiagonalized matrix. We assume that the original matrix is too big to
      load into memory but that B can be loaded into memory. B is then further
      decomposed into XSY^T using Eigen's JacobiSVD function. This restricts the
      number of features in the data matrix to about 5000.
    • This implementation provides SVD (for dense matrix), SVD_BLOCK (also for
      dense matrix but faster), SVD_SPARSE (convert a sparse matrix into a
      dense one, slower) and SVD_SPARSE_NATIVE (directly operate on the sparse
      matrix, much faster for really sparse matrices).
  • Principal Component Analysis:
    • Added a PCA training function that generates the top-K principal
      components for an input matrix. The original data is mean-centered by the
      function with the mean matrix returned by the function as a separate table.
    • The module also includes the projection function that projects a test data
      set to the principal components returned by the train function.
  • Linear Systems:
    • Added a module to solve linear systems of equations (Ax = b).
    • The module utilizes various direct methods from the Eigen library for
      dense systems. Given below is a summary of the methods (more details at
      http://eigen.tuxfamily.org/dox-devel/group__TutorialLinearAlgebra.html):
      • Householder QR
      • Partial Pivoting LU
      • Full Pivoting LU
      • Column Pivoting Householder QR
      • Full Pivoting Householder QR
      • Standard Cholesky decomposition (LLT)
      • Robust Cholesky decomposition (LDLT)
    • The module also includes direct and iterative methods for sparse linear
      systems:
      Direct:
      - Standard Cholesky decomposition (LLT)
      - Robust Cholesky decomposition (LDLT)
      Iterative:
      - In-memory Conjugate gradient
      - In-memory Conjugate gradient with diagonal preconditioners
      - In-memory Bi-conjugate gradient
      - In-memory Bi-conjugate gradient with incomplete LU preconditioners
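
As an illustration of one of the iterative methods listed above, the following Python sketch solves Ax = b by conjugate gradient with an optional diagonal (Jacobi) preconditioner. It is a textbook version written for a small dense, symmetric positive-definite matrix stored as a list of lists; it does not use Eigen and is not the MADlib implementation. All names and default values are assumptions of this sketch.

```python
def conjugate_gradient(matrix, rhs, use_diagonal_preconditioner=True,
                       tolerance=1e-10, max_iterations=1000):
    """Solve matrix * x = rhs for a symmetric positive-definite matrix."""
    n = len(rhs)

    def mat_vec(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def precondition(r):
        # Diagonal preconditioner: divide each residual entry by A[i][i].
        if use_diagonal_preconditioner:
            return [r[i] / matrix[i][i] for i in range(n)]
        return list(r)

    x = [0.0] * n
    residual = list(rhs)  # b - A*x with x = 0
    z = precondition(residual)
    direction = list(z)
    rz = dot(residual, z)
    for _ in range(max_iterations):
        if dot(residual, residual) ** 0.5 <= tolerance:
            break
        a_dir = mat_vec(direction)
        step = rz / dot(direction, a_dir)
        x = [xi + step * di for xi, di in zip(x, direction)]
        residual = [ri - step * adi for ri, adi in zip(residual, a_dir)]
        z = precondition(residual)
        rz_next = dot(residual, z)
        direction = [zi + (rz_next / rz) * di for zi, di in zip(z, direction)]
        rz = rz_next
    return x
```

For the system with A = [[4, 1], [1, 3]] and b = [1, 2], the exact solution is x = [1/11, 7/11].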

Bug fixes and other changes:

  • Robust input validation:
    • Validation of input parameters to various functions has been improved to
      ensure that it does not fail if double quotes are included as part of the
      table name.
  • Random Forest
    • The ID field in rf_train has been expanded from INT to BIGINT (MADLIB-764)
  • Various documentation updates:
    • Documentation updated for various modules including elastic net, linear
      and logistic regression.