NovaML.jl aims to provide a comprehensive and user-friendly machine learning framework written in Julia. Its objective is providing a unified API for various machine learning tasks, including supervised learning, unsupervised learning, and preprocessing, feature engineering etc.
Main objective of NovaML.jl is to increase the usage of Julia in daily data science and machine learning activities among students and practitioners.
Currently, the module and function naming in NovaML is similar to that of Scikit Learn to provide a familiarity to data science and machine learning practitioners. But NovaML is not a wrapper of ScikitLearn.
- Unified API using Julia's multiple dispatch and functor-style callable objects
- Algorithms for classification, regression, and clustering
- Preprocessing tools for data scaling, encoding, and imputation
- Model selection and evaluation utilities
- Ensemble methods
You can install NovaML.jl using Julia's package manager. From the Julia REPL, type ]
to enter the Pkg REPL mode and run:
pkg> add NovaML
The most prominent feature of NovaML is using functors (callable objects) to keep parameters as well as training and prediction. Assume model
represents a supervised algorithm. The struct model
keeps learned parameters and hyperparameters. It also behave as a function.
model(X, y)
trains the model.model(Xnew)
calculates the predictions forXnew
.
Here's a quick example of how to use NovaML.jl for a binary classification task:
using NovaML.Datasets
X, y = load_iris(return_X_y=true)
using NovaML.ModelSelection
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.2)
# Scale features
using NovaML.PreProcessing
scaler = StandardScaler()
scaler.fitted # false
# Fit and transform
Xtrnstd = scaler(Xtrn)
# transform with the fitted model
Xtststd = scaler(Xtst)
# Train a model
using NovaML.LinearModel
lr = LogisticRegression(η=0.1, num_iter=100)
using NovaML.MultiClass
ovr = OneVsRestClassifier(lr)
# Fit the model
ovr(Xtrnstd, ytrn)
# Make predictions
ŷtrn = ovr(Xtrnstd)
ŷtst = ovr(Xtststd)
# Evaluate the model
using NovaML.Metrics
acc_trn = accuracy_score(ytrn, ŷtrn);
acc_tst = accuracy_score(ytst, ŷtst);
println("Training accuracy: $acc_trn")
println("Test accuracy: $acc_tst")
# Training accuracy: 0.9833333333333333
# Test accuracy: 0.9666666666666667
load_auto_mpg
: Provides the Auto MPG dataset, a regression problem focused on predicting fuel efficiency. The dataset contains information about various automobiles from 1970-1982, featuring 7 characteristics including weight, horsepower, and acceleration, making it excellent for studying relationships between car attributes and fuel consumption.load_boston
: Loads the Boston Housing dataset, a classic regression problem. It contains information about housing in the Boston area, with 13 features and a target variable representing median home values.load_iris
: Provides access to the famous Iris flower dataset, useful for classification tasks. It includes 150 samples with 4 features each, categorized into 3 different species of Iris.load_breast_cancer
: Loads the Wisconsin Breast Cancer dataset, a binary classification problem. It contains features computed from digitized images of breast mass, with the goal of predicting whether a tumor is malignant or benign.load_wine
: Offers the Wine recognition dataset, suitable for multi-class classification. It includes 13 features derived from chemical analysis of wines from three different cultivars in Italy.make_blobs
: Generates isotropic Gaussian blobs for clustering or classification tasks. This function allows you to create synthetic datasets with a specified number of samples, features, and centers, useful for testing and benchmarking algorithms.make_moons
: Generates a 2D binary classification dataset in the shape of two interleaving half moons. This synthetic dataset is ideal for visualizing and testing classification algorithms, especially those that can handle non-linear decision boundaries.
using NovaML.Datasets
# Load data as a dictionary
data = load_boston()
#Dict{String, Any} with 4 entries:
# "feature_names" => ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", …
# "data" => [-0.234473 0.498748 … 0.0908246 -0.252759; -0.916107 -2.407…
# "target" => [-4.96729, 1.0265, -4.11056, -9.52761, 3.43768, -2.64256, 3…
# "DESCR" => "Boston House Prices dataset"
# Load X and y separately
X, y = load_boston(return_X_y=true)
StandardScaler
: Standardize features by removing the mean and scaling to unit varianceMinMaxScaler
: Scale features to a given rangeLabelEncoder
: Encode categorical features as integersOneHotEncoder
: Encode categorical features as one-hot vectorsPolynomialFeatures
: Generate polynomial and interaction features up to a specified degree
SimpleImputer
: A basic imputation transformer for filling in missing values in datasets using strategies such as mean, median, most frequent, or constant value.
CountVectorizer
: Convert a collection of text documents to a matrix of token counts, useful for text feature extractionTfidfVectorizer
: Transform a collection of raw documents to a matrix of TF-IDF features, combining the functionality ofCountVectorizer
with TF-IDF weighting
Adaline
: Adaptive Linear NeuronElasticNet
: Linear regression with combined L1 and L2 priors as regularizer, balancing between Lasso and Ridge modelsLasso
: Linear Model trained with L1 prior as regularizer, useful for producing sparse modelsLinearRegression
: Linear regression algorithmLogisticRegression
: Binary and multiclass logistic regressionPerceptron
: Simple perceptron algorithmRANSACRegression
: Robust regression using Random Sample Consensus (RANSAC) algorithm. It's particularly effective for fitting models in the presence of significant outliers in the data.Ridge
: Linear regression with L2 regularization, useful for dealing with multicollinearity in data
DecisionTreeClassifier
: Decision tree for classificationDecisionTreeRegressor
: Decision tree for regression
AdaBoostClassifier
: An ensemble method that sequentially applies a base classifier to reweighted versions of the training data, giving more emphasis to incorrectly classified instances in subsequent iterationsBaggingClassifier
: A meta-estimator that fits base classifiers on random subsets of the original dataset and aggregates their predictions to form a final prediction.GradientBoostingClassifier
: An ensemble method that builds an additive model in a forward stage-wise fashion, allowing for the optimization of arbitrary differentiable loss functions. It uses decision trees as base learners and combines them to create a strong predictive model.RandomForestClassifier
: An ensemble learning method that constructs multiple decision trees during training and outputs the class that is the mode of the classes of the individual trees.RandomForestRegressor
: An ensemble method that builds multiple decision trees for regression tasks and predicts by averaging their outputs. It combines bagging with random feature selection to create a robust, accurate model that often resists overfitting.VotingClassifier
: A classifier that combines multiple machine learning classifiers and uses a majority vote or the average predicted probabilities to predict the class labels.
KNeighborsClassifier
: K-nearest neighbors classifier
LatentDirichletAllocation
: A generative statistical model that allows sets of observations to be explained by unobserved groups. It's commonly used for topic modeling in natural language processing.PCA
: Principal Component Analysis, a dimensionality reduction technique that identifies the axes of maximum variance in high-dimensional data and projects it onto a lower-dimensional subspace.
accuracy_score
: Calculates the accuracy classification score, i.e., the proportion of correct predictions.auc
: Computes the Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) curve, evaluating the overall performance of a binary classifier.confusion_matrix
: Computes a confusion matrix to evaluate the accuracy of a classification. It shows the counts of true positive, false positive, true negative, and false negative predictions.mean_absolute_error
,mae
: Computes the average absolute difference between estimated and true values. This metric is robust to outliers and provides a linear measure of error.mae
is an alias formean_absolute_error
.mean_squared_error
,mse
: Computes the average squared difference between estimated and true values.mse
is an alias formean_squared_error
.r2_score
: Calculates the coefficient of determination (R²), measuring how well future samples are likely to be predicted by the model.adj_r2_score
: Computes the adjusted R² score, which accounts for the number of predictors in the model, penalizing unnecessary complexity.f1_score
: Computes the F1 score, which is the harmonic mean of precision and recall, providing a balance between the two.matthews_corcoef
: Calculates the Matthews correlation coefficient (MCC), a measure of the quality of binary classifications, considering all four confusion matrix categories.precision_score
: Computes the precision score, which is the ratio of true positive predictions to the total predicted positives.recall_score
: Computes the recall score, which is the ratio of true positive predictions to the total actual positives.roc_auc_score
: Computes the Area Under the Receiver Operating Characteristic Curve (ROC AUC), providing an aggregate measure of classifier performance.roc_curve
: Produces the values (fpr, tpr) to plot the Receiver Operating Characteristic (ROC) curve, showing the trade-off between true positive rate and false positive rate at various threshold settings.silhouette_samples
: Computes the silhouette coefficient for each sample in a dataset, measuring how similar an object is to its own cluster compared to other clusters. This metric is useful for evaluating the quality of clustering algorithms.
cross_val_score
: Apply cross validation scoreGridSearchCV
: Perform exhaustive search over specified parameter values for an estimator.learning_curve
: Generate learning curves to evaluate model performance as a function of the number of training samples, helping to diagnose bias and variance problemsRandomSearchCV
: Perform randomized search over specified parameter distributions for an estimator. RandomSearchCV is often more efficient than GridSearchCV for hyperparameter optimization, especially when the parameter space is large or when some parameters are more important than others.StratifiedKFold
: Provides stratified k-fold cross-validator, ensuring that the proportion of samples for each class is roughly the same in each foldtrain_test_split
: Split arrays or matrices into random train and test subsetsvalidation_curve
: Determine training and validation scores for varying parameter values, helping to assess how a model's performance changes with respect to a specific hyperparameter and aiding in hyperparameter tuning
using Plots
using NovaML.LinearModel: LogisticRegression
using NovaML.Metrics: roc_curve, auc
lr = LogisticRegression(random_state=1, solver=:lbfgs, λ=0.01)
lr(Xtrn, ytrn)
ŷ = lr(Xtst, type=:probs)[:, 2]
fpr, tpr, _ = roc_curve(ytst, ŷ)
roc_auc = auc(fpr, tpr)
plot(fpr, tpr, color=:blue,
linewidth=2,
title="Receiver Operator Characteristic (ROC) Curve",
xlabel="False Positive Rate",
ylabel="True Positive Rate",
label="AUC: $(round(roc_auc, digits=2))")
plot!([0, 1], [0, 1], color=:red,
linestyle=:dash, label=nothing, linewidth=2)
MulticlassPerceptron
: An extension of the binary perceptron algorithm for multi-class classification problems. It learns a linear decision boundary for each class and updates weights based on misclassifications.OneVsRestClassifier
: A strategy for multi-class classification that fits one binary classifier per class, treating the class as positive and all others as negative. It's versatile and can be used with any base binary classifier.
You can use ensemble methods like Random Forest for improved performance:
using NovaML.Ensemble: RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=5)
rf(Xtrn, ytrn)
ŷ = rf(Xtst)
SVC
: Support Vector Classifier. Binary classification which supports linear and RBF kernels. Doesn't support multiclass classification yet.
using NovaML.SVM: SVC
# Create an SVC instance
svc = SVC(kernel=:rbf, C=1.0, gamma=:scale)
# Train the model
svc(X_train, y_train)
# Make predictions
ypreds = svc(X_test)
AgglomerativeClustering
: A hierarchical clustering algorithm that builds nested clusters by merging or splitting them successively. This bottom-up approach is versatile and can create clusters of various shapes.DBSCAN
: Density-Based Spatial Clustering of Applications with Noise, a density-based clustering algorithm that groups together points that are closely packed together, marking points that lie alone in low-density regions as outliers.KMeans
: A popular and simple clustering algorithm that partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centroid). It's efficient for large datasets but assumes spherical clusters of similar size.
Use PCA for dimensionality reduction:
using NovaML.Decomposition: PCA
pca = PCA(n_components=2)
# fit
pca(X)
# transform if fitted / fit & transform if not
Xpca = pca(X)
# Inverse transform
Xorig = pca(Xpca, :inverse_transform)
NovaML supports piped data transformation and model training.
using NovaML.PreProcessing: StandardScaler
using NovaML.Decomposition: PCA
using NovaML.LinearModel: LogisticRegression
sc = StandardScaler()
pca = PCA(n_components=2)
lr = LogisticRegression()
# transform the data and fit the model
Xtrn |> sc |> pca |> X -> lr(X, ytrn)
# make predictions
ŷtst = Xtst |> sc |> pca |> lr
It is also possible to create pipelines using NovaML's Pipe
constructor:
using NovaML.Pipelines: pipe
# create a pipeline
pipe = pipe(
StandardScaler(),
PCA(n_components=2),
LogisticRegression())
# fit the pipe
pipe(Xtrn, ytrn)
# make predictions
ŷ = pipe(Xtst)
# make probability predictions
ŷprobs = pipe(Xtst, type=:probs)
using NovaML.PreProcessing: StandardScaler
using NovaML.SVM: SVC
using NovaML.PipeLines: Pipe
using NovaML.ModelSelection: GridSearchCV
scaler = StandardScaler()
svc = SVC()
pipe_svc = Pipe(scaler, svc)
param_range = [0.0001, 0.001, 0.01, 0.1]
param_grid = [
[svc, (:C, param_range), (:kernel, [:linear])],
[svc, (:C, param_range), (:gamma, param_range), (:kernel, [:rbf])]
]
gs = GridSearchCV(
estimator=pipe_svc,
param_grid=param_grid,
scoring=accuracy_score,
cv=10,
refit=true
)
gs(X_train, y_train)
println(gs.best_score)
println(gs.best_params)
Contributions to NovaML.jl are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.