Contextual Encoders

Contextual Encoders is a library of scikit-learn compatible contextual variable encoders.

The documentation can be found here: ReadTheDocs.

This package uses Poetry (documentation).

Installation

The library can be installed with pip:

pip install contextual-encoders

What are contextual variables?

Contextual variables are numerical or categorical variables whose values are related through a certain context or relationship. An example is the days of the week, which have a hidden graph structure: each day is connected to the next, and Sunday connects back to Monday.

When these categorical variables are encoded with a simple strategy such as One-Hot-Encoding, the hidden structure is lost. However, when the context can be specified, this additional information can be fed into the learning procedure to improve the performance of the model. This is where Contextual Encoders come into play.
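To make the loss of structure concrete, the following minimal sketch compares One-Hot distances with distances on the weekly cycle. It uses only the Python standard library and does not touch this package; the helper names `one_hot`, `one_hot_distance` and `cyclic_distance` are made up for illustration.

```python
import math

DAYS = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]


def one_hot(day):
    """Return the One-Hot vector of a day as a list of 0/1 values."""
    return [1.0 if d == day else 0.0 for d in DAYS]


def one_hot_distance(a, b):
    """Euclidean distance between the One-Hot vectors of two days."""
    return math.dist(one_hot(a), one_hot(b))


def cyclic_distance(a, b):
    """Number of day-to-day steps between two days on the weekly cycle."""
    diff = abs(DAYS.index(a) - DAYS.index(b))
    return min(diff, len(DAYS) - diff)


# Compare Monday with a neighbouring day, a distant day, and the wrap-around day.
for other in ["Tue", "Thur", "Sun"]:
    print(other, round(one_hot_distance("Mon", other), 3), cyclic_distance("Mon", other))
```

With One-Hot vectors, every pair of different days is equally far apart, so "Mon" is as far from "Tue" as from "Thur". The cyclic distance keeps the information that "Tue" and "Sun" are both direct neighbours of "Mon".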

Principle

The encoding of contextual variables is split into four sub-steps:

1. Define the context
2. Define the measure
3. Calculate the (dis-)similarity matrix
4. Map the dissimilarity matrix to Euclidean vectors

Step 4 is optional and depends on the ML technique that consumes the encoding. For example, agglomerative clustering techniques do not require Euclidean vectors; they can use a dissimilarity matrix directly.
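The four sub-steps can be sketched from scratch with NumPy as follows. This is an illustration of the idea only, not the implementation used by this package: the function names are hypothetical, the measure is a plain breadth-first shortest path, and classical multidimensional scaling (MDS) is assumed as one possible way to obtain Euclidean vectors. The library may use a different mapping.

```python
from collections import deque

import numpy as np

# Step 1: define the context as an undirected graph (here: the weekly cycle).
DAYS = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]
EDGES = [(DAYS[i], DAYS[(i + 1) % len(DAYS)]) for i in range(len(DAYS))]


def build_adjacency(edges):
    """Turn an edge list into a dict mapping each node to its neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def path_length(adjacency, start, goal):
    """Step 2: the measure, here the shortest path length found by BFS."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError(f"{goal!r} is not reachable from {start!r}")


def dissimilarity_matrix(values, adjacency):
    """Step 3: pairwise dissimilarities between all values."""
    n = len(values)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            matrix[i, j] = path_length(adjacency, values[i], values[j])
    return matrix


def classical_mds(d, n_components=2):
    """Step 4 (optional): embed a dissimilarity matrix with classical MDS."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Clip negative eigenvalues that arise when d is not exactly Euclidean.
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


adjacency = build_adjacency(EDGES)
d = dissimilarity_matrix(DAYS, adjacency)
points = classical_mds(d)
print(d[0])          # dissimilarities from "Mon" to every day
print(points.shape)  # one 2-D point per day
```

A technique that accepts a precomputed dissimilarity matrix can stop after `dissimilarity_matrix`; `classical_mds` is only needed when the downstream model expects feature vectors.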

Basic Usage

The code below demonstrates the basic usage of the library. Here, a simple dataset with 10 samples of a single feature is used.

from contextual_encoders import ContextualEncoder, GraphContext, PathLengthMeasure
import numpy as np


# Create a sample dataset
x = np.array(["Fri", "Tue", "Fri", "Sat", "Mon", "Tue", "Wed", "Tue", "Fri", "Fri"])

# Step 1: Define the context
day = GraphContext("day")
day.add_concept("Mon", "Tue")
day.add_concept("Tue", "Wed")
day.add_concept("Wed", "Thur")
day.add_concept("Thur", "Fri")
day.add_concept("Fri", "Sat")
day.add_concept("Sat", "Sun")
day.add_concept("Sun", "Mon")

# Step 2: Define the measure
day_measure = PathLengthMeasure(day)

# Steps 3 and 4: Calculate the (dis-)similarity matrix
#                and map it to Euclidean vectors
encoder = ContextualEncoder(day_measure)
encoded_data = encoder.transform(x)

similarity_matrix = encoder.get_similarity_matrix()
dissimilarity_matrix = encoder.get_dissimilarity_matrix()

The output of the code is visualized below. The graph-based structure is clearly visible when the Euclidean data points are plotted. Note that only five points can be seen because the days "Thur" and "Sun" are missing from the dataset.

[Figures: Similarity Matrix, Dissimilarity Matrix, Euclidean Data Points]
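The count of five points can be checked by hand, assuming that identical input values are mapped to the same point. The short sketch below uses plain Python lists and is independent of the library; `week`, `present` and `missing` are names chosen for illustration.

```python
# The sample dataset from the basic usage example, as a plain list.
x = ["Fri", "Tue", "Fri", "Sat", "Mon", "Tue", "Wed", "Tue", "Fri", "Fri"]
week = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]

observed = set(x)
present = [day for day in week if day in observed]      # days that appear in x
missing = [day for day in week if day not in observed]  # days that never appear

print(present)
print(missing)
```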

More complicated examples can be found in the documentation.

Notice

The preprocessing module of scikit-learn offers multiple encoders for categorical variables. These encoders use simple techniques to turn categorical variables into numerical ones. Additionally, the Category Encoders package offers more sophisticated encoders for the same purpose. This package is meant to be used as an extension to those two packages in cases where the context of a numerical or categorical variable can be specified.

This project is currently in the development stage.
