Why is it better to provide raw data to Khiops instead of preprocessed or encoded data? #510
This discussion is inspired by questions we often receive about data preparation for Khiops:
Replies: 1 comment
Khiops is specifically designed to work directly with raw data, minimizing the need for manual preprocessing. In fact, it is strongly recommended not to prepare your data beforehand, because preparation can lose information, whether through variable encoding, missing-value handling, or data flattening (propositionalization).

Khiops employs the MODL formalism, based on the MDL (Minimum Description Length) principle, to encode variables optimally:
This optimal encoding is inherently tied to the modeling process and adapts dynamically to the data, making manual encoding methods unnecessary (and less effective!).

Example: Why Encoding Isn’t Necessary

Imagine a dataset with a categorical variable "city" containing hundreds of unique values. In a traditional workflow:
With Khiops:
Similarly, for a numerical variable "age", Khiops would:
Handling Missing Values

Khiops also treats missing values as part of the data, recognizing that they can carry meaningful information. Missing values are not discarded or imputed arbitrarily. Instead, Khiops assigns them to their own group or interval if they are informative for the target variable. This approach ensures that missing data is leveraged effectively, rather than being treated as noise or ignored outright.

Scientific Basis

Khiops’ preprocessing capabilities are grounded in rigorous statistical principles. The MODL formalism:
For more details and examples, see this notebook tutorial or read about Optimal Encoding. This approach ensures that manual preprocessing is unnecessary (in fact, it is often counterproductive).
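As a rough illustration of the missing-values point above, here is a toy Python sketch using only the standard library. It is not Khiops code and does not implement the MODL algorithm. It only compares two ways of handling a missing "age": keeping missing values as their own group versus mean imputation. The data, the `churned` target, and the helper `target_rate` are all hypothetical and chosen for illustration.

```python
from statistics import mean

# Hypothetical toy data: (age, churned). None marks a missing age.
rows = [
    (23, 0), (31, 0), (38, 0), (45, 0), (52, 1), (67, 0),
    (None, 1), (None, 1), (None, 1), (None, 0),
]


def target_rate(group):
    """Share of positive targets in a list of (value, target) pairs."""
    return sum(target for _, target in group) / len(group)


# Option A: keep missing values as their own group.
missing = [row for row in rows if row[0] is None]
present = [row for row in rows if row[0] is not None]
rate_missing = target_rate(missing)
rate_present = target_rate(present)

# Option B: mean imputation. The formerly missing rows now carry an
# ordinary-looking age, so the "was missing" signal is no longer visible.
fill_value = mean(age for age, _ in present)
imputed = [(age if age is not None else fill_value, target) for age, target in rows]
still_flagged = sum(1 for age, _ in imputed if age is None)

print(f"target rate when age is missing: {rate_missing:.2f}")
print(f"target rate when age is present: {rate_present:.2f}")
print(f"rows still identifiable as missing after imputation: {still_flagged}")
```

In this made-up sample, the rows with a missing age have a much higher target rate than the rest, and that difference can no longer be recovered from the "age" column once the gaps are filled with the mean. This is the kind of information loss the answer warns about when data is prepared before being handed to Khiops.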