-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to deal with categorical parameters #191
Comments
Hi, the advanced notebook suggests a very vert basic approach for discrete variables that doesn't work very well. To deal with categorical variables it's even more complicated. I'm currently implementing it myself and I found a way that works extremely well. Basically for every categorical variable you create n new parameters (where n is the number of possible categories for that variable). You can then use these variables as a one hot encoder, when testing the function you check the highest one of the n new variables and you use the category corresponding to that. To make the GP work proprelly you will also have to create a custom kernel function that inherites the propriety of the one used by default. The only difference is that when computing the covariance of two points, for every categorical variable you have to set the dummy variable with the max value to 1 and all other dummy variables to 0. A similar approach can be used for discrete numerical variables, by rounding the number withing the kernel function you will get a step-shaped Gaussian process that has much better proprieties then the approach suggested in the advanced notebook |
@raff7 Hi, thanks for sharing. Can you give me your code to realise that ? I try the method of the advanced notebook and it doesn't seem to work. For what I saw the algorythm failed on interpreting the right continuous variable to choose when some are categorical. |
Hi @amapic, I experimented with the suggestion of @raff7 myself and gladly share a code example. If I understand it correctly, the goal is to give the kernel function the actual values that the function-to-optimize will use so the correct covariances are calculated. It should suffice to subclass the kernel function and wrap it with the corresponding code. The class WrappedMatern(Matern):
def __init__(self, param_types, length_scale=1.0, length_scale_bounds=(1e-5, 1e5), nu=1.5):
self.param_types = param_types
super(WrappedMatern, self).__init__(length_scale, length_scale_bounds, nu)
def __call__(self, X, Y=None, eval_gradient=False):
X = X.copy() # copy X, just to be sure
# we collect the positions of the categorical variables in this dict
categorical_group_indices = defaultdict(list)
for i, ptype in enumerate(self.param_types):
if ptype == 'continuous':
pass # do nothing
elif ptype == 'discrete':
X[:, i] = np.round(X[:, i])
else:
categorical_group_indices[ptype].append(i)
# set binary max for categorical groups
for indices in categorical_group_indices.values():
max_col = np.argmax(X[:, indices], axis=1)
X[:, indices] = 0
X[range(X.shape[0]), max_col] = 1
return super(WrappedMatern, self).__call__(X, Y, eval_gradient) When creating the optimizer instance, you have to replace the def f(select_a, select_b, x):
x = int(x)
if select_a > select_b:
return -(x**2 + 2*x + 2) # has maximum -1 at x=-1
else:
return -x**2 # has maximum 0 at x=0
modified_kernel = WrappedMatern(param_types=(0,0,'discrete'), nu=2.5)
bo = BayesianOptimization(f, pbounds=dict(select_a=(0,1),
select_b=(0,1),
x=(-100,100)))
bo._gp = GaussianProcessRegressor(kernel=modified_kernel,
alpha=1e-2, #alpha=1e-6,
normalize_y=True,
n_restarts_optimizer=5,
random_state=bo._random_state) Here, Now you can run |
I posted about this in another repo I think there is an issue with simply rounding to discrete values due to the nature of BO. Since categorical parameters are unordered, there can be an unwanted bias in parameter selection simply due to the order we supply our parameters in. When we fit the GPR, the prior will be higher for values next to a maxima. While this is fine for an ordered space, for an unordered parameter space this means that some categorical values will be default be seen as higher priority search areas, all else the same. To solve this we simply randomize the order prior each iteration prior to each fit, and then use the current iteration's randomized order when selecting from the acquisition function. This ensure that we can still use categorical variables in BO, but no values have an unearned bias in priority. |
The rounding above only applies to numerical variables that are limited to integer values. The idea for categorical variables is to one-hot encode them, meaning if you have a variable with categories {'A', 'B', 'C'}, you create 3 helper variables each with values in the real interval [0,1] and set the highest one to 1 and the others to 0. If you think of it as a vector, the possible states (1,0,0), (0,1,0), (0,0,1) are all equidistant and no ordering is implied. This holds true for any number of categories. Here is a paper by Garrido-Merchán and Hernández-Lobato supporting the approach @raff7 described. @ben-arnao, I am not sure if/how your approach is affected by the problems described in this paper. Isn't the randomization just causing noise in the acquisition function, s.t. the BO cannot make any prediction w.r.t. to the categorical parameters? |
@leifvan Hmm, i see. Are the helper variables separate dimensions when it comes to fitting the GPR? They would need to be in order to avoid the ordering bias i am discussing. I think this approach could work but how would it affect BO from a dimensionality standpoint? It is known the BO breaks down in higher dimensional space. This technique seems like it could very quickly start to break down BO with more dimensions than the optimization is recommended for. |
Hi @leifvan @ben-arnao, thanks for your answers, it helped me a lot. |
@leifvan Juste to be sure, do you think this approach has downside ? I don't have enough knowledge myself for that. |
@amapic i cant think of any downside of rounding the variable withing the kernel, and from my experiments, I can say with a decent amount of confidence that this method is the best to deal with discrete numerical variables. from bisect import bisect_left
import GPy
from time import time
import numpy as np
class augKernel(GPy.kern.Matern52):
def __init__(self,input_dim, variance=1., ARD=False,space=None):
self.space = space
super(augKernel,self).__init__(input_dim=input_dim, variance=variance,ARD=ARD)
def K(self, X, X2=None):
# t = time()
catIDs = self.space.get_categorical_index_in_model()
for cat in catIDs:#normalize categorical for X
am = np.argmax(X[:,cat],axis=1)
X[:,cat]= np.zeros(X[:,cat].shape)
for i,a in enumerate(am):
n=np.zeros(X[i,cat].shape)
n[a]=1
X[i,cat] = n
if(X2 is not None):#normalize categorical for X2
for cat in catIDs:#normalize categorical for X
am = np.argmax(X2[:,cat],axis=1)
X2[:,cat]= np.zeros(X2[:,cat].shape)
for i,a in enumerate(am):
n=np.zeros(X2[i,cat].shape)
n[a]=1
X2[i,cat] = n
discIDs, values = self.space.get_discrete_index_in_model()
for disc,val in zip(discIDs,values): #round discrete data
for Xid in range(len(X)):
X[Xid][disc] = self.take_closest(val,X[Xid][disc])#min(val, key=lambda x:abs(x-X[Xid][disc]))
if(X2 is not None):
for disc,val in zip(discIDs,values): #round discrete data
for Xid in range(len(X2)):
X2[Xid][disc] = self.take_closest(val,X2[Xid][disc])#min(val, key=lambda x:abs(x-X2[Xid][disc]))#
# print("K time {}".format(time()-t))
return super().K( X, X2)
def Gram_matrix(self, F, F1, F2, F3, lower, upper):
raise NotImplementedError()
def take_closest(self, myList, myNumber):
"""
Assumes myList is sorted. Returns closest value to myNumber.
If two numbers are equally close, return the smallest number.
"""
pos = bisect_left(myList, myNumber)
if pos == 0:
return myList[0]
if pos == len(myList):
return myList[-1]
before = myList[pos - 1]
after = myList[pos]
if after - myNumber < myNumber - before:
return after
else:
return before p.s. i don't use round because sometimes the discrete numerical variables are not as simple as being consecutive integer.. say you have to decide between a batch size of 8, 16 and 32. |
@ben-arnao yes, the limitation of this method is indeed the extra dimensions it creates, however this is the only method to avoid an order bias in categorical variables, as @leifvan explained, your randomization method would probably not help much, and it will cause a lot of noise for the categorical variables, making them basically useless |
I wouldn't say it would make them useless, but the performance would probably be affected to some degree. I think it would be interesting to do some tests to see how degraded the performance would be with random optimization and onehotencoded params as baselines. In an ideal world, yes onehotencoding categoricals would be best but i think with many real-life BO problems, dimensionality is a serious consideration. I do think it is a tradeoff and there is a point in which one will outweigh the other. |
@ben-arnao yes you are right, mine are just speculations as I didn't test it, the only way to know for sure is to try, however I would expect the randomization to degrade performance even more than the "order bias" would, so I would suggest comparing your method of randomization with the simple method of considering categorical variables as simple numerical variables, just to make sure it isn't just worst than doing nothing... |
Multiple categorical parameters
I glossed over that part as my code was only intended as a proof-of-concept, but there is a mechanism for multiple categorical parameters. They are identified by some hashable value, e.g. in the following line from my example above modified_kernel = WrappedMatern(param_types=(0,0,'discrete'), nu=2.5) the WrappedMatern(param_types=('a','a','a',7,7,7)) I tested my code again yesterday and there is a breaking error in there. If provided, the Limitations of the one-hot encoding
I agree with you, @raff7. One additional problem that Nguyen et al. point out in their recent work is that both rounding and setting the covariances of categorical variables to 0 or 1
Regarding the discussion of the shuffling approachI'm not sure if I 100% understood how your approach works, @ben-arnao, but I will try to explain by example why I don't think it provides a benefit. Suppose we have a categorical variable with three choices
+1 for that! In the meantime, the paper of Nguyen et al. I referenced above contains experiments using synthetic & real-world data where they compare a naive one-hot encoding with the kernel rounding method (called "Merchan-Lobato" in the paper). The approach they are proposing also sounds promising, so if we really start benchmarking our suggestions it might be interesting to see how their approach stacks up. |
Hi! I was reading the README where it says:
Take a look at the advanced to our notebook to learn how to make the package more flexible, how to deal with categorical parameters, how to use observers, and more.
But I can't find any reference to how to use categorical parameters in the notebook, could someone throw some light into this?
Thanks
The text was updated successfully, but these errors were encountered: