Implicit Diff for hyperparam optimization with stochastic solver #227
-
Hi! I'm playing around to better understand the mechanics of this great library before testing it in my current project. In the inner optimization loop I'm using a stochastic solver (sgd). First, is it possible to implicit diff through the stochastic solver? Thanks for your help!
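For context, here is a minimal sketch of the deterministic version of this setup (my own example with made-up names, not code from this discussion): a full-batch inner solver run inside the outer objective, with the hypergradient w.r.t. l2reg obtained through jaxopt's implicit differentiation. The question is what changes when the inner solver becomes stochastic (SGD over mini-batches).

import jax
from jaxopt import GradientDescent, objective

def outer_objective(l2reg, data_train, data_val, init_W):
    # Inner problem: fit W on the training split for the given l2reg (full batch).
    inner = GradientDescent(fun=objective.l2_multiclass_logreg, maxiter=100)
    W_fit = inner.run(init_W, l2reg=l2reg, data=data_train).params
    # Outer objective: validation loss of the fitted weights (no regularization).
    return objective.l2_multiclass_logreg(W_fit, 0.0, data_val)

# Because jaxopt solvers use implicit differentiation by default, jax.grad goes
# "through" inner.run without unrolling its iterations.
hypergrad_fn = jax.grad(outer_objective)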
Replies: 3 comments
-
I debugged the error a bit more, and I found that the problem occurs when jaxopt calls the optimality condition. Is that really necessary? I'm starting to think that implicit diff is not possible with stochastic solvers at the moment.
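For intuition about why the optimality condition has to be evaluated at the solution, here is a hand-rolled sketch of the implicit function theorem on a toy ridge-regression problem (not jaxopt's internal code): the hypergradient is obtained by solving a linear system built from the Jacobians of the optimality condition at the returned solution, so that condition must be well defined and (approximately) zero there.

import jax
import jax.numpy as jnp

X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = jnp.array([1.0, 2.0, 3.0])

def optimality(w, l2reg):
    # Gradient of 0.5*||Xw - y||^2 + 0.5*l2reg*||w||^2; zero at the inner solution.
    return X.T @ (X @ w - y) + l2reg * w

def solve(l2reg):
    # Closed-form inner solution (stands in for an iterative solver).
    return jnp.linalg.solve(X.T @ X + l2reg * jnp.eye(2), X.T @ y)

l2reg = 0.1
w_star = solve(l2reg)
# Implicit function theorem: dw*/dl2reg = -(dF/dw)^{-1} dF/dl2reg, evaluated at w_star.
A = jax.jacobian(optimality, argnums=0)(w_star, l2reg)
b = jax.jacobian(optimality, argnums=1)(w_star, l2reg)
dw_dl2reg = -jnp.linalg.solve(A, b)

With a stochastic inner solver the returned point generally does not zero the optimality condition for any single mini-batch, which is part of why this setting is delicate.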
-
Hi,

In Jaxopt the general rule is that the signature of the decorated function inner_loop_solver and the optimality_fun (that you pass to custom_root) must match. Hence:

- inner_loop_solver must take an init_params argument with the same pytree shape/dtype as the value it returns, even though this parameter is often ignored (it is only useful for warm starts).
- As you noticed, the data: Tuple[jnp.ndarray, jnp.ndarray] argument of l2_multiclass_logreg must appear in the inner_loop_solver signature, and there is a very good reason for that (a small sketch of this signature rule follows below).
Let's take a step back and look at your problem for a moment. First we need to remember that Jax is a functional language without side effects, one that rejects any modification of a global state; this is what allows an easy translation between maths and code. You want to differentiate through an inner solver that draws its mini-batches from an iterator. The drawback: an iterator is not a Jax object, it is even mutable (!!), which contradicts the functional, side-effect-free model above.

Solution 1

This solution is the well-posed one: it drops the stochasticity of the inner problem. The outer problem remains stochastic, as you can see.

@custom_root(jax.grad(inner_loss, has_aux=True), has_aux=True)
def inner_loop_solver(params, l2reg, data):
    inner_sol = params
    state = solver.init_state(inner_sol)
    for idx in range(inner_iters):
        print(idx)
        batch = data[idx]  # simulate iterations over mini-batches, with data a list or any pytree you like
        inner_sol, state = solver.update(
            params=inner_sol,
            state=state,
            l2reg=l2reg,
            data=(batch[0].reshape(-1, 784) / 255., batch[1])
        )
    return inner_sol
# we now construct the outer loss and perform gradient descent on it
def outer_loss(l2reg, data):
    inner_sol = inner_loop_solver(params, jnp.exp(l2reg), data)  # params: initial weights defined by the user earlier (not shown here)
    print("Outer iter")
    return objective.l2_multiclass_logreg(
        W=inner_sol, l2reg=0, data=(images_val, labels_val)), inner_sol

gd_outer = GradientDescent(fun=outer_loss, tol=1e-3, maxiter=50, has_aux=True)

data = [batch for batch in ds_train.take(inner_iters)]  # create the sequence of mini-batches
outer_state = gd_outer.init_state(l2reg)
for _ in range(outer_iters):
    data = [batch for batch in ds_train.take(inner_iters)]  # create a fresh sequence of mini-batches
    l2reg, outer_state = gd_outer.update(l2reg, outer_state, data)  # use the mini-batches for the current inner minimization step

This solution is really the only one that makes sense from a mathematical viewpoint.

Solution 2

On second thought, we could actually want to find a point that (approximately) satisfies the optimality condition on a single, fixed "representative" mini-batch. In this case, you can choose the representative mini-batch in advance and use it for implicit diff, hoping for the best: it is worth checking the value of the optimality condition at the returned solution to see how far off it is (see the sanity check further below).

@custom_root(jax.grad(inner_loss, has_aux=True), has_aux=True)
def inner_loop_solver(params, l2reg, data):  # data is now your representative mini-batch: you must choose it in advance
    inner_sol = params
    state = solver.init_state(inner_sol)
    for idx in range(inner_iters):
        print(idx)
        batch = next(ds_train)  # the forward updates stay stochastic; `data` is only consumed by the optimality condition during the backward pass
        inner_sol, state = solver.update(
            params=inner_sol,
            state=state,
            l2reg=l2reg,
            data=(batch[0].reshape(-1, 784) / 255., batch[1])
        )
    return inner_sol

# we now construct the outer loss and perform gradient descent on it
def outer_loss(l2reg, data):
    inner_sol = inner_loop_solver(params, jnp.exp(l2reg), data)
    print("Outer iter")
    return objective.l2_multiclass_logreg(
        W=inner_sol, l2reg=0, data=(images_val, labels_val)), inner_sol

gd_outer = GradientDescent(fun=outer_loss, tol=1e-3, maxiter=50, has_aux=True)

data = next(ds_train)
outer_state = gd_outer.init_state(l2reg)
for _ in range(outer_iters):
    data = next(ds_train)  # use the current mini-batch as a "representative mini-batch" of the whole dataset
    l2reg, outer_state = gd_outer.update(l2reg, outer_state, data)  # use the representative mini-batch for the current inner minimization step

Solution 2 is closer to what you are trying to achieve. Giving formal guarantees about the soundness of this approach is certainly possible, but it requires some work and some thinking about the well-posedness of your problem.
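Following up on the remark above about checking the value of the optimality condition, here is one way to do it. This is a sketch that reuses names from the Solution 2 snippet (inner_loss, inner_loop_solver, params, l2reg and the representative mini-batch data); the exact arguments must mirror whatever inner_loss actually takes in your code.

import jax
import jax.numpy as jnp

# Diagnostic only: recompute an inner solution for the current l2reg, then evaluate
# the optimality condition (gradient of the inner loss) on the representative batch.
inner_sol = inner_loop_solver(params, jnp.exp(l2reg), data)
residual, _ = jax.grad(inner_loss, has_aux=True)(inner_sol, jnp.exp(l2reg), data)
print("optimality residual norm:", jnp.linalg.norm(residual))
# A large norm means inner_sol is far from a stationary point for this batch, so the
# linearization behind implicit diff (and hence the hypergradient) may be unreliable.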
-
Hi! One thing that I noticed is that implicit diff is actually faster and more precise than unrolling the computational graph, but the memory footprint doesn't seem to be constant as I expected. It feels like a memory leak; are you aware of anything like this? You can try my example while monitoring the allocated GPU memory and compare unrolling with implicit diff using, for instance, 500 outer iterations.
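Not an answer to the growth itself, but one way to localize it is to dump device-memory profiles at regular intervals and diff them. This is a sketch that reuses gd_outer, l2reg, outer_state, data and outer_iters from the snippets above; jax.profiler.save_device_memory_profile is JAX's standard device-memory profiling hook, and the pprof comparison is an assumption about your tooling.

import jax

for it in range(outer_iters):
    l2reg, outer_state = gd_outer.update(l2reg, outer_state, data)
    if it % 50 == 0:
        l2reg.block_until_ready()  # make sure pending device work has finished
        jax.profiler.save_device_memory_profile(f"memory_{it}.prof")
# Comparing dumps, e.g. `pprof --diff_base=memory_0.prof memory_450.prof`, shows which
# allocations keep accumulating across outer iterations (implicit diff vs. unrolling).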