
bug No convergence using bayesian LSTMCell #474

Closed · pbischoff opened this issue May 11, 2021 · 5 comments

@pbischoff (Contributor)

I've been trying to use Bayesian LSTM layers in my research and I keep running into the same issue: the model's loss converges, but the accuracy stays at around 0.5 for a binary classification task.

To make sure the problem is not actually in my data, I set up a number of experiments using the embedded Reber grammar (ERG), classifying whether or not a string is a valid ERG string. This is a fairly simple task for RNNs, and especially for LSTMs, but it is also used as a benchmark in Long Short-Term Memory (Hochreiter & Schmidhuber, 1997).

I set up four experiments, all running on the same data:

Experiment No 1: Simple RNN Cell

The code to build the model is simply:

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(12,7))
cell = tf.keras.layers.SimpleRNNCell(4, activation='tanh')
rnn = tf.keras.layers.RNN(cell)(inputs)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(rnn)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=1)

As seen below, the model converges slowly but steadily and learns to classify valid strings with an accuracy of around 80 %.

(Screenshot: training curves for Experiment No 1)

Experiment No 2: Standard LSTM layer

inputs = tf.keras.layers.Input(shape=(12,7))
rnn = tf.keras.layers.LSTM(4)(inputs)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(rnn)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=1)

This model converges not only faster but also to a higher accuracy of around 90 %. This is the expected behaviour.

(Screenshot: training curves for Experiment No 2)

Experiment No 3: Bayesian Dense layers as output layers

In this case I am using a DenseFlipout layer from tensorflow-probability before the output layer.

import tensorflow_probability as tfp

kl_div = (lambda q, p, _: tfp.distributions.kl_divergence(q, p) / tf.cast(X_train.shape[0], dtype=tf.float32))
inputs = tf.keras.layers.Input(shape=(12,7))
rnn = tf.keras.layers.LSTM(4)(inputs)
dense = tfp.layers.DenseFlipout(20, activation='sigmoid', kernel_divergence_fn=kl_div)(rnn)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=1)

This model still works fine; the training behaviour is very similar to that of Experiment No 2.

(Screenshot: training curves for Experiment No 3)

Experiment No 4: LSTMCellFlipout from edward2, without the DenseFlipout layer

import edward2 as ed

inputs = tf.keras.layers.Input(shape=(12,7))
lstmcell = ed.layers.LSTMCellFlipout(4)
rnn = tf.keras.layers.RNN(lstmcell)(inputs)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(rnn)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=1)

This is where the problems start: as you can see in the training metrics below, the loss is still converging but the accuracy is not.

(Screenshot: training curves for Experiment No 4)

To be clear: this is obviously a very simple model, but I have tried a number of variations, none of which improved the model's behaviour. This is a list of things I tried:

  • experimenting with different initializers and regularizers
  • using trainable and non-trainable standard deviations for the regularizer, with different initial values
  • using multiple layers of the form tf.keras.layers.RNN(ed.layers.LSTMCellFlipout(n))
  • using the LSTMCellReparameterization class instead of the flipout version

Do you have any examples using the Bayesian LSTM cells where this behaviour does not occur? Do you have any idea where it could come from?

@dustinvtran (Member) commented May 11, 2021

@dusenberrymw has quite a bit of experience with Bayesian LSTMs and may be able to help. A codebase with it is https://github.com/Google-Health/records-research/tree/master/model-uncertainty.

For me, the weirdest thing is the loss curve, where it apparently takes quite a few epochs before it goes <100. The other RNNs start at <1. Some ideas:

  • Is the loss the same unit-wise? For example, is the KL properly divided by the minibatch size if you compute reduce_mean on the NLL component? If you want to keep model.fit, you'll need to change the regularizers to be their default but divided by the batch_size. That way, Keras computes cross_entropy + regularizer, where cross_entropy is tf.reduce_mean(nll) and regularizer is kl/batch_size.

  • Gradient clipping is quite important for Bayesian LSTMs. Have you tried this? (See the sketch after this list.)

  • The issue that loss goes down but accuracy doesn't change seems very strange. How are you computing predictions?
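
For the gradient-clipping point, here is a minimal sketch of what I mean, assuming you keep model.fit (the clipnorm value of 1.0 is arbitrary and usually needs tuning):

# Gradient clipping via the optimizer: clipnorm clips each gradient tensor
# to the given maximum norm before the update is applied.
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])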

@pbischoff (Contributor, Author) commented May 12, 2021

Many thanks for your ideas @dustinvtran! You are right that they are not on the same scale. As seen in the first line of the code snippet in Experiment No 3, the kl_div function there divides the divergence by the number of training samples (this comes from a tutorial on the tensorflow_probability website). If this is not done, the behaviour of Experiment No 3 changes to the same as in Experiment No 4.

  • I tried scaling the divergence by using a custom training loop, roughly as seen in this comment (and as sketched below the images). But I actually had to scale by the number of training samples as well, instead of using the batch size. What also really seems to make a difference is using a larger batch size; I went from 32 to 256. This makes sense to me: I suppose that with too small a batch size, the variance from the nondeterministic parameters is too large and prevents the model from learning?
  • Comparing two cases where I use gradient clipping vs. not using it, I find that in this specific case the model doesn't train as well with clipping and the training process is less stable (as can be compared in the following images). I assume, though, that this is highly dependent on the data one is working with and can also be tuned by choosing the right clipping values.

(Screenshots: training curves comparing runs with and without gradient clipping)
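
For reference, the custom training loop I mentioned in the first point was roughly of this form (a simplified sketch; train_dataset and num_train_samples are placeholders for my own data pipeline):

optimizer = tf.keras.optimizers.Adam()
bce = tf.keras.losses.BinaryCrossentropy()

@tf.function
def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        preds = model(x_batch, training=True)
        nll = bce(y_batch, preds)
        # model.losses holds the KL terms added by the Bayesian layers;
        # scale them by the number of training samples, not the batch size.
        kl = tf.add_n(model.losses) / num_train_samples
        loss = nll + kl
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for epoch in range(100):
    for x_batch, y_batch in train_dataset:
        train_step(x_batch, y_batch)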

This leads me to two questions regarding edward2 though:

  1. Since scaling the KL divergence is obviously very important, isn't there a way to include the scaling in edward2? Possibly by overriding the model.fit method from Keras, or by including it in the layers themselves?
  2. Basically the same question for gradient clipping: if it's so important for Bayesian LSTMs, is there a reason it's not implemented here? Or is it planned for the future?

I feel like this could improve the usability quite a bit and make it less complicated to switch from deterministic keras models.
If this is an option, I'd be happy to contribute something to the development!

@dustinvtran (Member)

Thanks for following up.

@pbischoff: "But I actually had to scale by the number of training samples as well instead of using the batch size"

Oops, you're correct. That's the right constant to scale by!

To answer your questions:

  1. If you're using model.fit, you can scale the regularizer by explicitly setting the argument that is otherwise left at its default:
ed.layers.LSTMCellFlipout(
    units=512,
    kernel_regularizer=ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size),
    recurrent_regularizer=ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size),
)

Here's an example doing that for a Wide ResNet CIFAR baseline. We thought about including the dataset size as a necessary argument to the layer, but ultimately it seemed to complicate the Keras abstraction, as that's a special setting of how regularizers work more broadly. (A fuller sketch for your Experiment 4 setup follows after this list.)

  2. Great question. It's hard to tell what consistently works for a lot of these models, so I think a useful first step is to make these experimentations possible. A second step is to improve documentation and prescriptions for practices on using the models (for this, we should add conclusions from this discussion as well as ample links to examples like @dusenberrymw's paper, uncertainty-baselines implementation, and any success you've had). If we indeed find the practices are so commonly used, then we should make that advice not only easy to find but also the default. Open to your thoughts here!
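
For your Experiment 4 setup, putting the pieces together would look roughly like the following (an untested sketch; the Adam optimizer and clipnorm value are only illustrative). With the scaled regularizers, model.fit then computes cross_entropy + kl/dataset_size directly:

import edward2 as ed
import tensorflow as tf

dataset_size = X_train.shape[0]
cell = ed.layers.LSTMCellFlipout(
    4,
    kernel_regularizer=ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size),
    recurrent_regularizer=ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size),
)
inputs = tf.keras.layers.Input(shape=(12,7))
rnn = tf.keras.layers.RNN(cell)(inputs)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(rnn)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),  # gradient clipping as discussed
              loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=1)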

@pbischoff (Contributor, Author)

Hi!

  1. Regarding the scaling, IMO it would be preferable to have another argument, e.g. scaling_factor, that gets applied to both regularizers. At least for me this was a source of error, because I simply forgot to scale the recurrent_regularizer. I'm not sure, though, whether there are cases where no scaling should be applied; but the argument could still default to 1. This would simplify user code from
ed.layers.LSTMCellFlipout(
    units=512,
    kernel_regularizer=ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size),
    recurrent_regularizer=ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size),
)

to

ed.layers.LSTMCellFlipout(
    units=512,
    scaling_factor=1./dataset_size
)

I think it would also be closer to the way it's implemented in tensorflow_probability, where one argument to the DenseFlipout layer is the kernel_divergence_fn.

  2. I agree with your plan for continuing on this. I am still working on applying this to my original data now that it's working with the embedded Reber grammar. If you agree, I'd open a pull request to add the ERG example to the package, make it easier to find out about gradient clipping, and hopefully show some of the pitfalls that I found during my work.

@dustinvtran (Member)

Got it. Regarding 1: it's preferable to keep the regularizer semantics, because that flexibility is quite often needed for tweaking BNN layers (e.g., adding L2 regularization on top, or even swapping the KL penalty to maximize entropy or to use an alternative divergence).
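
As a rough, untested sketch of the kind of flexibility I mean (this assumes the layer passes its weight random variable to the regularizer callable; kl_plus_l2 and the constants are purely illustrative):

def kl_plus_l2(dataset_size, l2=1e-4):
  # Combine the scaled KL penalty with an extra L2 term in a single
  # regularizer callable that Keras applies to the layer's weight.
  kl = ed.regularizers.NormalKLDivergence(scale_factor=1./dataset_size)
  def regularizer(weight):
    weight_sample = tf.convert_to_tensor(weight)  # sampled/point value of the weight
    return kl(weight) + l2 * tf.reduce_sum(tf.square(weight_sample))
  return regularizer

cell = ed.layers.LSTMCellFlipout(
    units=512,
    kernel_regularizer=kl_plus_l2(dataset_size=50000),
    recurrent_regularizer=kl_plus_l2(dataset_size=50000),
)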
