We haven't tried GSM8k, so this is not known to us. I wonder if the "math" nature of GSM8k makes it quite different from the QA-style or inference-style datasets we tried. Can you paste the LASER hyperparameters that you have tried for your LLM below? Also, what is the number of layers in the LLM? I saw you used 51, which suggests this model is much deeper than Llama 2 (32 layers), GPT-J (28 layers), and RoBERTa (12 layers). I do want to try GSM8k with LASER + Llama and a very exhaustive search of hyperparameters. Currently, I am finishing experiments on Phi-1.5, and then I can look into this and the Mistral LLM request.
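
For reference, a LASER intervention replaces a single weight matrix with its truncated-SVD (low-rank) approximation, so the hyperparameters being asked about are the layer index, the matrix type (e.g. an MLP or attention projection), and the fraction of the rank to keep. Below is a minimal PyTorch sketch of that reduction; the module path and layer index are illustrative assumptions, not from this thread:

```python
import torch

def laser_reduce(weight: torch.Tensor, rho: float) -> torch.Tensor:
    """Return the rank-reduced SVD approximation of `weight`.

    `rho` is the fraction of the maximum rank to keep, e.g. rho=0.01
    keeps only the top 1% of singular components.
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(rho * S.numel()))  # number of singular components kept
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Hypothetical usage on a Llama-style model: reduce the MLP
# down-projection of one layer (path and index are examples only).
# proj = model.model.layers[28].mlp.down_proj
# proj.weight.data = laser_reduce(proj.weight.data, rho=0.01)
```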
Hi!
Right now the HF leaderboard has multiple models with LASER interventions, and all of them seem to show a drop in GSM8K results relative to their base models.
Is this a known behavior? The paper talks about enhancing reasoning abilities, and GSM8K should be more closely related to reasoning than other benchmarks.
I thought it would be an interesting subject to discuss.