During our survey, we found it quite challenging to conduct fair comparisons among different PEFT algorithms:

- Under the same settings, the same method performs poorly in some papers yet noticeably better in others.
- Different papers may use different settings for different tasks, involving factors such as which layers are fine-tuned, the number of trainable parameters, and the choice of task datasets.
- During our replication attempts, some methods could not reproduce the results reported in their original papers, and some have not released their code.
To address these issues, we make a preliminary attempt and test different methods under the same settings. Since tasks like NLU are highly sensitive to random seeds, we conduct a unified evaluation on commonsense reasoning (CR) tasks. Every result we report satisfies one of the following three conditions:

- The method's code is open-source and a download link for the CR task weights is available; we have verified that these weights produce results consistent with the original paper.
- The method's code is open-source, but no official weights have been released; we report the results obtained by running the official code on the CR tasks.
- The method is not open-source, but we have successfully reproduced results consistent with the original paper.
In short, we do not report any results that we are unable to reproduce, and we provide the weights corresponding to the CR results reported in our paper (see the loading sketch below). We hope this attempt assists researchers and contributes to the advancement of the field.
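For the first condition, verification means attaching the released adapter weights to the base model and re-running the CR evaluation. Below is a minimal sketch of that step, assuming the checkpoint follows the standard HuggingFace PEFT adapter layout; the base-model name and adapter path are placeholders, not the exact artifacts used here.

```python
# Minimal verification sketch: attach released CR adapter weights to the base
# model and spot-check one example. Names and paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "yahma/llama-7b-hf"          # placeholder LLaMA-7B checkpoint
ADAPTER_PATH = "./downloaded_cr_adapter"  # placeholder: downloaded CR weights

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)  # load adapter on top
model.eval()

# Spot-check one BoolQ-style prompt; a full run loops over each CR dataset and
# compares accuracy against the numbers reported in the original paper.
prompt = (
    "Please answer the following question with true or false.\n"
    "Question: is the Earth larger than the Moon?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```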
Method | Params | BoolQ | PIQA | SIQA | HellaS. | WinoG. | ARC-e | ARC-c | OBQA | Avg. |
---|---|---|---|---|---|---|---|---|---|---|
ChatGPT | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
Fine-tuning LLaMA-7B | ||||||||||
Fully FT | 100% | 69.9 | 84.2 | 78.9 | 92.3 | 83.3 | 86.6 | 72.8 | 83.4 | 81.4 |
Prefix | 0.11% | 64.3 | 76.8 | 73.9 | 42.1 | 72.1 | 72.9 | 54.0 | 60.6 | 64.6 |
Series | 0.99% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
Parallel | 3.54% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.2 |
LoRA (r=4) | 0.10% | 2.3 | 46.1 | 18.3 | 19.7 | 55.2 | 65.4 | 51.9 | 57.0 | 39.5 |
AdaLoRA (r=4) | + 0.6k | 66.1 | 78.1 | 74.3 | 34.0 | 74.4 | 76.7 | 57.5 | 71.2 | 66.5 |
FLoRA (r=4) | + 2.6k | 67.2 | 78.0 | 72.9 | 65.4 | 73.8 | 73.8 | 55.3 | 71.8 | 69.8 |
DoRA (r=4) | + 877k | 51.3 | 42.2 | 77.8 | 25.4 | 78.8 | 78.7 | 62.5 | 78.6 | 61.9 |
LoRA-Dash (r=4) | + 1.3k | 65.2 | 79.9 | 78.3 | 82.8 | 77.1 | 78.6 | 65.4 | 78.4 | 75.7 |
LoRA (r=32) | 0.83% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
AdaLoRA (r=32) | + 5.1k | 69.1 | 82.2 | 77.2 | 78.3 | 78.2 | 79.7 | 61.9 | 77.2 | 75.5 |
FLoRA (r=32) | + 164k | 66.4 | 81.3 | 77.1 | 75.6 | 77.1 | 77.2 | 62.4 | 77.6 | 74.3 |
DoRA (r=32) | + 877k | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
LoRA-Dash (r=32) | + 1.3k | 69.9 | 82.8 | 78.6 | 84.9 | 81.6 | 82.3 | 66.5 | 80.8 | 78.4 |
Fine-tuning LLaMA3-8B | ||||||||||
LoRA (r=16) | 0.35% | 72.3 | 86.7 | 79.3 | 93.5 | 84.8 | 87.7 | 75.7 | 82.8 | 82.8 |
AdaLoRA (r=16) | + 2.6k | 90.4 | 85.0 | 76.7 | 79.1 | 83.3 | 86.4 | 75.1 | 75.4 | 81.4 |
FLoRA (r=16) | + 41k | 90.2 | 84.2 | 79.9 | 79.3 | 85.1 | 86.7 | 74.8 | 93.9 | 84.2 |
LoRA-Dash (r=16) | + 1.3k | 74.8 | 88.0 | 80.6 | 95.2 | 85.6 | 89.0 | 77.4 | 84.8 | 84.4 |
LoRA (r=32) | 0.70% | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
PiSSA (r=32) | + 0 | 67.1 | 81.1 | 77.2 | 83.6 | 78.9 | 77.7 | 63.2 | 74.6 | 75.4 |
MiLoRA (r=32) | + 0 | 68.8 | 86.7 | 77.2 | 92.9 | 85.6 | 86.8 | 75.5 | 81.8 | 81.9 |
DoRA (r=32) | + 784k | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
LoRA-Dash (r=32) | + 1.3k | 75.3 | 88.5 | 80.2 | 95.7 | 86.8 | 90.7 | 80.2 | 85.6 | 85.4 |
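In the "Params" column, percentages give the share of trainable parameters relative to the full model, and "+ k" entries give parameters added on top of the corresponding LoRA baseline of the same rank (hence "+ 0" for PiSSA and MiLoRA, which only change LoRA's initialization). As a reference for how such a share can be computed, here is a minimal sketch using the `peft` library; the base model and target modules are assumptions, since the exact fine-tuned layers vary across papers.

```python
# Minimal sketch: count the trainable-parameter share after wrapping a base
# model with LoRA, as in the "Params" column. Target modules are an assumption.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("yahma/llama-7b-hf")  # placeholder
config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, config)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} = {100 * trainable / total:.2f}%")
# peft's model.print_trainable_parameters() prints the same summary.
```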
We also provide some of the weights from this work:
Method | Links |
---|---|
LoRA-Dash | Google Drive |
DoRA | GitHub |
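If you only need the fine-tuned model for inference, the adapter can be folded back into the base weights. A short sketch, assuming the checkpoints above are standard PEFT adapters and reusing `model` and `tokenizer` from the verification sketch earlier; the output path is hypothetical:

```python
# Merge the adapter into the base weights so inference needs no peft at runtime.
# `model` is the PeftModel from the verification sketch; the path is hypothetical.
merged = model.merge_and_unload()
merged.save_pretrained("./llama-7b-cr-merged")
tokenizer.save_pretrained("./llama-7b-cr-merged")
```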