accuracy variation depending on the number of GPUs used #2

zhl98 · 2022-03-31T03:43:38Z

Hello,Thank you very much for your code！
I used the setting of dytox in the code for 10 steps of training, but I failed to achieve the accuracy in the paper.
bash train.sh 0 --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path MY_PATH_TO_DATASET --output-basedir PATH_TO_SAVE_CHECKPOINTS
Here is the reproduction result：

avg acc is 69.54.
Can you give me some advice? thank you very much!

The text was updated successfully, but these errors were encountered:

arthurdouillard · 2022-03-31T08:54:19Z

After cleaning the code I've only tested for cifar 50 steps where results where exactly reproduced. I'm re-launching 10 steps to check that.

zhl98 · 2022-03-31T08:58:34Z

OK, thank you very much!

arthurdouillard · 2022-04-01T09:10:04Z

Hey, so I haven't time to full reproduce 10 steps with a single GPU but the first 5 steps are indeed like yours.
While when runned with 2 GPUs, I got the exact (even a little better) results from my paper.

I think the error comes from that with two GPUs, I'm actually using a batch size twice larger (PyTorch's DDP will use batch_size on each GPU). So my batch size is bigger than yours which can explain the results.

So what you can do is modifying the cifar_dytox.yaml, and increase the batch size to 256 (128*2).
This option file should work:

#######################
# DyTox, for CIFAR100 #
#######################

# Model definition
model: convit
embed_dim: 384
depth: 6
num_heads: 12
patch_size: 4
input_size: 32
local_up_to_layer: 5
class_attention: true

# Training setting
no_amp: true
eval_every: 50

# Base hyperparameter
weight_decay: 0.000001
batch_size: 128
incremental_lr: 0.0005
incremental_batch_size: 256  # UPDATE VALUE
rehearsal: icarl_all

# Knowledge Distillation
auto_kd: true

# Finetuning
finetuning: balanced
finetuning_epochs: 20

# Dytox model
dytox: true
freeze_task: [old_task_tokens, old_heads]
freeze_ft: [sab]

# Divergence head to get diversity
head_div: 0.1
head_div_mode: tr

# Independent Classifiers
ind_clf: 1-1
bce_loss: true


# Advanced Augmentations, here disabled

## Erasing
reprob: 0.0
remode: pixel
recount: 1
resplit: false

## MixUp & CutMix
mixup: 0.0
cutmix: 0.0

If you have time to tell me if it's working better great, otherwise I'll check it in the coming weeks.

Since I'm 100% sure the results are reproducible with two GPUs, the problem must be that.

zhl98 · 2022-04-03T07:29:08Z

Hey, after update the incremental_batch_size to 256 , runned with 1 GPU ,the result is still only 69.50%.

But it does seem that the effect of two GPUs is better.
I tested dytox_plus with 2 GPUS getting avg 76.17% (even a little better than your paper).

arthurdouillard · 2022-04-03T11:53:00Z

Hum... I'm launching experiments with batch size of 256 (the yaml that I gave you only did it for step t>1 not t=0 my bad), with a LR of 0.0005 (the default one) and a LR of 0.001 (twice bigger as it would have been if using two GPUs).

I'm also enabling mixed-precision (no_amp: false) to go faster.

I'll keep you updated.

Kishaan · 2022-05-10T08:39:28Z

HI,

Posting it here because I'm having the same issue. I ran the Dytox model on Cifar-100 with the same setting as in the first comment here, on a single GPU, and I'm getting the following log

{"task": 0, "epoch": 499, "acc": 92.5, "avg_acc": 92.5, "forgetting": 0.0, "acc_per_task": [92.5], "train_lr": 1.0004539958280581e-05, "bwt": 0.0, "fwt": 0.0, "test_acc1": 92.5, "test_acc5": 99.4, "mean_acc5": 99.4, "train_loss": 0.05053, "test_loss": 0.36721, "token_mean_dist": 0.0, "token_min_dist": 0.0, "token_max_dist": 0.0}
{"task": 1, "epoch": 19, "acc": 85.55, "avg_acc": 89.02, "forgetting": 0.0, "acc_per_task": [87.7, 83.4], "train_lr": 1.2500000000000004e-05, "bwt": 0.0, "fwt": 87.7, "test_acc1": 85.55, "test_acc5": 96.95, "mean_acc5": 98.18, "train_loss": 0.03499, "test_loss": 0.80777, "token_mean_dist": 0.54355, "token_min_dist": 0.54355, "token_max_dist": 0.54355}
{"task": 2, "epoch": 19, "acc": 78.67, "avg_acc": 85.57, "forgetting": 6.25, "acc_per_task": [80.0, 74.0, 82.0], "train_lr": 1.2500000000000004e-05, "bwt": -4.17, "fwt": 80.57, "test_acc1": 78.67, "test_acc5": 94.9, "mean_acc5": 97.08, "train_loss": 0.0259, "test_loss": 1.07032, "token_mean_dist": 0.58243, "token_min_dist": 0.53487, "token_max_dist": 0.61953}
{"task": 3, "epoch": 19, "acc": 73.32, "avg_acc": 82.51, "forgetting": 11.6, "acc_per_task": [71.3, 69.8, 70.6, 81.6], "train_lr": 1.2500000000000004e-05, "bwt": -7.88, "fwt": 75.57, "test_acc1": 73.33, "test_acc5": 93.1, "mean_acc5": 96.09, "train_loss": 0.02083, "test_loss": 1.37981, "token_mean_dist": 0.58081, "token_min_dist": 0.52581, "token_max_dist": 0.61908}
{"task": 4, "epoch": 19, "acc": 69.46, "avg_acc": 79.9, "forgetting": 16.5, "acc_per_task": [65.3, 65.9, 60.7, 71.7, 83.7], "train_lr": 1.2500000000000004e-05, "bwt": -11.33, "fwt": 71.7, "test_acc1": 69.46, "test_acc5": 92.04, "mean_acc5": 95.28, "train_loss": 0.0163, "test_loss": 1.65585, "token_mean_dist": 0.58517, "token_min_dist": 0.51872, "token_max_dist": 0.62832}
{"task": 5, "epoch": 19, "acc": 68.23, "avg_acc": 77.96, "forgetting": 19.32, "acc_per_task": [64.1, 59.3, 54.6, 64.9, 79.3, 87.2], "train_lr": 1.2500000000000004e-05, "bwt": -13.99, "fwt": 69.28, "test_acc1": 68.23, "test_acc5": 91.15, "mean_acc5": 94.59, "train_loss": 0.01265, "test_loss": 1.64966, "token_mean_dist": 0.6064, "token_min_dist": 0.5128, "token_max_dist": 0.70423}
{"task": 6, "epoch": 19, "acc": 64.01, "avg_acc": 75.96, "forgetting": 22.3, "acc_per_task": [60.5, 52.0, 48.8, 56.2, 71.9, 80.3, 78.4], "train_lr": 1.2500000000000004e-05, "bwt": -16.37, "fwt": 67.09, "test_acc1": 64.01, "test_acc5": 89.11, "mean_acc5": 93.81, "train_loss": 0.01232, "test_loss": 1.96759, "token_mean_dist": 0.60002, "token_min_dist": 0.50834, "token_max_dist": 0.7036}
{"task": 7, "epoch": 19, "acc": 60.25, "avg_acc": 74.0, "forgetting": 25.642857, "acc_per_task": [55.3, 46.9, 43.2, 50.9, 60.3, 74.3, 65.3, 85.8], "train_lr": 1.2500000000000004e-05, "bwt": -18.69, "fwt": 64.47, "test_acc1": 60.25, "test_acc5": 87.64, "mean_acc5": 93.04, "train_loss": 0.00952, "test_loss": 2.14214, "token_mean_dist": 0.59949, "token_min_dist": 0.50265, "token_max_dist": 0.70439}
{"task": 8, "epoch": 19, "acc": 58.38, "avg_acc": 72.26, "forgetting": 28.075, "acc_per_task": [53.6, 42.7, 41.5, 48.0, 53.9, 67.2, 57.3, 77.7, 83.5], "train_lr": 1.2500000000000004e-05, "bwt": -20.77, "fwt": 62.42, "test_acc1": 58.38, "test_acc5": 85.98, "mean_acc5": 92.25, "train_loss": 0.00978, "test_loss": 2.24582, "token_mean_dist": 0.59777, "token_min_dist": 0.49842, "token_max_dist": 0.70554}
{"task": 9, "epoch": 19, "acc": 54.61, "avg_acc": 70.5, "forgetting": 31.277778, "acc_per_task": [50.0, 39.4, 32.4, 44.1, 47.7, 63.2, 49.8, 66.5, 74.0, 79.0], "train_lr": 1.2500000000000004e-05, "bwt": -22.87, "fwt": 60.31, "test_acc1": 54.61, "test_acc5": 83.76, "mean_acc5": 91.4, "train_loss": 0.00789, "test_loss": 2.54448, "token_mean_dist": 0.59817, "token_min_dist": 0.49496, "token_max_dist": 0.70778}
{"avg": 70.49870843967983}

Is this accuracy expected? The final accuracy (54.61) is lower than the number I see on the paper for cifar-100, 10 steps. I'm trying to understand how multi-gpu training alone can bring in such a big improvement. Any help would be much appreciated.

arthurdouillard · 2022-05-10T09:32:22Z

Hello, I'm still trying to improve perfs on a single GPU. I'll keep this issue updated if I find ways to do it.

In the mean time, try running on two GPUs, as the results have been reproduced by multiple people (including @zhl98 for openned this issue).

Kishaan · 2022-05-23T07:40:32Z

Hi,

Just a short update. I thought repeated augmentation could be the reason behind improved results in multi-GPU, so I ran it without RA, but I was still getting around 59% accuracy, which means that cannot be the reason. Please let us know if you were able to figure out how to make it work in single-GPU setting.

arthurdouillard · 2022-05-23T09:44:09Z

Yeah, I chatted with Hugo Touvron (the DeiT main author) and he also suggested RA. I've tried multi-gpu without RA and single-gpu with RA, and nothing significantly changed.

I'll keep you updated.

arthurdouillard · 2022-06-16T20:10:10Z

Accuracy variation is in major part explained in the following erratum.
We are trying to see how we could emulate our distributed memory (see erratum) in the single GPU setting.

arthurdouillard self-assigned this Apr 1, 2022

arthurdouillard changed the title ~~accuracy~~ accuracy variation depending on the number of GPUs used Apr 3, 2022

arthurdouillard mentioned this issue May 31, 2022

imagenet-100 result cannot reproduce #6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

accuracy variation depending on the number of GPUs used #2

accuracy variation depending on the number of GPUs used #2

zhl98 commented Mar 31, 2022

arthurdouillard commented Mar 31, 2022

zhl98 commented Mar 31, 2022

arthurdouillard commented Apr 1, 2022 •

edited

Loading

zhl98 commented Apr 3, 2022

arthurdouillard commented Apr 3, 2022

Kishaan commented May 10, 2022 •

edited

Loading

arthurdouillard commented May 10, 2022

Kishaan commented May 23, 2022

arthurdouillard commented May 23, 2022

arthurdouillard commented Jun 16, 2022

accuracy variation depending on the number of GPUs used #2

accuracy variation depending on the number of GPUs used #2

Comments

zhl98 commented Mar 31, 2022

arthurdouillard commented Mar 31, 2022

zhl98 commented Mar 31, 2022

arthurdouillard commented Apr 1, 2022 • edited Loading

zhl98 commented Apr 3, 2022

arthurdouillard commented Apr 3, 2022

Kishaan commented May 10, 2022 • edited Loading

arthurdouillard commented May 10, 2022

Kishaan commented May 23, 2022

arthurdouillard commented May 23, 2022

arthurdouillard commented Jun 16, 2022

arthurdouillard commented Apr 1, 2022 •

edited

Loading

Kishaan commented May 10, 2022 •

edited

Loading