
'assert (boxes1[:, 2:] >= boxes1[:, :2]).all()' happened when training #89

Open
Swall0w opened this issue May 25, 2021 · 4 comments

Swall0w commented May 25, 2021

Thanks for your great work!!
When I enabled AMP training in detectron2, I ran into an issue with the boxes during training.

Changes

The only difference from the original config is:

SOLVER:
    STEPS: (210000, 250000)
    MAX_ITER: 270000
    AMP:
        ENABLED: true

Error

[05/24 20:54:12 d2.engine.hooks]: Total training time: 0:00:10 (0:00:00 on hooks)
[05/24 20:54:12 d2.utils.events]:  iter: 0    lr: N/A  max_mem: 5095M
Traceback (most recent call last):
  File "train_net.py", line 134, in <module>
    launch(
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 55, in launch
    mp.spawn(
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/train_net.py", line 128, in main
    return trainer.train()
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 332, in run_step
    loss_dict = self.model(data)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py", line 143, in forward
    loss_dict = self.criterion(output, targets)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 147, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 266, in forward
    cost_giou = -generalized_box_iou(out_bbox, tgt_bbox)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/util/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Decreasing the learning rate doesn't help, and this error occurs only with mixed-precision training.
Are there any suggestions for solving this problem?
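For reference, a quick check to confirm whether AMP is producing non-finite or flipped coordinates before they reach `generalized_box_iou` (a minimal sketch; `check_boxes` is a hypothetical helper, not part of the repo):

```python
import torch

def check_boxes(boxes, name="boxes"):
    """Count boxes that would trip the assert in generalized_box_iou.

    boxes: (N, 4) tensor in (x1, y1, x2, y2) format. Under fp16/AMP,
    overflow can produce inf/NaN coordinates, and a NaN coordinate makes
    the `boxes[:, 2:] >= boxes[:, :2]` comparison False.
    """
    nonfinite = (~torch.isfinite(boxes)).any(dim=1)
    flipped = (boxes[:, 2:] < boxes[:, :2]).any(dim=1)
    print(f"{name}: {nonfinite.sum().item()} non-finite, "
          f"{flipped.sum().item()} flipped (x2 < x1 or y2 < y1)")
    return nonfinite.sum().item(), flipped.sum().item()

# Toy example: one valid box, one NaN box, one flipped box.
b = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                  [0.0, 0.0, float("nan"), 10.0],
                  [5.0, 5.0, 2.0, 8.0]])
check_boxes(b, "out_bbox")
```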

Thank you.

PeizeSun (Owner) commented
Hi~
Can you try removing giou, in both the matching cost and the loss, to see whether this error still occurs?
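For context: in DETR-style matchers, which SparseR-CNN's loss.py follows, the matching cost is a weighted sum of class, L1-box, and giou terms, so removing giou amounts to dropping one term from that sum (plus the corresponding entry in the loss dict). A minimal sketch, with hypothetical weights and function name:

```python
import torch

def combined_cost(cost_class, cost_bbox, cost_giou,
                  w_class=2.0, w_bbox=5.0, w_giou=2.0, use_giou=True):
    # DETR-style weighted sum of per-pair cost matrices; set
    # use_giou=False to test matching without the giou term.
    C = w_class * cost_class + w_bbox * cost_bbox
    if use_giou:
        C = C + w_giou * cost_giou
    return C

# Toy 4x3 cost matrices (4 predictions, 3 ground-truth boxes):
cc, cb, cg = torch.rand(4, 3), torch.rand(4, 3), torch.rand(4, 3)
C = combined_cost(cc, cb, cg, use_giou=False)  # giou term dropped
```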

Swall0w (Author) commented Jun 3, 2021

@PeizeSun
Thank you for your suggestion.
After commenting out the giou terms, I got a new error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/train_net.py", line 128, in main
    return trainer.train()
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 332, in run_step
    loss_dict = self.model(data)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py", line 143, in forward
    loss_dict = self.criterion(output, targets)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 147, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 274, in forward
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 274, in <listcomp>
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/scipy/optimize/_lsap.py", line 101, in linear_sum_assignment
    a, b = _lsap_module.calculate_assignment(cost_matrix.T)
ValueError: matrix contains invalid numeric entries

PeizeSun (Owner) commented Jun 5, 2021

Can you print out cost_matrix to see which entry is invalid?
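A minimal sketch of such a check (the `report_invalid` helper is hypothetical, not part of the repo):

```python
import torch

def report_invalid(C, name="cost_matrix"):
    # scipy's linear_sum_assignment raises "matrix contains invalid
    # numeric entries" for any NaN/inf, so list their positions.
    idx = (~torch.isfinite(C)).nonzero(as_tuple=False)
    print(f"{name}: {idx.shape[0]} invalid entries at {idx.tolist()}")
    return idx

# Toy cost matrix with one inf and one NaN entry:
C = torch.tensor([[0.1, float("inf")],
                  [float("nan"), 0.3]])
report_invalid(C)
```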

shivamsnaik commented

I am getting the same issue when I run Sparse R-CNN with a learning rate of 0.02 on 8 GPUs. Did you find a solution to this problem?
@Swall0w, it would be of great help if you did.
