CVPR 2019 TL;DR

A subset of CVPR 2019 papers worth having a look at. Didn't have time to read them or the tweets about them? Here you'll find a TL;DR version of a subset (~30) of the papers, assuming you know the state of the art for the problem each paper investigates.

DONE: 3/32

Visual tracking as similarity search. Extract features from a query bounding box (q-bbox, e.g. of a car) and compute a distance metric over all spatial locations in a target image. Penalise all locations where the q-bbox and the target bbox (t-bbox) don't match. For this you would need an annotation of the object being tracked; you could use a ranking objective, or produce regression targets and compute an IoU loss. But why use annotation when you can play the video forwards and backwards? A good tracker should work both ways: given a query q-bbox, find the best t-bbox in another video frame; now use that t-bbox as the query, and a good tracker should recover the original q-bbox. If it doesn't, penalise with an IoU loss.
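A minimal sketch of the two ideas above, assuming PyTorch; `similarity_map`, `track_fwd`, `track_bwd`, and the box format are illustrative names, not the paper's API:

```python
# Sketch: tracking as dense similarity search, plus forward-backward
# ("play the video both ways") self-supervision. Illustrative only.
import torch
import torch.nn.functional as F

def similarity_map(query_feat, target_feat):
    """Slide the query features over the target feature map.

    query_feat:  (C, h, w)  features cropped from the q-bbox
    target_feat: (C, H, W)  features of the whole target frame
    returns:     (H-h+1, W-w+1) response map, high where they match
    """
    # conv2d with the query as the kernel == similarity at every location
    return F.conv2d(target_feat.unsqueeze(0),
                    query_feat.unsqueeze(0)).squeeze()

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes given as plain floats."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def cycle_consistency_loss(track_fwd, track_bwd, q_bbox):
    """track_fwd / track_bwd are callables: bbox -> bbox in the other frame.

    A good tracker should land back on q_bbox; penalise 1 - IoU otherwise.
    (To actually backpropagate, the boxes would need a differentiable
    representation rather than an argmax over the response map.)
    """
    t_bbox = track_fwd(q_bbox)   # query frame -> target frame
    q_hat = track_bwd(t_bbox)    # target frame -> back to query frame
    return 1.0 - iou(q_bbox, q_hat)
```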

Yes: models which do better on a base task (Task-1) transfer nicely to another (Task-2). Example: train MobileNetV2 on ILSVRC2012 (Task-1) and fine-tune on MS COCO for semantic segmentation (Task-2). Segmentation would do better if you used a stronger base model, e.g. ResNet101, which has higher accuracy on ILSVRC2012 (a minimal fine-tuning sketch follows the list below). Where does this break down?

  • Task-2 is not a natural extension of Task-1, e.g. image aesthetics.
  • The dataset for Task-2 is as large as the one for Task-1.
  • Regularisation applied for Task-1 (weight decay, weight norm) harms Task-2.

Why does this happen? No clear answer: the authors speculate that large, over-parameterised networks are better at finding the plateau which is suitable for Task-2.
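A minimal sketch of the Task-1 to Task-2 recipe, assuming torchvision's MobileNetV2. The classification head swap and learning rates are illustrative; the paper's Task-2 (semantic segmentation) would need a dense-prediction head instead:

```python
# Sketch: take an ImageNet-pretrained backbone and fine-tune on Task-2.
import torch
import torch.nn as nn
from torchvision import models

NUM_TASK2_CLASSES = 21  # illustrative label count for the new task

backbone = models.mobilenet_v2(pretrained=True)  # Task-1: ILSVRC2012
# Replace the 1000-way ImageNet head with a Task-2 head.
backbone.classifier[-1] = nn.Linear(
    backbone.classifier[-1].in_features, NUM_TASK2_CLASSES)

# Fine-tune everything, but give the pretrained weights a smaller
# learning rate than the freshly initialised head.
params = [
    {"params": backbone.features.parameters(), "lr": 1e-4},
    {"params": backbone.classifier.parameters(), "lr": 1e-3},
]
optimizer = torch.optim.SGD(params, momentum=0.9, weight_decay=1e-4)
```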

Compressing 3D point clouds while preserving downstream task accuracy, using a model which processes 3D point clouds. First convert each 3D point to a feature vector using 1x1 convolutions, run global max-pooling feature-wise, add a few dense layers, and generate a sampled 3D point cloud. In a nutshell: a 3D input of shape Nx3 goes through a PointNet-style network to an output of shape Kx3. The generated point cloud is not guaranteed to be a subset of the input, so do a nearest-neighbour matching with L2 distance. The matching is only applied at inference time, as the final step; during training, the generated points are processed by the task network as-is, since the matching is not differentiable and cannot propagate the task loss. What if you want the sampling size K to be part of the network? Train a network whose input and output are both Nx3, except that the output points are ordered according to their importance in minimising the downstream task. YOLO.
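A minimal sketch of the sampler described above, assuming PyTorch; the layer widths and the names `PointSampler`/`match_to_input` are illustrative, not the paper's exact architecture:

```python
# Sketch: PointNet-style sampler that emits K "generated" points,
# snapped to real input points only at inference time.
import torch
import torch.nn as nn

class PointSampler(nn.Module):
    def __init__(self, k=64):
        super().__init__()
        self.k = k
        # 1x1 convolutions == a shared MLP applied to every 3D point
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        # dense layers on the pooled global feature -> K x 3 points
        self.head = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 3 * k),
        )

    def forward(self, points):                       # points: (B, N, 3)
        x = self.point_mlp(points.transpose(1, 2))   # (B, 128, N)
        x = x.max(dim=2).values                      # feature-wise max-pool
        return self.head(x).view(-1, self.k, 3)      # (B, K, 3)

def match_to_input(generated, points):
    """Inference-only step: snap each generated point to its nearest
    input point (L2), so the output is a true subset of the input."""
    d = torch.cdist(generated, points)   # (B, K, N) pairwise distances
    idx = d.argmin(dim=2)                # nearest input point per output
    return torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
```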

Naive triplet: sample [+, anchor, -] == a triplet. Pull the + and the anchor inside a sphere of diameter alpha (the margin); push the -ives outside a sphere of radius alpha. This fails when all your triplets become easy to separate: the loss is almost zero, and the weight gradient averaged over a mini-batch doesn't move the parameters of the model. A smarter idea (N-pair-mc): build a better mini-batch. Take pairs from N different classes and build triplets on the fly, using positives from other classes as negatives. This can be further improved by, instead of pushing all -ives away, pushing away a single point which represents them all, heuristically computed as the point closest to the + sample (proxy-NCA). Use the anchor as positive and the positive as anchor and you get Lifted Struct. Main idea: in a batch you have multiple samples of each class. Given a + query, find all margin violations among samples of the same class and among the -ive set. Compute a standard triplet loss for all +ive violations and a weighted loss for all -ive violations, weighted by the margin of violation (to get hard negatives). Voilà, you're done!
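A minimal sketch of the in-batch mining idea, assuming L2-normalised embeddings; the exact masks and weighting below are illustrative, not the paper's formula:

```python
# Sketch: per-batch mining of margin violations, with negatives
# weighted by how badly they violate (hard negatives count more).
import torch

def mined_batch_loss(emb, labels, margin=0.2):
    """emb: (B, D) L2-normalised embeddings, labels: (B,) class ids."""
    sim = emb @ emb.t()                                # (B, B) cosine sims
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos_mask, neg_mask = same & ~eye, ~same

    # positive violations: same class but similarity below (1 - margin)
    pos_viol = torch.relu((1 - margin) - sim) * pos_mask
    # negative violations: different class but similarity above margin
    neg_viol = torch.relu(sim - margin) * neg_mask

    # plain average over positive violations; violation-weighted average
    # over negatives, so harder negatives contribute more
    pos_loss = pos_viol.sum() / pos_mask.sum().clamp(min=1)
    neg_w = neg_viol.detach()            # weight = margin of violation
    neg_loss = (neg_w * neg_viol).sum() / neg_mask.sum().clamp(min=1)
    return pos_loss + neg_loss
```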