How to make output consistent over video? #6

Open
JonathonLuiten opened this issue Oct 30, 2024 · 7 comments

Comments

JonathonLuiten commented Oct 30, 2024

The website shows video results where the scale / shift are consistent across the video.

If I just run the method per frame, it is very obvious that these are not consistent.

Is there code to make it consistent?

EasternJournalist (Collaborator) commented Oct 31, 2024

Thank you for your interest in our project. We are currently in the process of cleaning up the code and will release it once it's ready. Below is a brief description of our implementation:

  • We maintain a list of point maps registered in world space.
  • For each frame in the video sequence (a rough sketch of this loop is given after the list):
    • Estimate the camera-space point map of the current frame using MoGe.
    • Select a few previous frames as references and compute dense image matching between the current frame and the reference frames using PDCNet. (Other robust image matching methods can also be used.)
    • [KEY] Solve for the point-set rigid transformation (1-DoF scale, 3-DoF rotation, and 3-DoF translation) based on the matching results, using RANSAC to remove outliers. A code snippet for this part is shared in #6 (comment) below.
    • Apply the computed rigid transformation to the current frame's point map and append it to the list of registered point maps.
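
For concreteness, here is a minimal sketch of that per-frame loop. The `estimate_camera_space_points` and `match_frames` helpers are placeholders standing in for MoGe inference and PDCNet (or another dense matcher); they and the loop structure are illustrative assumptions rather than our exact released code, and `rigid_registration_ransac` refers to the snippet shared later in this thread.

```python
import numpy as np

def estimate_camera_space_points(frame):
    """Placeholder for MoGe inference: returns an (H, W, 3) camera-space point map."""
    raise NotImplementedError

def match_frames(ref_frame, cur_frame):
    """Placeholder for dense matching (e.g. PDCNet): returns two (N, 2) arrays of
    matched (x, y) pixel coordinates in the reference and current frames."""
    raise NotImplementedError

def register_video(frames, num_refs=3):
    registered = []  # list of (frame index, world-space point map)
    for i, frame in enumerate(frames):
        points = estimate_camera_space_points(frame)       # camera-space point map of current frame
        if not registered:
            registered.append((i, points))                 # the first frame defines world space
            continue
        p_list, q_list = [], []
        for j, ref_points in registered[-num_refs:]:       # a few previous frames as references
            uv_ref, uv_cur = match_frames(frames[j], frame)
            p_list.append(points[uv_cur[:, 1], uv_cur[:, 0]])      # current-frame camera-space points
            q_list.append(ref_points[uv_ref[:, 1], uv_ref[:, 0]])  # matched world-space points
        p, q = np.concatenate(p_list), np.concatenate(q_list)
        w = 1.0 / np.maximum(p[:, 2], 1e-6)                # weights inversely proportional to depth
        (s, R, t), _ = rigid_registration_ransac(p, q, w)  # similarity registration with RANSAC (see below)
        registered.append((i, s * points @ R.T + t))       # bring the point map into world space
    return registered
```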

Please note that achieving video consistency is NOT one of the primary objectives of MoGe. This application only demonstrates the potential for video reconstruction through a simplified implementation that considers rigid registration alone. To enhance consistency, additional optimization techniques would be necessary.

See also a related issue (multi-view reconstruction application).

@JonathonLuiten (Author)

Thanks for your answer!

This makes a lot of sense!

One follow-up question: when you say 'rigid transformation', I assume you mean rotation + translation. Does that mean we assume that the 'scale' that comes out of the MoGe network is automatically globally consistent? Is there a reason it should be consistent?

EasternJournalist (Collaborator) commented Nov 1, 2024

> Thanks for your answer!
>
> This makes a lot of sense!
>
> One follow-up question: when you say 'rigid transformation', I assume you mean rotation + translation. Does that mean we assume that the 'scale' that comes out of the MoGe network is automatically globally consistent? Is there a reason it should be consistent?

The rigid transformation here includes scale, rotation, and translation. The raw output scale of MoGe is unconstrained and not consistent across video frames, since the model is trained to be scale-invariant on single images.

Our implementation of RANSAC rigid (similarity) registration is quite simple.

  • $p_i$: camera-space point in the current frame;
  • $q_i$: matched world-space point from the reference frames;
  • $w_i$: weight of the $i$-th correspondence, inversely proportional to its depth.

The following code snippet solves for the transformation $(s, \mathbf{R}, \mathbf{t})$ that minimizes the weighted objective below, given two sets of 3D points $\{p_i\}_{i=1}^N, \{q_i\}_{i=1}^N$ and weights $\{w_i\}_{i=1}^N$:

$$ \min_{s,\mathbf{R},\mathbf{t}} \sum_{i=1}^N w_i \left\Vert s\,\mathbf{R}\,\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i \right\Vert_2^2 $$

```python
import numpy as np
from typing import Optional, Tuple


def rigid_registration(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    eps: float = 1e-12
) -> Tuple[float, np.ndarray, np.ndarray]:
    """Weighted similarity registration (scale, rotation, translation) mapping points p onto q."""
    if w is None:
        w = np.ones(p.shape[0])

    # Weighted centroids
    centroid_p = np.average(p, axis=0, weights=w)
    centroid_q = np.average(q, axis=0, weights=w)

    p_centered = p - centroid_p
    q_centered = q - centroid_q
    w = w / (np.sum(w) + eps)

    # Weighted cross-covariance and its SVD (Kabsch/Umeyama)
    cov = (w[:, None] * p_centered).T @ q_centered
    U, S, Vh = np.linalg.svd(cov)
    R = Vh.T @ U.T
    if np.linalg.det(R) < 0:
        # Avoid a reflection: flip the axis of the smallest singular value,
        # and flip its singular value so the scale estimate stays consistent
        Vh[2, :] *= -1
        S[2] *= -1
        R = Vh.T @ U.T
    scale = np.sum(S) / np.trace((w[:, None] * p_centered).T @ p_centered)
    t = centroid_q - scale * (centroid_p @ R.T)
    return scale, R, t


def rigid_registration_ransac(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    max_iters: int = 20,
    hypothetical_size: int = 10,
    inlier_thresh: float = 0.02
) -> Tuple[Tuple[float, np.ndarray, np.ndarray], np.ndarray]:
    """RANSAC wrapper around rigid_registration; returns ((scale, R, t), inlier_mask)."""
    n = p.shape[0]
    if w is None:
        w = np.ones(n)

    best_score, best_inliers = 0., np.zeros(n, dtype=bool)
    best_solution = (np.array(1.), np.eye(3), np.zeros(3))

    for _ in range(max_iters):
        # Fit a hypothesis from a small random sample
        maybe_inliers = np.random.choice(n, size=hypothetical_size, replace=False)
        try:
            s, R, t = rigid_registration(p[maybe_inliers], q[maybe_inliers], w[maybe_inliers])
        except np.linalg.LinAlgError:
            continue

        # Score the hypothesis with a truncated weighted residual
        transformed_p = s * p @ R.T + t
        errors = w * np.linalg.norm(transformed_p - q, axis=1)
        inliers = errors < inlier_thresh
        score = inlier_thresh * n - np.clip(errors, None, inlier_thresh).sum()

        if score > best_score:
            # Refit on all inliers of the best hypothesis so far
            best_score, best_inliers = score, inliers
            best_solution = rigid_registration(p[inliers], q[inliers], w[inliers])

    return best_solution, best_inliers
```
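
As a quick sanity check, the function can be fed synthetic correspondences; the random data, identity rotation, and the 1/depth weighting below are just illustrative assumptions:

```python
import numpy as np

# p: (N, 3) camera-space points of the current frame at matched pixels
# q: (N, 3) corresponding world-space points from the reference frames
rng = np.random.default_rng(0)
p = rng.uniform(0.5, 2.0, size=(500, 3))
q = 1.5 * p + np.array([0.1, -0.2, 0.3])         # synthetic ground truth: s = 1.5, R = I
q[::10] += rng.uniform(0.5, 1.0, size=(50, 3))   # inject ~10% outliers

w = 1.0 / np.maximum(p[:, 2], 1e-6)              # weights inversely proportional to depth
(s, R, t), inliers = rigid_registration_ransac(p, q, w, max_iters=50)
print(s, t, inliers.sum())                       # expect s ≈ 1.5, t ≈ [0.1, -0.2, 0.3], ~450 inliers
```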

@guangkaixu

@EasternJournalist Hi, thanks for your great contribution! You mentioned that the scale, rotation, and translation can be solved based on the matching results. Is there any recommended library or GitHub repo that implements the RANSAC algorithm? By the way, as far as I know, traditional methods seldom optimize the "depth scale" of monocular depth; how can I make it optimizable in an existing code base? Thanks so much, and I look forward to your reply!

EasternJournalist (Collaborator) commented Nov 7, 2024

@guangkaixu Hi. The previous comment has been updated; the RANSAC code is now shared in the snippet above.

EasternJournalist pinned this issue Nov 7, 2024
@guangkaixu

Thanks! That will be helpful and I'll give it a try.

By the way, after I have computed depth, camera intrinsics, and poses, is there an appropriate method to perform RGB-D fusion? I tried TSDF fusion, but the huge GPU memory requirement and the holes in the fused mesh are less than satisfactory.

@EasternJournalist (Collaborator)

> Thanks! That will be helpful and I'll give it a try.
>
> By the way, after I have computed depth, camera intrinsics, and poses, is there an appropriate method to perform RGB-D fusion? I tried TSDF fusion, but the huge GPU memory requirement and the holes in the fused mesh are less than satisfactory.

That is quite tricky. RGB-D fusion is apparently not applicable to large-scale SLAM, especially with dynamic scenes. I am working on finding a convenient alternative too.
