How to make output consistent over video? #6
Comments
Thank you for your interest in our project. We are currently cleaning up the code and will release it once it is ready. Below is a brief description of our implementation:
Please note that achieving video consistency is NOT one of the primary objectives of MoGe. This application demonstrates the potential for video reconstruction through a simplified implementation that considers only rigid registration. To enhance consistency, additional optimization techniques would be necessary. See also a related issue (multiview reconstruction application).
Thanks for your answer! That makes a lot of sense! One follow-up question: when you say 'rigid transformation', I assume you mean rotation + translation. Does that mean we assume the scale that comes out of the MoGe network is automatically globally consistent? Is there a reason it should be consistent?
The rigid transformation here includes scale, rotation and translation. The raw output scale of MoGe is unconstrained and not consistent across video frames, since it has been trained to be scale-invariant for single images. Our implementation for RANSAC rigid (similarity) registration is quite simple.
The following code snippet solves the transformation (s, R, t) given two sets of 3D points:

```python
import numpy as np
from typing import Optional, Tuple


def weighted_mean(x: np.ndarray, w: np.ndarray, axis: int = 0, eps: float = 1e-12) -> np.ndarray:
    # Weighted mean along an axis (stand-in for the weighted_mean_numpy helper from our utilities).
    return np.sum(w * x, axis=axis) / (np.sum(w, axis=axis) + eps)


def rigid_registration(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    eps: float = 1e-12
) -> Tuple[float, np.ndarray, np.ndarray]:
    # Weighted similarity registration (scale, rotation, translation) mapping p onto q,
    # i.e. q ≈ scale * p @ R.T + t, solved in closed form via SVD (Umeyama's method).
    if w is None:
        w = np.ones(p.shape[0])
    centroid_p = weighted_mean(p, w[:, None], axis=0)
    centroid_q = weighted_mean(q, w[:, None], axis=0)
    p_centered = p - centroid_p
    q_centered = q - centroid_q
    w = w / (np.sum(w) + eps)
    cov = (w[:, None] * p_centered).T @ q_centered
    U, S, Vh = np.linalg.svd(cov)
    R = Vh.T @ U.T
    if np.linalg.det(R) < 0:
        # Resolve an improper rotation (reflection) by flipping the last singular vector and value.
        Vh[2, :] *= -1
        S[2] *= -1
        R = Vh.T @ U.T
    scale = np.sum(S) / np.trace((w[:, None] * p_centered).T @ p_centered)
    t = centroid_q - scale * (centroid_p @ R.T)
    return scale, R, t


def rigid_registration_ransac(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    max_iters: int = 20,
    hypothetical_size: int = 10,
    inlier_thresh: float = 0.02
) -> Tuple[Tuple[float, np.ndarray, np.ndarray], np.ndarray]:
    # RANSAC wrapper: fit on random minimal subsets, score hypotheses with truncated residuals,
    # and refit on the inlier set of the best hypothesis.
    n = p.shape[0]
    if w is None:
        w = np.ones(n)
    best_score, best_inliers = 0., np.zeros(n, dtype=bool)
    best_solution = (1., np.eye(3), np.zeros(3))
    for _ in range(max_iters):
        maybe_inliers = np.random.choice(n, size=hypothetical_size, replace=False)
        try:
            s, R, t = rigid_registration(p[maybe_inliers], q[maybe_inliers], w[maybe_inliers])
        except np.linalg.LinAlgError:
            continue
        transformed_p = s * p @ R.T + t
        errors = w * np.linalg.norm(transformed_p - q, axis=1)
        inliers = errors < inlier_thresh
        score = inlier_thresh * n - np.clip(errors, None, inlier_thresh).sum()
        if score > best_score:
            best_score, best_inliers = score, inliers
            best_solution = rigid_registration(p[inliers], q[inliers], w[inliers])
    return best_solution, best_inliers
```
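For reference, here is a minimal usage sketch (the input names are illustrative, not part of MoGe's API): it assumes `points_prev` and `points_curr` are corresponding 3D points sampled at the same pixel locations from the point maps of two consecutive frames, with invalid pixels already filtered out.

```python
import numpy as np

# Hypothetical inputs: matched (N, 3) point sets from two consecutive frames' point maps.
points_prev = np.random.rand(1000, 3)          # frame t
points_curr = np.random.rand(1000, 3)          # frame t+1
weights = np.ones(len(points_prev))            # optional per-point weights (e.g. confidence)

(scale, R, t), inliers = rigid_registration_ransac(points_prev, points_curr, weights)

# Bring frame t's points into frame t+1's coordinates (and scale).
aligned_prev = scale * points_prev @ R.T + t
print('inlier ratio:', inliers.mean())
```

Chaining such per-pair transforms frame by frame gives a roughly consistent scale and pose across the video, in the spirit of the simplified implementation described above.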
@EasternJournalist Hi, thanks for your great contribution! You mentioned that the scale, rotation, and translation can be solved based on matching results. Is there any recommended library or GitHub repo that implements the RANSAC algorithm? By the way, as far as I know, traditional methods seldom optimize the "depth scale" of monocular depth; how can I make it optimizable in an existing code base? Thanks so much for your prompt reply!
@guangkaixu Hi. The previous comment has been updated. The code for RANSAC is now shared in the code snippet.
Thanks! That will be helpful and I'll give it a try. By the way, after I have computed depth, camera intrinsics, and poses, is there an appropriate method to perform RGB-D fusion? I tried tsdf-fusion, but its huge GPU memory requirement and the holes in the fused mesh are less than satisfactory.
That is quite tricky. RGB-D fusion is apparently not applicable for large-scale SLAM, especially with dynamic scenes. I am working on finding a convenient alternative too.
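If the GPU memory requirement of tsdf-fusion is the main obstacle, one CPU-based option is Open3D's ScalableTSDFVolume. Below is a rough sketch under assumed inputs (per-frame color and depth images, an intrinsic matrix `K`, and 4x4 world-to-camera extrinsics, e.g. derived from the registration above; `frames` and all parameter values are illustrative). It does not address the mesh holes or dynamic scenes discussed here.

```python
import numpy as np
import open3d as o3d

# Rough sketch of CPU-based TSDF fusion with Open3D; parameter values are illustrative.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.02,   # voxel size in scene units
    sdf_trunc=0.08,      # truncation distance
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

for color, depth, K, T_world_to_cam in frames:   # hypothetical per-frame data
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color),                        # uint8, H x W x 3
        o3d.geometry.Image(depth.astype(np.float32)),     # metric depth, H x W
        depth_scale=1.0, depth_trunc=10.0, convert_rgb_to_intensity=False)
    intrinsic = o3d.camera.PinholeCameraIntrinsic(
        depth.shape[1], depth.shape[0], K[0, 0], K[1, 1], K[0, 2], K[1, 2])
    volume.integrate(rgbd, intrinsic, T_world_to_cam)     # 4x4 world-to-camera matrix

mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh('fused_mesh.ply', mesh)
```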
The website shows video results where the scale / shift is consistent across the video.
If I just run the method per frame, it is very obvious that the results are not consistent.
Is there code to make them consistent?