Evaluation differences compared to prior work #11

Open
jefftan969 opened this issue Nov 4, 2024 · 2 comments

Comments

@jefftan969

Thanks for your great work, the results are amazing!

Just curious why the evaluation tables in MoGe often have different baseline numbers than the numbers reported in the original papers?

Here are some examples:

DUSt3R comparisons on NYUv2
In MoGe paper, Table 2 (scale-invariant pointmap): Rel 5.56, Delta1 97.1
In MoGe paper, Table 2 (affine-invariant pointmap): Rel 4.49, Delta1 97.4
In MoGe paper, Table 3 (scale-invariant depth): Rel 4.43, Delta1 97.1
In DUSt3R paper, Table 2 (depth): Rel 6.50, Delta 94.09

DUSt3R comparisons on KITTI
In MoGe paper, Table 2 (scale-invariant pointmap): Rel 21.9, Delta1 63.6
In MoGe paper, Table 2 (affine-invariant pointmap): Rel 18.0, Delta1 66.7
In MoGe paper, Table 3 (scale-invariant depth): Rel 7.71, Delta1 90.9
In DUSt3R paper, Table 2 (depth 512): Rel 10.74, Delta 86.60

Marigold comparisons on NYUv2
In MoGe paper, Table 3 (affine-invariant depth): Rel 4.63, Delta1 97.3
In Marigold paper, Table 1 (depth w/ ensemble): Rel 5.5, Delta1 96.4

Marigold comparisons on KITTI
In MoGe paper, Table 3 (affine-invariant depth): Rel 7.29, Delta1 93.8
In Marigold paper, Table 1 (depth w/ ensemble): Rel 9.9, Delta1 91.6

Marigold comparisons on ETH3D
In MoGe paper, Table 3 (affine-invariant depth): Rel 6.08, Delta1 96.3
In Marigold paper, Table 1 (depth w/ ensemble): Rel 6.5, Delta1 96.0

Marigold comparisons on DIODE
In MoGe paper, Table 3 (affine-invariant depth): Rel 6.34, Delta1 94.3
In Marigold paper, Table 1 (depth w/ ensemble): Rel 30.8, Delta1 77.3

DepthAnythingV2 comparisons on Sintel
In MoGe paper, Table 3 (affine-invariant disparity): Rel 21.4, Delta1 72.8
In DepthAnythingV2 paper, Table 5 (take their best result): Rel 48.7, Delta1 75.2

@EasternJournalist (Collaborator) commented Nov 5, 2024

Hi. Thanks for your interest and for this valuable question. The evaluation results differ because the datasets are processed differently from those works: we meticulously processed the raw evaluation datasets to ensure the reliability of the ground-truth data (e.g., removing inaccurate regions and cropping).

To maintain a fair comparison, we re-evaluated all baselines in this paper with the same processed data, under their default/recommended settings, rather than simply adopting the performance metrics reported in their original papers.
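
For reference, the AbsRel and Delta1 numbers quoted throughout this thread are computed in the standard way, but only over the pixels that survive our preprocessing. A minimal sketch (illustrative only, not the exact evaluation code we will release):

```python
# Minimal sketch of the metrics discussed in this thread (illustrative only,
# not the exact evaluation code used for the paper's tables).
import numpy as np

def absrel_delta1(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray):
    """AbsRel and delta1, computed only over pixels kept after preprocessing."""
    p, g = pred[valid], gt[valid]
    absrel = np.mean(np.abs(p - g) / g)                 # mean relative error
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)   # fraction within 1.25x
    return absrel, delta1
```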

Each dataset underwent specific processing to ensure a reliable evaluation. Full details are provided in Section B.2 of our supplementary material https://arxiv.org/pdf/2410.19115.

For instance, we omit sky regions in the Sintel dataset because sky depth is not quantifiable: the "ground-truth" depth of the sky could be any large value, such as 34, 50, or even beyond 2000. Evaluating models with sky depth included is therefore not meaningful, which might be why you observe the 48.7% AbsRel from DepthAnythingV2 (an incredibly large relative error!) while their 75.2% Delta1 looks normal. In some test cases, their predicted depths align with the sky and disregard the actual foreground objects.
[image attached]
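
Concretely, excluding the sky amounts to restricting the validity mask to finite, bounded ground-truth depths before computing the metrics. A rough sketch (the 100 m cutoff is an assumption for illustration, not the exact rule in our pipeline):

```python
# Illustrative sky exclusion for Sintel-style ground truth. The max_depth
# cutoff is a placeholder assumption, not the exact value used in the paper.
import numpy as np

def sky_free_mask(gt_depth: np.ndarray, max_depth: float = 100.0) -> np.ndarray:
    """Keep pixels whose ground-truth depth is positive, finite, and bounded."""
    return np.isfinite(gt_depth) & (gt_depth > 0) & (gt_depth < max_depth)
```

Such a mask would then be passed as the validity mask when computing AbsRel and Delta1, as in the sketch above.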

Another example is the DIODE dataset, where we remove boundary artifacts. This practice is also adopted by some previous works (e.g., UniDepth) to prevent significant bias from ground-truth artifacts. In contrast, Marigold may not have applied such preprocessing, which could explain why their reported 30.8% AbsRel is so high and potentially misleading.
[image attached]
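
For intuition, boundary cleanup of this kind can be sketched as dropping pixels at large relative depth jumps and eroding the valid mask; the thresholds below are placeholders, not the exact procedure described in Section B.2:

```python
# Simplified boundary-artifact removal (illustrative; thresholds are placeholders).
# Pixels at large relative depth discontinuities are dropped, and the mask is
# eroded so that thin artifact bands along object edges are excluded.
import numpy as np
from scipy.ndimage import binary_erosion

def remove_boundary_artifacts(gt_depth: np.ndarray, valid: np.ndarray,
                              rel_jump: float = 0.1, erode_iters: int = 2) -> np.ndarray:
    dy = np.abs(np.diff(gt_depth, axis=0, prepend=gt_depth[:1, :]))
    dx = np.abs(np.diff(gt_depth, axis=1, prepend=gt_depth[:, :1]))
    smooth = (dy < rel_jump * gt_depth) & (dx < rel_jump * gt_depth)
    return binary_erosion(valid & smooth, iterations=erode_iters)
```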

We are committed to ensuring that our evaluation process is both reproducible and transparent. To that end, we plan to release our code for evaluation. Please stay tuned for further announcements regarding its availability.

@EasternJournalist (Collaborator)

Apart from the differences in data processing, it is also crucial to consider the evaluation configuration: 'metric depth', 'scale-invariant depth', 'affine-invariant depth', and 'affine-invariant disparity (inverse depth)'. These terms refer to different ways of aligning the predicted depth with the ground truth; the key distinction is whether, and how, the scale and offset are calibrated against the ground truth.

For example, 'scale-invariant' evaluation adjusts for a uniform scaling of the predicted depth, while 'affine-invariant' evaluation compensates for both scaling and shifting. Each configuration can yield significantly different results, so reported numbers are only directly comparable when they are computed under the same evaluation protocol; ensuring this consistency is essential for fair and meaningful comparisons across methods.
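
As a concrete illustration of these alignment variants, here is a least-squares sketch under our own simplifying assumptions (the solver and robustification used for the paper's tables may differ):

```python
# Least-squares alignment sketches for the configurations named above
# (illustrative only, not the exact alignment used in the paper's tables).
import numpy as np

def align_scale(pred, gt, valid):
    """Scale-invariant: solve min_s || s * pred - gt ||^2 over valid pixels."""
    p, g = pred[valid], gt[valid]
    s = np.dot(p, g) / np.dot(p, p)
    return s * pred

def align_scale_shift(pred, gt, valid):
    """Affine-invariant: solve min_{s,t} || s * pred + t - gt ||^2."""
    p, g = pred[valid], gt[valid]
    A = np.stack([p, np.ones_like(p)], axis=1)
    s, t = np.linalg.lstsq(A, g, rcond=None)[0]
    return s * pred + t

def align_disparity(pred_disp, gt_depth, valid, eps=1e-6):
    """Affine-invariant disparity: align in inverse-depth space, then invert."""
    gt_disp = 1.0 / np.maximum(gt_depth, eps)
    aligned = align_scale_shift(pred_disp, gt_disp, valid)
    return 1.0 / np.maximum(aligned, eps)
```

This is part of why the Table 2 and Table 3 numbers quoted above can differ for the same method on the same dataset: they are computed under different alignment protocols.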
