-
Your request is very reasonable. I don't know the answer to this, so here are some points which come to mind:
Here is the original paper: but it sounds like you already tried to research this yourself.
You should also look into the other depth estimators (ZoeDepth, LeReS, and so on); I think they already offer higher resolution.
-
Apparently there is a bug in the ControlNet extension right now: Fix 16-bit grayscale control image conversion
-
The depth map image is displayed as a grayscale image, with R, G, and B all being equal. I assume that's the representation used to control the image generation. If so, that means there are only 256 depth values. I was wondering if it might be possible to use the concatenated RGB values to allow for up to 24 bits of precision. I realize there are a number of steps in the process that might prevent this; however, if it were possible to increase the depth precision it would greatly benefit the common situation of a person in the foreground with objects in the distance.
EDIT: I tried to find out whether the MiDaS depth estimator (which is what I believe is used) would support returning a higher-precision result, but didn't understand what I read well enough to answer that question. I did find that it actually returns the reciprocal of the depth (inverse depth), which is somewhat interesting.
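To make the grayscale conversion concrete, here's a minimal sketch of how an inverse-depth prediction typically ends up as an 8-bit image. This is my own illustration, not the extension's actual code; I'm assuming the usual per-image min-max normalization applied to MiDaS-style output, and the function name is mine:

```python
import numpy as np

def inverse_depth_to_uint8(inv_depth):
    """Normalize a float inverse-depth map to [0, 255] and quantize.

    MiDaS-style predictions are inverse depth (larger = closer), and
    the common post-processing min-max normalizes each image before
    writing it out as an 8-bit grayscale picture. That final astype
    is where all precision beyond 1 part in 256 is discarded.
    """
    inv_depth = np.asarray(inv_depth, dtype=np.float64)
    lo, hi = inv_depth.min(), inv_depth.max()
    scaled = (inv_depth - lo) / max(hi - lo, 1e-12)  # map to 0..1
    return np.round(scaled * 255).astype(np.uint8)
```

Whatever float precision the estimator produces internally, after this step only 256 distinct depth levels remain in the saved image.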
EDIT: Since I've gotten no responses to my idea, even to say (rightly or wrongly) it's inane, I thought I'd at least explain my line of thinking.
I don't know a lot about GPUs, but from what I do know, I believe most of their calculations are done in floating-point arithmetic. I therefore think it's likely that the output of the depth-estimator preprocessor, and the input to the ControlNet model that uses the computed depth values, are floating-point numbers. If the data are conveyed via the image, they're converted from floats to 8-bit unsigned integers for storage in the image, and back from integers to floats to be processed by the model.
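That round trip through an 8-bit image can be demonstrated directly. This is just an illustration of the information loss, not the actual extension code:

```python
import numpy as np

def roundtrip_8bit(x):
    """Quantize normalized float depths to uint8 and convert back.

    This models what happens if float depth values travel through a
    grayscale image: any two values closer together than 1/255 can
    collapse to the same integer.
    """
    q = np.round(np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)
    return q.astype(np.float64) / 255.0

# 10,000 distinct input depths survive as only 256 distinct outputs.
vals = np.linspace(0.0, 1.0, 10_000)
recovered = roundtrip_8bit(vals)
print(len(np.unique(recovered)))       # 256
print(np.abs(vals - recovered).max())  # worst-case error ~ 1/510
```

So if the image really is the data path, the generation step never sees more than 256 depth levels, no matter how precise the estimator is.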
Possibly the model directly uses the output of the preprocessor, and the grayscale depth image is only intended as a user-friendly representation of the underlying data. If that's so, the answer to the question in my comment title is "No" -- the full precision of the depth data is already being used.
If, however, the depth image is the output of the preprocessor and the input to the ControlNet model, then unless the precision of the depth preprocessor is less than 1 part in 256 throughout its range, the image generation would likely benefit from a more precise representation of the depth values.
The easy way to do that would be to treat the concatenated RGB components of the image as 24-bit numbers. That would no doubt provide far more precision than required. Perhaps, though, it's considered desirable for the image to be an intuitive representation of the data; if so, representing the depth values becomes more challenging. One suggestion is to use a pseudo-color scheme. For instance, the rainbow-like progression Black->Red->Yellow->Green->Cyan->Blue->Magenta->White could provide seven times the precision, which is 1 part in 1792.
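The 24-bit idea could be sketched like this. To be clear, this is a hypothetical encoding of my own, not something the pipeline currently does, and the function names are mine. Putting the most significant byte in R means the red channel alone still looks like a coarse grayscale depth map:

```python
import numpy as np

def depth_to_rgb24(d):
    """Pack a normalized depth in [0, 1] into 24 bits across R, G, B.

    R holds the most significant 8 bits, B the least significant, so
    viewing just the R channel recovers the familiar 8-bit depth map.
    """
    v = np.uint32(np.clip(d, 0.0, 1.0) * (2**24 - 1))
    return (v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF

def rgb24_to_depth(r, g, b):
    """Invert the packing, with worst-case error below 1/(2**24 - 1)."""
    v = (np.uint32(r) << 16) | (np.uint32(g) << 8) | np.uint32(b)
    return float(v) / (2**24 - 1)
```

The pseudo-color rainbow ramp would be the human-readable alternative: far fewer levels (1 part in 1792 rather than 1 in ~16.7 million), but each level still renders as a visually sensible color.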