Online quantization algorithm for gudhi #536
Co-authored-by: Vincent Rouvreau <10407034+VincentRouvreau@users.noreply.github.com>
…evel into quantization_v2
This comment was marked as resolved.
Corrected.
I guess keeping it in wasserstein/ is ok.
different tori with some additional noise.
Starting from an initial codebook ``c0``, centroids are iteratively updated as new diagrams are provided.
As we use the standard metrics between persistence diagrams (denoted here by :math:`\mathrm{OT}_2`), points in the
diagrams that are close to the diagonal do not interfere in the codebook update process.
So it is the same as having an implicit point on the diagonal in the codebook?
More precisely, having a point in the codebook that represents "all the points on the diagonal" (or, formally, looking at the quotient space where you identify the points on the diagonal).
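To make the "one extra codebook entry for the whole diagonal" picture concrete, here is a minimal sketch, not the code from this PR. The function name `assign_to_codebook` is hypothetical, and it assumes the Euclidean ground metric (`internal_p=2`), for which the distance from a point `(b, d)` to the diagonal is `|d - b| / sqrt(2)`.

```python
import numpy as np

def assign_to_codebook(points, codebook):
    """Hypothetical helper: index of the nearest centroid for each point.

    The returned index equals len(codebook) when the diagonal is closer
    than every centroid, i.e. the diagonal acts as one extra codebook entry.
    """
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    codebook = np.asarray(codebook, dtype=float).reshape(-1, 2)
    # Pairwise Euclidean distances between points and centroids, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2)
    # Euclidean distance from each point (b, d) to the diagonal.
    to_diag = np.abs(points[:, 1] - points[:, 0]) / np.sqrt(2)
    # Append the diagonal as a (k+1)-th column, then pick the closest entry.
    full = np.hstack([dists, to_diag[:, None]])
    return np.argmin(full, axis=1)

labels = assign_to_codebook([[0, 1.0], [0, 0.1], [2, 5]], [[0, 1.2], [2, 4.5]])
print(labels)  # the second point is assigned to the diagonal (index 2)
```

Points assigned to the last index would simply be ignored in the centroid update, which matches the statement that points close to the diagonal do not interfere.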
Co-authored-by: Marc Glisse <marc.glisse@inria.fr>
I just realized that I never managed to do the last requested modifications (my local build was broken for some reason at that time). PS: and one day later I realize that I forgot to post this comment... 😴
The algorithm is presented as an online algorithm. So it should be normal to give it some data, look at the codebook at that point, pass it more data, look at the updated codebook, etc. The `init` parameter could be used towards that goal, but the number of diagrams (or batches) already processed is forgotten, and indeed `t` (the learning rate) is reset to 0 at every call.
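One possible shape for a stateful interface, as a rough sketch only: the class name `OnlineCodebook` and the method `partial_fit` are hypothetical and not part of this PR. It assumes the Euclidean ground metric, takes `grad` to be the centroid minus the mean of the points assigned to it (an assumption, since the PR's definition of `grad` is not quoted here), and reuses the `1 / (t + 1)` step from the quoted update.

```python
import numpy as np

class OnlineCodebook:
    """Hypothetical wrapper that remembers the step counter between calls."""

    def __init__(self, c0):
        self.codebook = np.array(c0, dtype=float).reshape(-1, 2)
        self.t = 0  # number of batches processed so far; never reset

    def partial_fit(self, batch):
        batch = np.asarray(batch, dtype=float).reshape(-1, 2)
        if len(batch):
            dists = np.linalg.norm(batch[:, None, :] - self.codebook[None, :, :], axis=2)
            to_diag = np.abs(batch[:, 1] - batch[:, 0]) / np.sqrt(2)
            # Last column stands for the diagonal; points sent there are ignored.
            nearest = np.argmin(np.hstack([dists, to_diag[:, None]]), axis=1)
            for j in range(len(self.codebook)):
                assigned = batch[nearest == j]
                if len(assigned):
                    # Assumed gradient: centroid minus mean of its assigned points.
                    grad = self.codebook[j] - assigned.mean(axis=0)
                    self.codebook[j] -= grad / (self.t + 1)
        self.t += 1
        return self.codebook

cb = OnlineCodebook([[0.0, 1.0]])
cb.partial_fit([[0.0, 3.0]])  # first batch, step size 1
cb.partial_fit([[0.0, 5.0]])  # second batch, step size 1/2 because t was kept
```

With this layout the user can inspect `cb.codebook` after any call and keep feeding data without the learning rate starting over.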
(the two loops generating the tori).

.. figure::
    ./img/quantiz.gif
On the one hand, the GIF is cool. On the other hand, I have trouble reading the doc with that thing moving on my screen...
if withdiag:
    a = np.argmin(M[:-1, :], axis=1)
else:
    a = np.argmin(M[:-1, :-1], axis=1)
It feels a bit strange to call `_build_dist_matrix`, whose main difference with `cdist` is that it adds the diagonal, just to drop the diagonal immediately... But I don't think it really matters.
    X_batch = np.concatenate(list_of_non_empty_diags)
    return X_batch
else:
    return np.array([])
It is sometimes useful to force the shape of empty arrays, to `(0,2)` for instance. I don't know if that's the case here.
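For illustration, a small sketch of what forcing the shape could look like. The helper name `concat_diagrams` is hypothetical and this is not the PR's code. `np.array([])` has shape `(0,)`, whereas `np.empty((0, 2))` can still be indexed with `[:, 0]` and `[:, 1]` like any other diagram.

```python
import numpy as np

def concat_diagrams(diags):
    """Hypothetical helper: stack diagrams into one (n, 2) array, even when n == 0."""
    non_empty = [np.asarray(d, dtype=float).reshape(-1, 2) for d in diags if len(d) > 0]
    if non_empty:
        return np.concatenate(non_empty)
    # Keep the second dimension so downstream column indexing keeps working.
    return np.empty((0, 2))

empty_batch = concat_diagrams([[], []])
print(empty_batch.shape)  # (0, 2)
print(np.array([]).shape)  # (0,)
```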
:param internal_p: Ground metric to assess centroid assignment. Default is ``2.``.
:type internal_p: ``float``

:returns: The final codebook obtained after going through the whole ``pdiagset``.
`:rtype: kx2 numpy array`?
def _init_c(pdiagset, k, internal_p=2):
    """
    A naive heuristic to initialize a codebook: we take the k points with largest distances to the diagonal
What if the first diagram has fewer than k points?
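One way to handle that case, sketched under assumptions: pool points from successive diagrams until at least `k` are available, and fail loudly otherwise. The function name `init_codebook` is hypothetical, this is not the PR's `_init_c`, and it ranks points by `|d - b|`, which is proportional to the distance to the diagonal for any ground norm.

```python
import numpy as np

def init_codebook(pdiagset, k):
    """Hypothetical initializer: the k most persistent points among the first diagrams."""
    pooled = np.empty((0, 2))
    for diag in pdiagset:
        diag = np.asarray(diag, dtype=float).reshape(-1, 2)
        pooled = np.concatenate([pooled, diag])
        if len(pooled) >= k:
            break  # enough points collected, stop reading diagrams
    if len(pooled) < k:
        raise ValueError("pdiagset contains fewer than k=%d points in total" % k)
    persistence = np.abs(pooled[:, 1] - pooled[:, 0])
    order = np.argsort(persistence)[::-1]  # most persistent first
    return pooled[order[:k]]

c0 = init_codebook([[[0, 1]], [[0, 3], [1, 1.5]]], k=2)
```

Whether pooling across diagrams or shrinking `k` is the better behaviour is a design decision for the PR; the sketch only shows that the corner case can be made explicit.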
:param batch_size: Size of batches used during the online exploration of the ``pdiagset``.
    Default is ``1``.
As a user, should I stick to the default value of 1? If I already have all the diagrams, I may think that I don't need an online algorithm, which is for when data appears progressively, and consider using one huge batch under the impression that it disables the "online" stuff and gets the best result.
    # stochastic-gradient-descent like approach (decreasing learning rate).
    c_current[j] = c_current[j] - grad / (t + 1)
else:
    raise NotImplementedError('Order = %s is not available yet. Only order=2. is valid' % order)
I think you could error out earlier (or not provide this option at all and just say that it is W2).
Provide a quantization algorithm to "summarize" a collection of persistence diagrams.
(At least) one thing that may be discussed:
The code currently sits in the `python/gudhi/wasserstein/` repo, because it has a "Wasserstein metric" flavor (we minimize something in terms of Wasserstein distance between persistence diagrams). However, it does not rely on POT as other functions in this repo do; we never actually need to compute a Wasserstein distance/matching explicitly. Perhaps it would belong directly in the `gudhi/` repo?
Also TODO:
`quantization.py`: is the copyright correct?