Hi! I am using this code to work on a dataset of ~700 words. For each word I vary several variables (size, font, position, etc.). This results in a dataset that is too big (over 6M instances) to use every possible combination during training, so I decided to train on a sample of the full dataset. That is fine for training, but it creates an issue during the evaluation run.
The first technical issue appears in evaluate.compute_metrics(). This method reshapes the samples_zCx and params_zCx tensors using the sizes of the dataset generation factors (lat_sizes) and the size of the latent layer (latent_dim). That works when the dataset contains every possible combination of factors, but since I only keep a sample of the full grid, the number of instances no longer equals the product of lat_sizes, so the reshape fails.
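To make the mismatch concrete, here is a toy sketch. The factor sizes and sample count are made up for illustration; only lat_sizes, latent_dim and samples_zCx are names from the repo:

```python
import torch

# Made-up factor sizes for illustration; the real ones come from the dataset.
lat_sizes = (700, 5, 20, 9)   # word, size, font, position
latent_dim = 10

n_full = 700 * 5 * 20 * 9     # 630,000 cells in the full factor grid
n_sample = 100_000            # instances actually kept after subsampling

samples_zCx = torch.randn(n_sample, latent_dim)

# This is what compute_metrics() effectively tries to do; it fails here
# because n_sample != prod(lat_sizes):
# samples_zCx.view(*lat_sizes, latent_dim)  # RuntimeError: invalid shape
```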
I worked around this by allocating a tensor filled with np.nan and writing the available data into the corresponding cells (using metadata from the dataset that records how each instance was generated). Technically this works, but I have doubts about how this solution affects the downstream calculations. That is, I now have a tensor containing NaNs that will be used to compute the conditional entropy H(z|v). Is this OK, or would it be better to use zeros?
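For reference, my workaround looks roughly like the sketch below. The function name and factor_indices (my metadata array mapping each instance to its cell in the factor grid) are my own, not part of the repo:

```python
import numpy as np
import torch

def scatter_into_factor_grid(samples_zCx, factor_indices, lat_sizes, latent_dim):
    """Place each latent sample into its cell of the (possibly incomplete) factor grid.

    samples_zCx    : (n_sample, latent_dim) tensor of latent samples
    factor_indices : (n_sample, len(lat_sizes)) numpy integer array; row i gives
                     the generative-factor values used to create instance i
    lat_sizes      : sizes of the generative factors, e.g. (700, 5, 20, 9)

    Returns a tensor of shape (*lat_sizes, latent_dim), filled with NaN wherever
    the corresponding factor combination was not sampled.
    """
    grid = torch.full((*lat_sizes, latent_dim), float("nan"))
    # Convert the multi-dimensional factor indices to flat indices into the grid.
    flat_idx = np.ravel_multi_index(factor_indices.T, lat_sizes)
    grid.view(-1, latent_dim)[torch.as_tensor(flat_idx)] = samples_zCx
    return grid
```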
Additionally, computing the conditional entropy with _estimate_H_zCv() is quite expensive because the tensor is large and mostly NaN. Would it be OK to skip the NaN cells to speed up the process?
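For the skipping, what I have in mind is essentially a masked average over the valid cells only, something like the helper below. This is a sketch under my own assumptions, not a drop-in replacement for _estimate_H_zCv():

```python
import torch

def masked_mean(per_cell_estimates, dim=0):
    """Average per-cell estimates while ignoring NaN cells.

    per_cell_estimates : tensor of per-cell quantities (e.g. log-density or
                         entropy estimates) where unsampled factor combinations
                         hold NaN.
    Replaces NaN cells with zero and divides by the count of valid cells,
    which is equivalent to dropping them from the average.
    """
    valid = ~torch.isnan(per_cell_estimates)
    filled = torch.where(valid, per_cell_estimates,
                         torch.zeros_like(per_cell_estimates))
    return filled.sum(dim=dim) / valid.sum(dim=dim).clamp(min=1)
```

(On a recent PyTorch, torch.nanmean(per_cell_estimates, dim=dim) would do the same thing.)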