-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Understanding/Use Case: Clustering of repeated patterns #34
Comments
@ninimama would you mind mentioning what the issue was? I imagine that you resolved the issue as you closed it. |
Actually I didn't resolve the issue. But I couldn't understand whether it is a bug or not. My main concern is: If yes, it means it doesn't take into account the actual magnitude in the patterns. Is that right? So, if I want to find the best matches in my case where actual magnitude matters, it doesn't work for me. Is there any way to add an argument that can give the user the ability to choose whether to use normalized ED or not (at the cost of higher computation time)? |
NOTE: ts_i = ts[i:i+H], where H is the window 24. My data (accessible via the provided link in the previous post) is a time series with length 8760. the values are between 0 and 1. In my data, there is a range where all values are zeros: mydata[421:450]. According to the result: profile['pi'][45] = 425 I attached the plot of ts_45 and ts_425 in the previous post. And, it seems a little bit weird to me such a result. Then I realized that: But, since MatrixProfile uses z-normalized Euclidean Distance, we'll get a different result. I am not sure if this is just for the case where a subsequence is all zeros or such problem can appear in other cases as well since comparing distances of normalized subsequences is different that the comparison of the subsequences themselves. |
The matrix profile uses z-normalization, but it also compares sequences in a sliding window fashion. Magnitude should be a factor here. There is a known issue with the formulation when there is a constant value over the window size causing nans and infs. Generally, adding some random noise to your time-series samples solves this issue. Will you please respond to/attempt the following:
mp.analyze(time_series, windows=H) |
Thanks for the clarification. In fact, I am at the exploration stage, and yes, I am trying to find some sort of repeated patterns. I am trying to see if it makes sense to cluster 365 daily sequences of this 8760-hr data which can be seen in a 365 days-by-24 hours format. Because, according to MatrixProfile, the closest pattern to each daily sequence is not necessarily start at the beginning of a day at 00:00 hr. So, I am trying to see if MatrixProfile can help me better understand the data and if I can come up with a clustering approach by using such information. |
Because, according to MatrixProfile, the closest pattern to each daily sequence is not necessarily start at the beginning of a day at 00:00 hr. Yes, it will not always be exactly at the start/end of a day. It primarily depends on the data. If I understand you correctly, you want to simply cluster many time series together. In this case, you should consider using clustering with MPDist. Coincidentally, I wrote a few code tutorials on how to use MPDist with various clustering algorithms. HDBScan + MPDist is one of my go-to methods right now. Please see the following:
Is this closer to what you are trying to do as an end result? |
That's correct. I want to cluster many time series together. But the problem with MPDist is that it matches sequences according to their similarity of subsequences regardless of their order which is not desirable for my case. So, I am looking for sth like the one mentioned in this paper (section 3.9): I think it is different than the MPDist use cases. |
It seems the link I provided isn't working when I clicked on it. So, here is the name of the paper: The Swiss Army Knife of Time Series Data Mining: Ten |
I understand where you are heading now. You basically tried this approach? If order matters and you are trying to handle warping windows, dynamic time warping may be a good approach. From my experience when you think order matters and that MPDist may not give desirable results, it actually does. Since you are the domain expert, you may be interested in DTW with LB_Keogh: |
In fact, I tried DTW and some of its variants. But I wanted to see if MatrixProfile can bring sth new to the picture. I will check out MSMD. So, if I understand correctly, what you are saying is that there might be a chance that MPDist gives some sort of interesting result. Am I right? So, I will give it a shot. Thanks for sharing your viewpoints on this matter. |
[Update] Here is the code: ` ` |
Would it work to use MPDist for clustering and then sort/recluster them by euclidean distance? |
@ninimama please see the attached PDF where I add some random noise to your time series. I think this will solve some of the issues you are seeing. I took your toy dataset and noticed there are many cases where there are constant values. This tends to impact the results. Adding the noise is a workaround. Let me know if this looks more desirable. |
Hi tylerwarrs, I was finishing my paper and studying git (for my own sake and github) Recently, I did a pull to change snippets, but after checking out the logs, I realize another person (Yui Lu) already modified the algorithm (Please note that the docstring needs to be updated accordingly since a new key "neighbor" is added to the dictionary) ========================================= Thanks for the PDF. In fact, I was trying to see if I can extend the material/concept you provided but I couldn't. Although it is good for extracting similar patters, it doesn't cover the whole sequence. So, my goal is to see if I can extend MP to cluster and cover the whole dataset by finding similar patterns (and maybe allowing some number of elements to be outliers.) I will start to think about such an algorithm after finalizing my current work. I haven't used the MatrixProfile concept yet in my current work and I am working in parallel! Unfortunately, I am not a good multi-tasking guy! |
Hi,
I recently found sth strange while using a matrix profile.
I used the data (1d-array: 8760 hourly data between 0 and 1, so not normalized) saved into this CSV file:
https://drive.google.com/file/d/1dZgaEZjD2HsnXu_wWaGKPCPRbDJ1_dty/view?usp=sharing
And here is my simple code:
`
#mydata: read from csv file (with length: 8760)
H=24
profile = mp.compute(mydata, windows=[H])
'
'
ind_min = np.argmin(profile['mp'])
ind_min_pair = profile['pi'][ind_min]
print('min of dist [best match of all]', profile['mp'][ind_min])
plt.plot(mydata[ind_min:ind_min+H])
plt.plot(mydata[ind_min_pair : ind_min_pair+H])
plt.show()
`
So, the goal was to find the closest pair of sequences (with length H=24). And here is the result:
It is noteworthy to mention that I also checked out the plot for the first few indices (the matches for index from 0 to 20) and the plots were good. However, when I tried to get the best match of all, I got the result attached above!
The same problem exists for these two pair of indices as well: (44,424) and (46,426), with min dist=0
or (63,425):
Am I missing sth here? Isn't MP 1-dim array is: np.min(distance_matrix,axis=1)? (where distance_matrix[i,j] is the Euclidean distance between ts[i,i+H] and ts[j,j+H]) ???
If I understand correctly, mp calculated is the Euclidean Distance between normalized sequences. However, mydata[421:421+H] is all zeros. Do you know how the algorithm treats a situation where x (one of the subsequences in the time series) is all zeros: x_norm = (x-0) / 0 ???
##############################################################################################
I tried to figure out how muinvn function works but I couldn't.
The text was updated successfully, but these errors were encountered: