xGETC2 order of loops in finding max element of submatrix #1021
Comments
These are three good points. (1) Replacing this (I,J) loop with a (J,I) loop should give better performance for column-major matrices. (2) Changing the loops from (I,J) to (J,I) might change the chosen pivot when two entries tie, and so might change the permutation. (3) These outputs, while different, are equally valid complete pivoting factorizations P A Q = L U. It is not clear how much performance gain there would be, if any. That being said, I feel that, whenever possible in LAPACK, we want to write our loops with column-major storage in mind, and so, just for the sake of consistency, I feel it is better to have a (J,I) loop than an (I,J) loop here. It would be nice to know whether there is a measurable gain in practice. It is also not clear how problematic a routine with a (J,I) loop would be in the current software stack. For example, would the (J,I) loop variant pass our own LAPACK test suite? More generally, would it be a problem for applications that expect the (I,J) loop behavior in case of a tie? I do not know. My opinion: all in all, I would be fine with reversing the loops from (I,J), the current order, to (J,I), the proposed order. If it passes the LAPACK test suite, then I think that should be fine and we could merge this.
OK then, I will check whether the tests pass with this optimization and open a pull request. If they do not pass, I will report that here. I also forgot to mention a couple of things about this optimization. First, it is better for the cache. Second, when I is the contiguous index, the nested loops are easier to vectorize. As far as I remember, the function spends more than 50% of its time in this loop nest, so it is a pretty hot spot. Of course, this depends on the architecture, the level of optimization of the other functions, and the order of the input matrices, but in my experience it is a hot spot. I also have the opportunity to check performance on a couple of architectures, and I can share the results here. In any case, I think it is important to change the order of the loops here. Thanks for your answer.
Fixed with #1023.
Hello everyone
I've noticed that the order of the loops in xGETC2 is not optimal: it traverses the matrix row by row, but our matrices are stored in column-major layout. Wouldn't it be better to change the order of the loops so that we get a more cache-friendly algorithm?
This can make a difference when the matrix has two equal maximum elements: we would get a different LU decomposition and different ipiv/jpiv arrays. However, it seems that both results should be correct.
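To make the tie case concrete, here is a minimal sketch in C (not the LAPACK source). The names `pivot_t`, `pivot_ij` and `pivot_ji` are hypothetical, the indices are 0-based, and the "last visited entry wins a tie" rule (`>=`) is an assumption made for illustration.

```c
/* Position and magnitude of the chosen pivot (0-based indices). */
typedef struct {
    int row;
    int col;
    double absmax;
} pivot_t;

/* Absolute value without pulling in the math library. */
double abs_val(double v) { return v < 0.0 ? -v : v; }

/* (I,J) order: outer loop over rows, inner loop over columns.
   The matrix is column-major: element (i,j) is stored at a[i + j*lda]. */
pivot_t pivot_ij(const double *a, int n, int lda, int k)
{
    pivot_t p = { k, k, 0.0 };
    for (int i = k; i < n; ++i) {
        for (int j = k; j < n; ++j) {
            double v = abs_val(a[i + j * lda]);
            if (v >= p.absmax) {   /* assumed tie rule: last visited wins */
                p.absmax = v;
                p.row = i;
                p.col = j;
            }
        }
    }
    return p;
}

/* (J,I) order: outer loop over columns, inner loop over rows. */
pivot_t pivot_ji(const double *a, int n, int lda, int k)
{
    pivot_t p = { k, k, 0.0 };
    for (int j = k; j < n; ++j) {
        for (int i = k; i < n; ++i) {
            double v = abs_val(a[i + j * lda]);
            if (v >= p.absmax) {   /* same assumed tie rule */
                p.absmax = v;
                p.row = i;
                p.col = j;
            }
        }
    }
    return p;
}
```

For the 2x2 matrix with rows (1, 3) and (3, 2), stored column-major as `{1, 3, 3, 2}`, the two searches pick different positions holding the same magnitude, so the permutations differ while the pivot size does not.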
For example, see SRC/sgetc2.f, lines 178-187:
Changing the order of the loops makes IP the contiguous index, so the cache is used more effectively.
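As a rough illustration of why the inner index matters, the sketch below (C, 0-based, hypothetical names `colmajor_index` and `absmax_ji`) shows that in column-major storage a step in the row index moves one element in memory, while a step in the column index moves `lda` elements. It is only a sketch of the proposed traversal order, assuming a strict `>` comparison, and it is not the LAPACK routine.

```c
#include <stddef.h>

/* Linear offset of element (i,j) in a column-major array with
   leading dimension lda. */
size_t colmajor_index(size_t i, size_t j, size_t lda)
{
    return i + j * lda;
}

/* Largest absolute value in the trailing submatrix starting at (k,k),
   searched in (J,I) order so that the inner loop reads consecutive
   memory locations. The position found is written to *ipv and *jpv. */
double absmax_ji(const double *a, size_t n, size_t lda, size_t k,
                 size_t *ipv, size_t *jpv)
{
    double xmax = 0.0;
    *ipv = k;
    *jpv = k;
    for (size_t j = k; j < n; ++j) {
        const double *col = a + j * lda;    /* start of column j */
        for (size_t i = k; i < n; ++i) {    /* stride-1 access */
            double v = col[i] < 0.0 ? -col[i] : col[i];
            if (v > xmax) {                 /* assumed: first maximum wins */
                xmax = v;
                *ipv = i;
                *jpv = j;
            }
        }
    }
    return xmax;
}
```

With `lda = 100`, `colmajor_index(1, 0, 100) - colmajor_index(0, 0, 100)` is 1, whereas `colmajor_index(0, 1, 100) - colmajor_index(0, 0, 100)` is 100, which is the stride the current row-outer loop pays on every inner iteration.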