xGETC2 order of loops in finding max element of submatrix #1021

Closed
Goddan-wq opened this issue Jun 1, 2024 · 3 comments

Goddan-wq commented Jun 1, 2024

Hello everyone

I've noticed that the order of the loops in xGETC2 is not optimal. The inner loop walks along a row of the matrix, but our matrices are stored in column-major layout. Wouldn't it be better to swap the loops to get a more cache-friendly algorithm?

This can change the result when the submatrix contains two equal maximum elements: we would get a different LU decomposition and different ipiv/jpiv arrays. But it seems that both results should be correct.

For example, see SRC/sgetc2.f, lines 178-187:

XMAX = ZERO 
DO 20 IP = I, N
    DO 10 JP = I, N
        IF( ABS( A( IP, JP ) ).GE.XMAX ) THEN
            XMAX = ABS( A( IP, JP ) )
            IPV = IP
            JPV = JP
        END IF
    10 CONTINUE
20 CONTINUE

Swapping the loops makes IP the contiguous index of the inner loop, so the cache is used more effectively.
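To make the contiguity argument concrete, here is a minimal Python sketch. It is not LAPACK code: the function name `visited_offsets` and the `"IJ"`/`"JI"` labels are made up for illustration. It lists the flat memory offsets touched while scanning the trailing submatrix of a column-major array with leading dimension `lda`, for both loop orders.

```python
def visited_offsets(n, lda, start, order):
    """Return the 0-based flat offsets visited while scanning the trailing
    submatrix A(start:n, start:n) of a column-major array.

    order="IJ": outer loop over rows IP, inner loop over columns JP (current).
    order="JI": outer loop over columns JP, inner loop over rows IP (proposed).
    Indices are 1-based, as in the Fortran source.
    """
    idx = range(start, n + 1)
    if order == "IJ":
        pairs = [(ip, jp) for ip in idx for jp in idx]
    elif order == "JI":
        pairs = [(ip, jp) for jp in idx for ip in idx]
    else:
        raise ValueError("order must be 'IJ' or 'JI'")
    # Column-major storage: element A(ip, jp) lives at (ip-1) + (jp-1)*lda.
    return [(ip - 1) + (jp - 1) * lda for ip, jp in pairs]


if __name__ == "__main__":
    print(visited_offsets(3, 3, 1, "IJ"))  # strided accesses
    print(visited_offsets(3, 3, 1, "JI"))  # contiguous accesses
```

With the (J,I) order the offsets increase by 1 inside each column, whereas with the (I,J) order consecutive accesses are `lda` elements apart.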

langou commented Jun 2, 2024

These are three good points. (1) Replacing this (I,J) loop with a (J,I) loop should give better performance for column-major matrices. (2) Changing the loops (from (I,J) to (J,I)) might change the chosen pivot in case of a draw between two entries, and so might change the permutation. (3) These outputs (while different) are equally valid complete pivoting factorization P A Q = L U.
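Point (2) can be illustrated with a small Python sketch. This is an illustration only, not the LAPACK routine: the helper `find_pivot` and its `order` argument are hypothetical. It mirrors the `.GE.` comparison of the Fortran loop, so among tied entries the last one visited wins.

```python
def find_pivot(a, i, order):
    """Search the trailing submatrix a[i..n][i..n] (1-based indices) for the
    entry of largest absolute value, mimicking the `.GE.` test in xGETC2.

    a is a list of rows; order is "IJ" (rows outer) or "JI" (columns outer).
    Returns (xmax, ipv, jpv).
    """
    n = len(a)
    idx = range(i, n + 1)
    if order == "IJ":
        pairs = [(ip, jp) for ip in idx for jp in idx]
    elif order == "JI":
        pairs = [(ip, jp) for jp in idx for ip in idx]
    else:
        raise ValueError("order must be 'IJ' or 'JI'")
    xmax, ipv, jpv = 0.0, i, i
    for ip, jp in pairs:
        value = abs(a[ip - 1][jp - 1])
        if value >= xmax:  # ">=" means a later tie replaces an earlier one
            xmax, ipv, jpv = value, ip, jp
    return xmax, ipv, jpv


if __name__ == "__main__":
    tied = [[1.0, 5.0],
            [5.0, 1.0]]
    print(find_pivot(tied, 1, "IJ"))  # pivot found at row 2, column 1
    print(find_pivot(tied, 1, "JI"))  # pivot found at row 1, column 2
```

Both calls find the same maximum magnitude but report different pivot positions, which is why the row and column permutations can differ while each factorization remains valid.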

It is not clear how much performance gain there would be, if any. That said, I feel that in LAPACK we want to write our loops with column-major storage in mind whenever possible, so for the sake of consistency alone I think a (J,I) loop is better than an (I,J) loop here. It would be nice to know whether there is a measurable gain in practice.

It is not clear how problematic a routine with a (J,I) loop would be in the current software stack. For example, would the (J,I) variant pass our own LAPACK test suite? More generally, would it be a problem for applications that expect the (I,J) loop's choice in case of a tie? I do not know.

My opinion: All in all, I would be fine with reversing the loops from (I,J) - current, to (J,I) - proposed. If it passes the LAPACK test suite, then I think that should be fine and we could merge this.

@Goddan-wq

OK then, I will check whether the tests pass after this optimization and open a pull request. If they do not pass, I will report that here.

Actually, I forgot to mention a couple of things about this optimization. First, it is better for the cache. Second, when 'I' is the contiguous index, it is easier to vectorize these nested loops. As far as I remember, the routine spends more than 50% of its time in this loop nest, so it is a hot spot. Of course, that depends on the architecture, on how well the other routines are optimized, and on the order of the input matrices, but in my experience it is a hot spot. I can check performance on a couple of architectures and share the results here. In any case, I think it is important to change the loop order here.

Thanks for your answer

langou commented Jun 6, 2024

fixed with #1023

langou closed this as completed Jun 6, 2024