xGETC2 order of loops in finding max element of submatrix #1021

Goddan-wq · 2024-06-01T15:30:48Z

Hello everyone

I've noticed that the order of the loops in xGETC2 is not optimal. The pivot search traverses the matrix row by row, but our matrices are stored in column-major order. Wouldn't it be better to swap the loops so that the algorithm is more cache-friendly?

This can change the result when the matrix contains two maximum elements of equal magnitude: we would then get a different LU decomposition and different ipiv/jpiv arrays. But it seems that both of these results should be correct.

For example, see SRC/sgetc2.f, lines 178-187:

XMAX = ZERO 
DO 20 IP = I, N
    DO 10 JP = I, N
        IF( ABS( A( IP, JP ) ).GE.XMAX ) THEN
            XMAX = ABS( A( IP, JP ) )
            IPV = IP
            JPV = JP
        END IF
 10 CONTINUE
 20 CONTINUE

Swapping the loops makes IP the inner, contiguous index, so the cache is used more effectively.
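A minimal sketch of the reordered search, written as a free-form Fortran module for illustration. The module and subroutine names (`pivot_search_demo`, `find_pivot_cols_outer`) and the argument list are assumptions made for this example, not LAPACK's; in xGETC2 the search is inline.

```fortran
module pivot_search_demo
   implicit none
contains
   ! Hypothetical helper (not a LAPACK routine): search the trailing
   ! submatrix A(I:N, I:N) for the entry of largest magnitude.
   ! The inner loop runs over the row index IP, so consecutive
   ! iterations touch adjacent memory in Fortran's column-major layout.
   subroutine find_pivot_cols_outer(n, i, a, lda, ipv, jpv, xmax)
      integer, intent(in)  :: n, i, lda
      real,    intent(in)  :: a(lda, *)
      integer, intent(out) :: ipv, jpv
      real,    intent(out) :: xmax
      integer :: ip, jp

      xmax = 0.0
      ipv = i
      jpv = i
      do jp = i, n          ! outer loop: columns
         do ip = i, n       ! inner loop: rows (stride-1 access)
            if (abs(a(ip, jp)) >= xmax) then
               xmax = abs(a(ip, jp))
               ipv = ip
               jpv = jp
            end if
         end do
      end do
   end subroutine find_pivot_cols_outer
end module pivot_search_demo
```

The comparison keeps `>=` as in the snippet above, so among entries of equal magnitude the last one visited wins. Only the visiting order changes.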

langou · 2024-06-02T15:22:24Z

These are three good points. (1) Replacing this (I,J) loop with a (J,I) loop should give better performance for column-major matrices. (2) Changing the loops from (I,J) to (J,I) might change the chosen pivot when two entries tie, and so might change the permutation. (3) These outputs, while different, are equally valid complete-pivoting factorizations P A Q = L U.
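Points (2) and (3) can be seen on a 2-by-2 example. The sketch below is hypothetical helper code (the module `pivot_tie_demo`, the subroutine `find_pivot`, and the flag `cols_outer` are all made up for this illustration) that runs the same `>=` search in either loop order.

```fortran
module pivot_tie_demo
   implicit none
contains
   ! Hypothetical helper (not a LAPACK routine): return the position of
   ! the largest-magnitude entry of the N-by-N matrix A, scanning with
   ! rows outermost (cols_outer = .false., the (I,J) order) or with
   ! columns outermost (cols_outer = .true., the (J,I) order).
   subroutine find_pivot(n, a, cols_outer, ipv, jpv)
      integer, intent(in)  :: n
      real,    intent(in)  :: a(n, n)
      logical, intent(in)  :: cols_outer
      integer, intent(out) :: ipv, jpv
      real    :: xmax
      integer :: ip, jp

      xmax = 0.0
      ipv = 1
      jpv = 1
      if (cols_outer) then
         do jp = 1, n
            do ip = 1, n
               call visit(ip, jp)
            end do
         end do
      else
         do ip = 1, n
            do jp = 1, n
               call visit(ip, jp)
            end do
         end do
      end if
   contains
      ! Same test as in the issue: ">=" means the last tied entry wins.
      subroutine visit(r, c)
         integer, intent(in) :: r, c
         if (abs(a(r, c)) >= xmax) then
            xmax = abs(a(r, c))
            ipv = r
            jpv = c
         end if
      end subroutine visit
   end subroutine find_pivot
end module pivot_tie_demo
```

For A = [1 5; 5 1], both off-diagonal entries tie at magnitude 5. Tracing the loops by hand, the (I,J) order visits A(2,1) last among the tied entries and returns (IPV, JPV) = (2, 1), while the (J,I) order visits A(1,2) last and returns (1, 2). Either choice moves an entry of maximal magnitude to the pivot position, which is all complete pivoting requires.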

It is not clear how much performance gain there would be, if any. That being said, I feel that, whenever possible, in LAPACK, we want to write our loops with column-major storage in mind, and so, for the sake of consistency, I feel it is better to have a (J,I) loop than an (I,J) loop here. It would be nice to know whether there is a measurable gain in practice.

It is not clear how problematic a routine with a (J,I) loop would be in the current software stack. For example, would the (J,I) loop variant pass our own LAPACK test suite? More generally, would it be a problem for applications that expect the (I,J) loop's choice in case of a tie? I do not know.

My opinion: All in all, I would be fine with reversing the loops from (I,J) - current, to (J,I) - proposed. If it passes the LAPACK test suite, then I think that should be fine and we could merge this.

Goddan-wq · 2024-06-02T16:42:09Z

OK then, I will check whether the tests pass with this optimization and open a pull request. If they do not pass, I will report that here.

Actually, I forgot to mention a couple of things about this optimization. First, it is better for the cache. Second, making 'I' the contiguous index makes this loop nest easier to vectorize. As far as I remember, the function spends more than 50% of its time in this loop nest, so it is a pretty hot spot. Of course, that depends on the architecture, on how well the other functions are optimized, and on the order of the input matrices. But in my experience it is a hot spot. I also have the opportunity to check performance on a couple of architectures and can share the results here. In any case, I think it is important to change the order of the loops here.
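One way to measure the effect is a small timing harness like the sketch below. It is only an assumption about how such a measurement could be set up: the module `pivot_timing_demo` and the subroutine `time_search` are hypothetical, it times the search alone rather than all of xGETC2, and the numbers it reports will depend on the compiler, flags, and machine.

```fortran
module pivot_timing_demo
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)
contains
   ! Hypothetical timing helper (not part of LAPACK): run the pivot
   ! search over the full N-by-N matrix A in the chosen loop order and
   ! report the elapsed wall-clock time in seconds.
   subroutine time_search(n, a, cols_outer, ipv, jpv, seconds)
      integer, intent(in)  :: n
      real,    intent(in)  :: a(n, n)
      logical, intent(in)  :: cols_outer
      integer, intent(out) :: ipv, jpv
      real,    intent(out) :: seconds
      real        :: xmax
      integer     :: ip, jp
      integer(i8) :: t0, t1, rate

      xmax = 0.0
      ipv = 1
      jpv = 1
      call system_clock(t0, rate)
      if (cols_outer) then
         do jp = 1, n             ! (J,I) order: stride-1 inner loop
            do ip = 1, n
               if (abs(a(ip, jp)) >= xmax) then
                  xmax = abs(a(ip, jp))
                  ipv = ip
                  jpv = jp
               end if
            end do
         end do
      else
         do ip = 1, n             ! (I,J) order: stride-N inner loop
            do jp = 1, n
               if (abs(a(ip, jp)) >= xmax) then
                  xmax = abs(a(ip, jp))
                  ipv = ip
                  jpv = jp
               end if
            end do
         end do
      end if
      call system_clock(t1)
      seconds = real(t1 - t0) / real(rate)
   end subroutine time_search
end module pivot_timing_demo
```

Calling `time_search` twice on the same matrix, once with `cols_outer = .false.` and once with `.true.`, gives the two timings to compare. A large N (so that the matrix does not fit in cache) and several repetitions should make the comparison more meaningful.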

Thanks for your answer

langou · 2024-06-06T18:33:34Z

fixed with #1023

Goddan-wq added the Type: Question label Jun 1, 2024

Reference-LAPACK deleted a comment from umar1908 Jun 2, 2024

Goddan-wq mentioned this issue Jun 6, 2024

changing the order of loop to improve performance #1023

Merged

langou closed this as completed Jun 6, 2024
