Optimize compression by avoiding unpredictable branches #4144

Closed

Conversation

TocarIP
Contributor

@TocarIP TocarIP commented Sep 20, 2024

Avoid an unpredictable branch: use a conditional move to generate an address that is guaranteed to be safe, and compare unconditionally. Instead of

if (idx < limit && x[idx] == val ) // mispredicted idx < limit branch

Do

addr = cmov(safe,x+idx)
if (*addr == val && idx < limit) // almost always false so well predicted
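
For illustration, a minimal C sketch of this transformation (hypothetical names, not the exact code in this PR) could look like:

#include <stddef.h>
typedef unsigned char BYTE;

/* Select an address that is always safe to read: the real candidate when idx
 * is in range, otherwise any readable fallback location. The hope is that the
 * ternary is lowered to a cmov rather than a branch. */
static const BYTE* select_addr(const BYTE* candidate, const BYTE* safe,
                               size_t idx, size_t limit)
{
    return (idx < limit) ? candidate : safe;
}

/* usage, following the pseudocode above */
const BYTE* addr = select_addr(x + idx, safe, idx, limit);
if (*addr == val && idx < limit) {  /* rarely true, so well predicted */
    /* handle the match */
}

The load from addr can then issue unconditionally, and the hard-to-predict idx < limit check only matters on the rare path where the values actually match.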

Using microbenchmarks from https://github.com/google/fleetbench, I get roughly a 10% speed-up:

name                                                         old cpu/op    new cpu/op    delta
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:15    1.46ns ± 3%   1.31ns ± 7%    -9.88%  (p=0.000 n=35+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:16    1.41ns ± 3%   1.28ns ± 3%    -9.56%  (p=0.000 n=36+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:15    1.61ns ± 1%   1.43ns ± 3%   -10.70%  (p=0.000 n=30+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:16    1.54ns ± 2%   1.39ns ± 3%    -9.21%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:15    1.82ns ± 2%   1.61ns ± 3%   -11.31%  (p=0.000 n=37+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:16    1.73ns ± 3%   1.56ns ± 3%    -9.50%  (p=0.000 n=38+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:15    2.12ns ± 2%   1.79ns ± 3%   -15.55%  (p=0.000 n=34+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:16    1.99ns ± 3%   1.72ns ± 3%   -13.70%  (p=0.000 n=38+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:15     3.22ns ± 3%   2.94ns ± 3%    -8.67%  (p=0.000 n=38+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:16     3.19ns ± 4%   2.86ns ± 4%   -10.55%  (p=0.000 n=40+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:15     2.60ns ± 3%   2.22ns ± 3%   -14.53%  (p=0.000 n=40+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:16     2.46ns ± 3%   2.13ns ± 2%   -13.67%  (p=0.000 n=39+36)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:15     2.69ns ± 3%   2.46ns ± 3%    -8.63%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:16     2.63ns ± 3%   2.36ns ± 3%   -10.47%  (p=0.000 n=40+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:15     3.20ns ± 2%   2.95ns ± 3%    -7.94%  (p=0.000 n=35+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:16     3.20ns ± 4%   2.87ns ± 4%   -10.33%  (p=0.000 n=40+40)

I've also measured the impact on internal workloads and saw a similar ~10% improvement in performance, measured by CPU usage per byte of data.

@danlark1
Contributor

danlark1 commented Sep 21, 2024

Note that we observed the largest improvements with window logs 15 and 16. Window logs of 18 and above see little to no improvement, which is reasonable given the hash table sizes.

And the ~10% improvement figure comes from running in Google production for several months.

@Cyan4973 Cyan4973 self-assigned this Sep 24, 2024
@Cyan4973
Contributor

Cyan4973 commented Sep 24, 2024

I had a look into this, and unfortunately observed a performance regression with this patch on the test system:

core i7-9700k (noturbo), Ubuntu 24.04, gcc 13.2, silesia.tar:

level   dev   this PR   difference   comment
  -3    493     484         -9
  -2    466     456        -10
  -1    421     403        -18
   1    386     371        -15
   2    290     284         -6       difference is lowered when data access pattern reaches L3 cache
   3    208     202         -6
   4    196     196          0       should be impacted by this patch but is not, possibly due to waiting for data from L3 cache
   5    115     115          0       control, should not be impacted

It might be one of those cases where the performance difference is dominated by random variations in instruction alignment, making it difficult to determine whether an observed difference is actually caused by the patch or just by the compiled binary's layout.

To mitigate the impact of a specific compiler version, I'll try to collect the performance differences on a larger set of compilers and versions.

@TocarIP
Contributor Author

TocarIP commented Sep 24, 2024

Interesting, where do I find silesia.tar to reproduce? I'm somewhat skeptical about instruction alignment here: testing each location in isolation still yields a (smaller) performance improvement, while replacing the cmov with a branch causes a performance regression, so the probability of getting better alignment in all cases seems small. But since we are talking about predictability, it obviously depends a lot on the CPU and the dataset. As an aside, I've found that zstd already does something similar (an unconditional read instead of a branch) in other cases: https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_fast.c#L779

@danlark1
Contributor

Also note gcc vs clang: at Google we exclusively use clang.

@TocarIP
Contributor Author

TocarIP commented Sep 24, 2024

Right. Also, it's a rolling-release model, so it doesn't directly correspond to clang/llvm releases.

@Cyan4973
Contributor

Cyan4973 commented Sep 24, 2024

I have more results, using a larger set of compilers,
focusing the benchmarking on level 1, which seems like a good data point to capture the impact of this PR.

core i7-9700k (noturbo), Ubuntu 24.04, silesia.tar, level 1:

compiler    dev   this PR   difference
gcc-9       378     372         -6
gcc-10      381     370        -11
gcc-11      379     372         -7
gcc-12      385     368        -17
gcc-13      390     367        -23
clang-14    379     365        -14
clang-15    372     354        -18
clang-16    380     367        -13
clang-17    376     373         -3
clang-18    381     364        -17

Unfortunately, this looks like a pretty consistent loss, across all compilers, including clang.

I'm not sure what could explain such a difference in observations.
I doubt the performance benefit depends on some novelty in llvm trunk,
so this could be due to differences in compiler settings or hardware?
The i7-9700k is a Skylake-era CPU, a well-trodden architecture with variants still deployed across a fairly large portion of active datacenters.

@danlark1
Contributor

danlark1 commented Sep 24, 2024

On clang-17 I have an anecdotal improvement after a few CLI runs and perf stat. TL;DR: some speed improvement and a branch misprediction rate about 1.5 percentage points lower.

Overall I used ./zstd --zstd=wlog=15 -b1 ~/silesia/silesia.tar. If I don't specify the wlog, the branch misprediction rate is the same.

 danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:20:02]
~ ../build_old/programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  355.0 MB/s, 1353.0 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:20:56]
~ programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  398.9 MB/s, 1110.8 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:12]
~ ../build_old/programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  349.0 MB/s, 1298.7 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:23]
~ programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  393.6 MB/s, 1011.0 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:32]
~ sudo perf stat programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  457.0 MB/s, 1357.5 MB/s

 Performance counter stats for 'programs/zstd --zstd=wlog=15 -b1 /home/danilak/silesia/silesia.tar':

          7,813.00 msec task-clock                       #    1.000 CPUs utilized             
               149      context-switches                 #   19.071 /sec                      
                 5      cpu-migrations                   #    0.640 /sec                      
               745      page-faults                      #   95.354 /sec                      
    28,338,622,044      cycles                           #    3.627 GHz                       
    78,055,510,248      instructions                     #    2.75  insn per cycle            
     7,438,789,171      branches                         #  952.104 M/sec                     
       220,283,420      branch-misses                    #    2.96% of all branches           
                        TopdownL1                 #     12.5 %  tma_backend_bound      
                                                  #     37.4 %  tma_bad_speculation    
                                                  #      4.5 %  tma_frontend_bound     
                                                  #     45.6 %  tma_retiring           

       7.816908293 seconds time elapsed

       7.725589000 seconds user
       0.087972000 seconds sys



& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:43]
~ sudo perf stat ../build_old/programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
1#silesia.tar       : 211957760 ->  79871780 (x2.654),  354.0 MB/s, 1389.2 MB/s

 Performance counter stats for '../build_old/programs/zstd --zstd=wlog=15 -b1 /home/danilak/silesia/silesia.tar':

          6,959.95 msec task-clock                       #    1.000 CPUs utilized             
                83      context-switches                 #   11.925 /sec                      
                 4      cpu-migrations                   #    0.575 /sec                      
               741      page-faults                      #  106.466 /sec                      
    29,116,934,179      cycles                           #    4.183 GHz                       
    71,758,857,772      instructions                     #    2.46  insn per cycle            
     7,105,568,042      branches                         #    1.021 G/sec                     
       324,577,377      branch-misses                    #    4.57% of all branches           
                        TopdownL1                 #     13.2 %  tma_backend_bound      
                                                  #     42.5 %  tma_bad_speculation    
                                                  #      4.8 %  tma_frontend_bound     
                                                  #     39.4 %  tma_retiring           

       6.962525456 seconds time elapsed

       6.856733000 seconds user
       0.104011000 seconds sys

@TocarIP
Contributor Author

TocarIP commented Sep 24, 2024

I'm getting similar results on a Skylake (server/workstation variant) with clang 16

cat /proc/cpuinfo
...
model name	: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
...

With ./programs/zstd -b2 --zstd=hashLog=16,windowLog=16 -c -2 silesia.tar, the branch misprediction rate (as measured by perf stat -ddd, so combined for compress + decompress) goes from 4.4% to 2.9%, and speed goes from 370 MB/s to 420 MB/s.
Without the extra parameters, speed is slightly worse: 395 vs 400 MB/s. I'll take a look at level 1, where the regression is worse.

@Cyan4973
Contributor

Cyan4973 commented Sep 25, 2024

OK,
so the optimization is tied to branch misprediction rates,
and one way to influence the misprediction rates is to reduce the window size.

I presume that reducing window size makes the algorithm less likely to find matches within range, thus making the targeted branch less predictable during search.

Indeed, if I enforce a smaller window size, using --zstd=wlog=15, I can see this patch producing significant compression speed gains.

It adds a layer of complexity, as this patch seems beneficial for "some" scenarios (small window sizes, on x64 systems), but unfortunately detrimental for others (larger files, default level settings).

@Cyan4973
Contributor

Cyan4973 commented Sep 25, 2024

Note:
As an observation, the performance of this patch relies heavily on the assembly section:
without it, the new code runs at roughly the same speed as the current dev branch, despite the logic change.

This seems to indicate that the compiler is unable to generate a cmov from the portable C code.
Yet, when ZSTD_selectAddr() is compiled as an isolated function (on godbolt, for example), all compilers generate a cmov without any problem; it's once the function is inlined into the main function body that their choices differ.
Also, clang-14 specifically (and not later clang versions) generates a binary with performance similar to the hand-written assembly, underlining that compilers could translate the portable scalar code the same way; they just generally don't.
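
For reference, a hand-written x86-64 cmov of the kind discussed here might look roughly as follows (a sketch of the idea using GCC/Clang inline-asm syntax, not the actual ZSTD_selectAddr() implementation; U32/BYTE as in zstd's headers):

static const BYTE* select_addr_cmov(const BYTE* candidate, const BYTE* safe,
                                    U32 idx, U32 lowLimit)
{
    const BYTE* ptr = candidate;
#if defined(__GNUC__) && defined(__x86_64__)
    /* if (idx < lowLimit) ptr = safe;  expressed directly as a cmov,
     * so the selection cannot be turned back into a branch */
    __asm__("cmp %1, %2\n\t"
            "cmovb %3, %0"
            : "+r"(ptr)
            : "r"(lowLimit), "r"(idx), "r"(safe)
            : "cc");
#else
    ptr = (idx < lowLimit) ? safe : candidate;  /* portable fallback; may or may not become a cmov */
#endif
    return ptr;
}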

@Cyan4973
Contributor

Cyan4973 commented Sep 25, 2024

Small follow up:
since the compiler is able to generate a cmov instruction when the function ZSTD_selectAddr() is not inlined,
I tried an experiment where ZSTD_selectAddr() is forced to not be inlined.

And it worked: the non-inlined ZSTD_selectAddr() gets ~90% of the speed benefit of the hand-crafted assembly version.

There is still a small cost from jumping in and out of the function, so the hand-crafted assembly remains faster since it avoids that cost.
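
In portable C, the experiment presumably amounts to something like this (a sketch, assuming GCC/Clang function attributes):

/* Kept out-of-line, compilers readily lower the ternary to a cmov;
 * the same ternary tends to become a branch once inlined into the hot loop. */
__attribute__((noinline))
static const BYTE* select_addr_noinline(const BYTE* candidate, const BYTE* safe,
                                        U32 idx, U32 lowLimit)
{
    return (idx >= lowLimit) ? candidate : safe;
}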

Note 2:
since the noinline trick makes it possible to implement the same optimization on non-x64 systems, I tested it on an M1 (aarch64) system,
and it indeed provided a boost to performance, similar to the one observed on x64.

@TocarIP
Contributor Author

TocarIP commented Sep 25, 2024

Adding arm assembly shouldn't be too hard, if we are ok with having assembly
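
For example, an aarch64 counterpart to the x86-64 cmov could presumably use csel, along these lines (an untested sketch, not part of this PR):

#if defined(__GNUC__) && defined(__aarch64__)
static const BYTE* select_addr_csel(const BYTE* candidate, const BYTE* safe,
                                    U32 idx, U32 lowLimit)
{
    const BYTE* ptr = candidate;
    /* if (idx < lowLimit) ptr = safe;  via a conditional select */
    __asm__("cmp %w1, %w2\n\t"
            "csel %0, %3, %0, lo"
            : "+r"(ptr)
            : "r"(idx), "r"(lowLimit), "r"(safe)
            : "cc");
    return ptr;
}
#endif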

@Cyan4973
Contributor

Adding arm assembly shouldn't be too hard, if we are ok with having assembly

In general, we try to avoid assembly whenever it's not strictly necessary,
because it tends to make maintenance more difficult (building on less common compilers, future evolutions, etc.).

That being said, sometimes it is indeed necessary,
so this should not be read as an absolute rule; we have to exercise judgment.

@Cyan4973
Contributor

Cyan4973 commented Oct 3, 2024

After looking into it a bit more,
I could not find a C-only construct that consistently gets the compiler to generate a cmov instead of an unpredictable branch. Some solutions work on some compilers but not others, or the outcome depends on the version or the flags, or a cmov is generated but badly inlined, and so on.

So I guess we'll go the inline assembly route after all.

@Cyan4973
Contributor

Cyan4973 commented Oct 8, 2024

An interesting observation:

If I replace the test:

if (MEM_read32(ip0) == mval && idx >= prefixStartIndex) {

into:

if ((MEM_read32(ip0) == mval) & (idx >= prefixStartIndex)) {

(note how the && is converted into a & because both sides are 0/1),
then performance on clang tanks,
while it remains fine on gcc.

That in itself is not a big deal: we could keep &&, and satisfy both compilers.

But when I extract the test into its own function, returning a 0/1 boolean signal,
so that the branch becomes just:

if (ZSTD_matchFound(...)) {

then performance on clang tanks by the same amount,
while gcc still performs great.
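
(For concreteness, the extracted helper was presumably along these lines; the exact signature here is illustrative only:)

static int ZSTD_matchFound(const BYTE* ip, U32 mval, U32 idx, U32 prefixStartIndex)
{
    /* returns a 0/1 boolean signal that the caller then branches on */
    return (MEM_read32(ip) == mval) && (idx >= prefixStartIndex);
}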

So far, I've been unable to find out why.
It looks like a bug or limitation in the clang optimizer.

I'll show in another PR why this matters and why it's currently a blocker.

@Cyan4973
Contributor

Cyan4973 commented Oct 10, 2024

Thanks @TocarIP,
this work has been integrated into #4165.

@Cyan4973 Cyan4973 closed this Oct 10, 2024
@terrelln
Contributor

@TocarIP and @danlark1, please let us know if PR #4165 is performing as well as this PR internally. We expect it to perform better, because of the improved speed on larger blocks. But if there are any performance regressions, we want to know so they can be fixed.

@danlark1
Contributor

@TocarIP and @danlark1, please let us know if PR #4165 is performing as well as this PR internally. We expect it to perform better, because of the improved speed on larger blocks. But if there are any performance regressions, we want to know so they can be fixed.

We've tried it internally and it works for us, with the same speedup for smaller blocks. Thanks a lot for picking this up and making it work the right way for all cases!

@terrelln
Contributor

Thanks for the awesome patch!
