Optimize compression by avoiding unpredictable branches #4144

Closed

Conversation

TocarIP
Contributor

@TocarIP TocarIP commented Sep 20, 2024

Avoid an unpredictable branch: use a conditional move to generate an address that is guaranteed to be safe, and compare unconditionally. Instead of

if (idx < limit && x[idx] == val ) // mispredicted idx < limit branch

Do

addr = cmov(safe,x+idx)
if (*addr == val && idx < limit) // almost always false so well predicted
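
For illustration, a minimal C sketch of this transformation (hypothetical names, not the exact code in this PR) could look like:

#include <stddef.h>
typedef unsigned char BYTE;

/* Select an address that is always safe to read: the real candidate when idx
 * is in range, otherwise any readable fallback location. The hope is that the
 * ternary is lowered to a cmov rather than a branch. */
static const BYTE* select_addr(const BYTE* candidate, const BYTE* safe,
                               size_t idx, size_t limit)
{
    return (idx < limit) ? candidate : safe;
}

/* usage, following the pseudocode above */
const BYTE* addr = select_addr(x + idx, safe, idx, limit);
if (*addr == val && idx < limit) {  /* rarely true, so well predicted */
    /* handle the match */
}

The load from addr can then issue unconditionally, and the hard-to-predict idx < limit check only matters on the rare path where the values actually match.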

Using microbenchmarks from https://github.com/google/fleetbench, I get roughly a 10% speed-up:

name                                                         old cpu/op    new cpu/op    delta
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:15    1.46ns ± 3%   1.31ns ± 7%    -9.88%  (p=0.000 n=35+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:16    1.41ns ± 3%   1.28ns ± 3%    -9.56%  (p=0.000 n=36+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:15    1.61ns ± 1%   1.43ns ± 3%   -10.70%  (p=0.000 n=30+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:16    1.54ns ± 2%   1.39ns ± 3%    -9.21%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:15    1.82ns ± 2%   1.61ns ± 3%   -11.31%  (p=0.000 n=37+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:16    1.73ns ± 3%   1.56ns ± 3%    -9.50%  (p=0.000 n=38+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:15    2.12ns ± 2%   1.79ns ± 3%   -15.55%  (p=0.000 n=34+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:16    1.99ns ± 3%   1.72ns ± 3%   -13.70%  (p=0.000 n=38+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:15     3.22ns ± 3%   2.94ns ± 3%    -8.67%  (p=0.000 n=38+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:16     3.19ns ± 4%   2.86ns ± 4%   -10.55%  (p=0.000 n=40+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:15     2.60ns ± 3%   2.22ns ± 3%   -14.53%  (p=0.000 n=40+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:16     2.46ns ± 3%   2.13ns ± 2%   -13.67%  (p=0.000 n=39+36)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:15     2.69ns ± 3%   2.46ns ± 3%    -8.63%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:16     2.63ns ± 3%   2.36ns ± 3%   -10.47%  (p=0.000 n=40+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:15     3.20ns ± 2%   2.95ns ± 3%    -7.94%  (p=0.000 n=35+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:16     3.20ns ± 4%   2.87ns ± 4%   -10.33%  (p=0.000 n=40+40)

I've also measured the impact on internal workloads and saw a similar ~10% improvement in performance, measured by CPU usage per byte of data.

@danlark1
Contributor

danlark1 commented Sep 21, 2024

Note that we observed the largest improvements with window logs 15 and 16. Window logs of 18 and above see little to no improvement, which is reasonable given the hash table sizes.

And the ~10% improvement figure comes from running in Google production for several months.

@Cyan4973 Cyan4973 self-assigned this Sep 24, 2024
@Cyan4973
Contributor

Cyan4973 commented Sep 24, 2024

I had a look into this, and unfortunately observed a performance regression with this patch on the test system:

core i7-9700k (noturbo), Ubuntu 24.04, gcc 13.2, silesia.tar:

level   dev   this PR   difference   comment
  -3    493     484         -9
  -2    466     456        -10
  -1    421     403        -18
   1    386     371        -15
   2    290     284         -6       difference is lowered when data access pattern reaches L3 cache
   3    208     202         -6
   4    196     196          0       should be impacted by this patch but is not, possibly due to waiting for data from L3 cache
   5    115     115          0       control, should not be impacted

It might be one of those cases where the performance difference is dominated by random variations in instruction alignment, making it difficult to determine whether an observed difference is actually caused by the patch or just by the compiled binary's layout.

To mitigate the impact of a specific compiler version, I'll try to collect the performance differences on a larger set of compilers and versions.

@TocarIP
Contributor Author

TocarIP commented Sep 24, 2024

Interesting, where do I find silesia.tar to reproduce? I'm somewhat skeptical about instruction alignment here: testing each location in isolation still yields a (smaller) performance improvement, while replacing the cmov with a branch causes a performance regression, so the probability of getting better alignment in all cases seems small. But since we are talking about predictability, it obviously depends a lot on the CPU and the dataset. As an aside, I've found that zstd already does something similar (an unconditional read instead of a branch) in other cases: https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_fast.c#L779

@danlark1
Contributor

Also note gcc vs clang: at Google we exclusively use clang.

@TocarIP
Contributor Author

TocarIP commented Sep 24, 2024

Right. Also, it's a rolling-release model, so it doesn't directly correspond to clang/llvm releases.

@Cyan4973
Contributor

Cyan4973 commented Sep 24, 2024

I have more results, using a larger set of compilers,
focusing the benchmarking on level 1, which seems like a good data point to capture the impact of this PR.

core i7-9700k (noturbo), Ubuntu 24.04, silesia.tar, level 1:

compiler    dev   this PR   difference
gcc-9       378     372         -6
gcc-10      381     370        -11
gcc-11      379     372         -7
gcc-12      385     368        -17
gcc-13      390     367        -23
clang-14    379     365        -14
clang-15    372     354        -18
clang-16    380     367        -13
clang-17    376     373         -3
clang-18    381     364        -17

Unfortunately, this looks like a pretty consistent loss, across all compilers, including clang.

I'm not sure what could explain such a difference in observations.
I doubt the performance benefit depends on some novelty in llvm trunk,
so this could be due to differences in compiler settings or hardware?
The i7-9700k is a Skylake-era CPU, a well-trodden architecture with variants still deployed across a fairly large portion of active datacenters.

@danlark1
Contributor

danlark1 commented Sep 24, 2024

On clang-17 I have an anecdotal improvement after a few CLI runs and perf stat. TL;DR: some speed improvement and a branch misprediction rate about 1.5 percentage points lower.

Overall I used ./zstd --zstd=wlog=15 -b1 ~/silesia/silesia.tar. If I don't specify the wlog, the branch misprediction rate is the same.

 danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:20:02]
~ ../build_old/programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  355.0 MB/s, 1353.0 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:20:56]
~ programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  398.9 MB/s, 1110.8 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:12]
~ ../build_old/programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  349.0 MB/s, 1298.7 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:23]
~ programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  393.6 MB/s, 1011.0 MB/s

& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:32]
~ sudo perf stat programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
 1#silesia.tar       : 211957760 ->  79871780 (x2.654),  457.0 MB/s, 1357.5 MB/s

 Performance counter stats for 'programs/zstd --zstd=wlog=15 -b1 /home/danilak/silesia/silesia.tar':

          7,813.00 msec task-clock                       #    1.000 CPUs utilized             
               149      context-switches                 #   19.071 /sec                      
                 5      cpu-migrations                   #    0.640 /sec                      
               745      page-faults                      #   95.354 /sec                      
    28,338,622,044      cycles                           #    3.627 GHz                       
    78,055,510,248      instructions                     #    2.75  insn per cycle            
     7,438,789,171      branches                         #  952.104 M/sec                     
       220,283,420      branch-misses                    #    2.96% of all branches           
                        TopdownL1                 #     12.5 %  tma_backend_bound      
                                                  #     37.4 %  tma_bad_speculation    
                                                  #      4.5 %  tma_frontend_bound     
                                                  #     45.6 %  tma_retiring           

       7.816908293 seconds time elapsed

       7.725589000 seconds user
       0.087972000 seconds sys



& danilak @ danilak1 in ~/zstd/build/build_new (dev …1) 0 [23:21:43]
~ sudo perf stat ../build_old/programs/zstd --zstd=wlog=15 -b1  ~/silesia/silesia.tar
1#silesia.tar       : 211957760 ->  79871780 (x2.654),  354.0 MB/s, 1389.2 MB/s

 Performance counter stats for '../build_old/programs/zstd --zstd=wlog=15 -b1 /home/danilak/silesia/silesia.tar':

          6,959.95 msec task-clock                       #    1.000 CPUs utilized             
                83      context-switches                 #   11.925 /sec                      
                 4      cpu-migrations                   #    0.575 /sec                      
               741      page-faults                      #  106.466 /sec                      
    29,116,934,179      cycles                           #    4.183 GHz                       
    71,758,857,772      instructions                     #    2.46  insn per cycle            
     7,105,568,042      branches                         #    1.021 G/sec                     
       324,577,377      branch-misses                    #    4.57% of all branches           
                        TopdownL1                 #     13.2 %  tma_backend_bound      
                                                  #     42.5 %  tma_bad_speculation    
                                                  #      4.8 %  tma_frontend_bound     
                                                  #     39.4 %  tma_retiring           

       6.962525456 seconds time elapsed

       6.856733000 seconds user
       0.104011000 seconds sys

@TocarIP
Contributor Author

TocarIP commented Sep 24, 2024

I'm getting similar results on a Skylake (server/workstation variant) with clang 16

cat /proc/cpuinfo
...
model name	: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
...

With ./programs/zstd -b2 --zstd=hashLog=16,windowLog=16 -c -2 silesia.tar, the branch misprediction rate (as measured by perf stat -ddd, so combined for compress + decompress) goes from 4.4% to 2.9%, and speed goes from 370 MB/s to 420 MB/s.
Without the extra parameters, speed is slightly worse: 395 vs 400 MB/s. I'll take a look at level 1, where the regression is worse.

@Cyan4973
Contributor

Cyan4973 commented Sep 25, 2024

OK,
so the optimization is tied to branch misprediction rates,
and one way to influence the misprediction rates is to reduce the window size.

I presume that reducing window size makes the algorithm less likely to find matches within range, thus making the targeted branch less predictable during search.

Indeed, if I enforce a smaller window size, using --zstd=wlog=15, I can see this patch producing significant compression speed gains.

It adds a layer of complexity, as this patch seems beneficial for "some" scenarios (small window sizes, on x64 systems), but unfortunately detrimental for others (larger files, default level settings).

@Cyan4973
Contributor

Cyan4973 commented Sep 25, 2024

Note:
As an observation, the performance of this patch relies heavily on the assembly section:
without it, the new code runs at roughly the same speed as the current dev branch, despite the logic change.

This seems to indicate that the compiler is unable to generate a cmov from the portable C code.
Yet, when ZSTD_selectAddr() is compiled as an isolated function (on godbolt, for example), all compilers generate a cmov without any problem; it's once the function is inlined into the main function body that their choices differ.
Also, clang-14 specifically (and not later clang versions) generates a binary with performance similar to the hand-written assembly, underlining that compilers could translate the portable scalar code the same way; they just generally don't.
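
For reference, a hand-written x86-64 cmov of the kind discussed here might look roughly as follows (a sketch of the idea using GCC/Clang inline-asm syntax, not the actual ZSTD_selectAddr() implementation; U32/BYTE as in zstd's headers):

static const BYTE* select_addr_cmov(const BYTE* candidate, const BYTE* safe,
                                    U32 idx, U32 lowLimit)
{
    const BYTE* ptr = candidate;
#if defined(__GNUC__) && defined(__x86_64__)
    /* if (idx < lowLimit) ptr = safe;  expressed directly as a cmov,
     * so the selection cannot be turned back into a branch */
    __asm__("cmp %1, %2\n\t"
            "cmovb %3, %0"
            : "+r"(ptr)
            : "r"(lowLimit), "r"(idx), "r"(safe)
            : "cc");
#else
    ptr = (idx < lowLimit) ? safe : candidate;  /* portable fallback; may or may not become a cmov */
#endif
    return ptr;
}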

@Cyan4973
Contributor

Cyan4973 commented Sep 25, 2024

Small follow up:
since the compiler is able to generate a cmov instruction when the function ZSTD_selectAddr() is not inlined,
I tried an experiment where ZSTD_selectAddr() is forced to not be inlined.

And it worked: the non-inlined ZSTD_selectAddr() gets ~90% of the speed benefit of the hand-crafted assembly version.

There is still a small cost from jumping in and out of the function, so the hand-crafted assembly remains faster since it avoids that cost.
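
In portable C, the experiment presumably amounts to something like this (a sketch, assuming GCC/Clang function attributes):

/* Kept out-of-line, compilers readily lower the ternary to a cmov;
 * the same ternary tends to become a branch once inlined into the hot loop. */
__attribute__((noinline))
static const BYTE* select_addr_noinline(const BYTE* candidate, const BYTE* safe,
                                        U32 idx, U32 lowLimit)
{
    return (idx >= lowLimit) ? candidate : safe;
}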

Note 2:
since the noinline trick makes it possible to implement the same optimization on non-x64 systems, I tested it on an M1 (aarch64) system,
and it indeed provided a boost to performance, similar to the one observed on x64.

@TocarIP
Contributor Author

TocarIP commented Sep 25, 2024

Adding arm assembly shouldn't be too hard, if we are ok with having assembly
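
For example, an aarch64 counterpart to the x86-64 cmov could presumably use csel, along these lines (an untested sketch, not part of this PR):

#if defined(__GNUC__) && defined(__aarch64__)
static const BYTE* select_addr_csel(const BYTE* candidate, const BYTE* safe,
                                    U32 idx, U32 lowLimit)
{
    const BYTE* ptr = candidate;
    /* if (idx < lowLimit) ptr = safe;  via a conditional select */
    __asm__("cmp %w1, %w2\n\t"
            "csel %0, %3, %0, lo"
            : "+r"(ptr)
            : "r"(idx), "r"(lowLimit), "r"(safe)
            : "cc");
    return ptr;
}
#endif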

@Cyan4973
Contributor

Adding arm assembly shouldn't be too hard, if we are ok with having assembly

In general, we try to avoid assembly whenever it's not strictly necessary,
because it tends to make maintenance more difficult (building on less common compilers, future evolutions, etc.).

That being said, sometimes it is indeed necessary,
so this should not be read as an absolute rule; we have to exercise judgment.

@Cyan4973
Contributor

Cyan4973 commented Oct 3, 2024

After looking into it a bit more,
I could not find a C-only construct that consistently gets the compiler to generate a cmov instead of an unpredictable branch. Some solutions work on some compilers but not others, or the outcome depends on the version or the flags, or a cmov is generated but badly inlined, and so on.

So I guess we'll go the inline assembly route after all.

@Cyan4973
Contributor

Cyan4973 commented Oct 8, 2024

An interesting observation:

If I replace the test:

if (MEM_read32(ip0) == mval && idx >= prefixStartIndex) {

into:

if ((MEM_read32(ip0) == mval) & (idx >= prefixStartIndex)) {

(note how the && is converted into a & because both sides are 0/1),
then performance on clang tanks,
while it remains fine on gcc.

That in itself is not a big deal: we could keep &&, and satisfy both compilers.

But when I extract the test into its own function, returning a 0/1 boolean signal,
so that the branch becomes just:

if (ZSTD_matchFound(...)) {

then performance on clang tanks by the same amount,
while gcc still performs great.
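
(For concreteness, the extracted helper was presumably along these lines; the exact signature here is illustrative only:)

static int ZSTD_matchFound(const BYTE* ip, U32 mval, U32 idx, U32 prefixStartIndex)
{
    /* returns a 0/1 boolean signal that the caller then branches on */
    return (MEM_read32(ip) == mval) && (idx >= prefixStartIndex);
}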

So far, I've been unable to find out why.
It looks like a bug or limitation in the clang optimizer.

I'll show in another PR why this matters and why it's currently a blocker.

@Cyan4973
Contributor

Cyan4973 commented Oct 10, 2024

Thanks @TocarIP,
this work has been integrated into #4165.

@Cyan4973 Cyan4973 closed this Oct 10, 2024
@terrelln
Contributor

@TocarIP and @danlark1, please let us know if PR #4165 is performing as well as this PR internally. We expect it to perform better, because of the improved speed on larger blocks. But if there are any performance regressions, we want to know so they can be fixed.

@danlark1
Contributor

@TocarIP and @danlark1, please let us know if PR #4165 is performing as well as this PR internally. We expect it to perform better, because of the improved speed on larger blocks. But if there are any performance regressions, we want to know so they can be fixed.

We've tried it internally and it works for us, with the same speedup for smaller blocks. Thanks a lot for picking this up and making it work the right way for all cases!

@terrelln
Contributor

Thanks for the awesome patch!
