huff0: decompress directly into output #577

WojciechMula · 2022-05-05T12:30:27Z

Solves #576.
Benchmarks on Skylake and Ice Lake machines didn't show any boost. (On my very old laptop there's an improvement). It's reasonable, as perf report for tests shows that 95% of the time is spent in decompress routines. But benchmarks of zstd show some improvement.

I'm benchmarking on AWS machines and they are not that reliable for microbenchmarks. They can show a few per cent differences for two subsequent runs of the same code.

The Go implementation uses an auxiliary buffer to avoid extensive bound checks. In the asm implementation we know in advance the output buffer is big enough, so it's possible to write directly to that buffer.

klauspost · 2022-05-06T12:51:18Z

@WojciechMula We can just abandon #743

I fixed out-of-bounds writes, since we didn't check if we exceeded the write buffer.
I re-added the separate fill routine from the other PR.
bufferOrigin can be a register in decompress4x_main_loop_amd64.
exhausted doesn't use high/low bytes anymore. This can be slow on older CPUs.
Fixed build tags

Please check out/upgrade to: https://gist.github.com/klauspost/306ed219fb5e275d7c9abb87c9a7b142

WojciechMula · 2022-05-06T13:03:47Z

@klauspost I agree, this change is quite small, there's no point in having to PRs.

Thanks for the fixes, will incorporate them.

(Added you as a collabolator to my fork, so you don't have to play with gists. Sorry, I should have done it earlier.)

By Klaus Post (https://gist.github.com/klauspost/306ed219fb5e275d7c9abb87c9a7b142)

WojciechMula · 2022-05-06T13:19:09Z

Regarding the build tags: seems that avo invoked for Go 1.18 does not include the old build tags. Yesterday I tried to figure it out in the avo sources but didn't find where it's checked. I'll recheck that in the evening.

huff0/decompress_amd64.go

klauspost · 2022-05-06T14:13:39Z

Aside from the crash from above fuzz tests are running fine 👍🏼

klauspost · 2022-05-06T16:13:45Z

FWIW I tried reading 56 bits and doing decode 4 and 6 values for decompress4x_main_loop_amd64 and decompress4x_8b_main_loop_amd64 respectively.

Reading 6 instead of 4 values is significantly slower (~20% deg).

The only gain is BenchmarkDecompress4XNoTable/twain/262143-32 699.10 814.60 1.17x and BenchmarkDecompress4XNoTable/gettysburg/262143-32 799.28 822.87 1.03x reading 4 values instead of 2 per loop.

klauspost

benchmark                                                   old ns/op     new ns/op     delta
BenchmarkDecompress4XNoTable/digits/100-32                  334           336           +0.66%
BenchmarkDecompress4XNoTable/digits/10000-32                10835         9562          -11.75%
BenchmarkDecompress4XNoTable/digits/262143-32               303585        270811        -10.80%
BenchmarkDecompress4XNoTable/gettysburg/100-32              285           286           +0.56%
BenchmarkDecompress4XNoTable/gettysburg/10000-32            11393         10268         -9.87%
BenchmarkDecompress4XNoTable/gettysburg/262143-32           327973        289561        -11.71%
BenchmarkDecompress4XNoTable/twain/100-32                   331           330           -0.27%
BenchmarkDecompress4XNoTable/twain/10000-32                 11458         10235         -10.67%
BenchmarkDecompress4XNoTable/twain/262143-32                374970        345400        -7.89%
BenchmarkDecompress4XNoTable/low-ent.10k/100-32             367           371           +1.01%
BenchmarkDecompress4XNoTable/low-ent.10k/10000-32           10812         9398          -13.08%
BenchmarkDecompress4XNoTable/low-ent.10k/262143-32          256684        221666        -13.64%
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-32     256839        221322        -13.83%
BenchmarkDecompress4XNoTable/case1/100-32                   318           322           +1.23%
BenchmarkDecompress4XNoTable/case1/10000-32                 10803         9562          -11.49%
BenchmarkDecompress4XNoTable/case1/262143-32                277377        242147        -12.70%
BenchmarkDecompress4XNoTable/case2/100-32                   345           340           -1.62%
BenchmarkDecompress4XNoTable/case2/10000-32                 10659         9473          -11.13%
BenchmarkDecompress4XNoTable/case2/262143-32                268723        236376        -12.04%
BenchmarkDecompress4XNoTable/case3/100-32                   333           336           +0.99%
BenchmarkDecompress4XNoTable/case3/10000-32                 10737         9357          -12.85%
BenchmarkDecompress4XNoTable/case3/262143-32                272268        239011        -12.21%
BenchmarkDecompress4XNoTable/pngdata.001/100-32             361           350           -2.99%
BenchmarkDecompress4XNoTable/pngdata.001/10000-32           11583         10740         -7.28%
BenchmarkDecompress4XNoTable/pngdata.001/262143-32          306257        279850        -8.62%
BenchmarkDecompress4XNoTable/normcount2/100-32              287           289           +0.80%
BenchmarkDecompress4XNoTable/normcount2/10000-32            10832         9696          -10.49%
BenchmarkDecompress4XNoTable/normcount2/262143-32           279908        247688        -11.51%
BenchmarkDecompress4XNoTableTableLog8/digits-32             107990        96969         -10.21%

huff0/_generate/gen.go

Co-authored-by: Klaus Post <klauspost@gmail.com>

WojciechMula · 2022-05-06T17:47:02Z

Your benchmarks look really good! And match expectations... I'm still puzzled by the regressions observed on my Ice Lake machine.

Co-authored-by: Klaus Post <klauspost@gmail.com>

Presence of this tag depends on the Go version on which `go generate` was run, see mmcloughlin/avo#183.

WojciechMula · 2022-05-06T18:26:31Z

Regarding the build tags: seems that avo invoked for Go 1.18 does not include the old build tags. Yesterday I tried to figure it out in the avo sources but didn't find where it's checked. I'll recheck that in the evening.

@klauspost mmcloughlin/avo#183. It would be great if these settings may be overridden in runtime.

mmcloughlin · 2022-05-08T22:24:37Z

Can you add this to the generate workflow in Github actions?

klauspost · 2022-05-09T07:29:34Z

Can you add this to the generate workflow in Github actions?

Adding in #582

WojciechMula added 2 commits May 5, 2022 11:20

huff0: translate asm implementation into avo program

613a0f3

huff0: decompress directly into output buffer

d5cf454

The Go implementation uses an auxiliary buffer to avoid extensive bound checks. In the asm implementation we know in advance the output buffer is big enough, so it's possible to write directly to that buffer.

WojciechMula mentioned this pull request May 6, 2022

huff0: translate asm implementation into avo program #543

Closed

Fixes out-of-bound error and simplified code

ad057f6

By Klaus Post (https://gist.github.com/klauspost/306ed219fb5e275d7c9abb87c9a7b142)

klauspost reviewed May 6, 2022

View reviewed changes

huff0/decompress_amd64.go Outdated Show resolved Hide resolved

klauspost reviewed May 6, 2022

View reviewed changes

huff0/_generate/gen.go Outdated Show resolved Hide resolved

huff0/_generate/gen.go Outdated Show resolved Hide resolved

huff0/_generate/gen.go Outdated Show resolved Hide resolved

Update huff0/decompress_amd64.go

f80d285

Co-authored-by: Klaus Post <klauspost@gmail.com>

WojciechMula and others added 8 commits May 6, 2022 19:47

Update huff0/_generate/gen.go

960fd97

Co-authored-by: Klaus Post <klauspost@gmail.com>

Update huff0/_generate/gen.go

1eb237b

Co-authored-by: Klaus Post <klauspost@gmail.com>

Update huff0/_generate/gen.go

4023afc

Co-authored-by: Klaus Post <klauspost@gmail.com>

Regenerated asm sources

170b86e

Drop BMI2 support

58c3620

Update comments

ad9726f

[skip ci] Move ctx into if statement

e543746

Manually add +build tag

a070d8e

Presence of this tag depends on the Go version on which `go generate` was run, see mmcloughlin/avo#183.

klauspost merged commit e7c028f into klauspost:master May 7, 2022

WojciechMula deleted the huff0-decompress-directly-into-output branch May 9, 2022 12:33

WojciechMula mentioned this pull request May 13, 2022

huff0: provide asm implementation of Decompress1X #595

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

huff0: decompress directly into output #577

huff0: decompress directly into output #577

WojciechMula commented May 5, 2022 •

edited

Loading

klauspost commented May 6, 2022 •

edited

Loading

WojciechMula commented May 6, 2022 •

edited

Loading

WojciechMula commented May 6, 2022

klauspost commented May 6, 2022

klauspost commented May 6, 2022 •

edited

Loading

klauspost left a comment •

edited

Loading

WojciechMula commented May 6, 2022

WojciechMula commented May 6, 2022

mmcloughlin commented May 8, 2022

klauspost commented May 9, 2022

huff0: decompress directly into output #577

huff0: decompress directly into output #577

Conversation

WojciechMula commented May 5, 2022 • edited Loading

klauspost commented May 6, 2022 • edited Loading

WojciechMula commented May 6, 2022 • edited Loading

WojciechMula commented May 6, 2022

klauspost commented May 6, 2022

klauspost commented May 6, 2022 • edited Loading

klauspost left a comment • edited Loading

Choose a reason for hiding this comment

WojciechMula commented May 6, 2022

WojciechMula commented May 6, 2022

mmcloughlin commented May 8, 2022

klauspost commented May 9, 2022

WojciechMula commented May 5, 2022 •

edited

Loading

klauspost commented May 6, 2022 •

edited

Loading

WojciechMula commented May 6, 2022 •

edited

Loading

klauspost commented May 6, 2022 •

edited

Loading

klauspost left a comment •

edited

Loading