Replace ASM with CryptOpt generated #1329

dderjoel · 2023-05-25T05:02:29Z

In this PR the auto detection of x86_64 is removed and the configure file will default to the C implementation.
Further, if asm has explicitly been requested with --with-asm=x86_64,
the selfcheck method check by calling the cpuid instruction, if the flags BMI2 (bit 8) and ADX (bit 19) are set, and will exit early, if not.

src/third_party/field_5x52_asm_impl_cryptopt.h

dderjoel · 2023-05-25T08:53:12Z

Can someone help with those two errors?
I can see that those runs exit with segmentation faults, but I cannot reproduce them on my machine.
How do you typically debug those (I have no experience with Cirrus.CI). Is there a Docker container that I can clone, or the binary I can run in a debugger locally?

real-or-random · 2023-05-25T09:46:09Z

Is there a Docker container that I can clone,

You can build the Docker image locally using docker build . -f ci/linux-debian.Dockerfile (add --no-cache --pull to force rebuilds).

real-or-random · 2023-05-25T09:49:39Z

But what's special about these builds is that they have WIDEMUL:int128.., Try ./configure --with-test-override-wide-multiply=int128.

src/third_party/field_5x52_asm_impl_cryptopt.h

dderjoel · 2023-05-29T02:29:01Z

But what's special about these builds is that they have WIDEMUL:int128.., Try ./configure --with-test-override-wide-multiply=int128.

I actually did that locally, too but could not reproduce the issue. What worked was building the container and run the cirrus script in there with the given environment, as per cirrus logs.

I could then reproduce and fix the issue, which I believe depends a bit on the compiler version.
Essentially, the problem was that the assembly code relies on being a leaf function, i.e. assumes rsp pointing to the end of the stack, can use the red zone, will spill callee-save registers when needed and unspill them when done. As the code is potentially inlined (as in those tests), those assumptions are incorrect.

I've changed the code by hand to not spill into rsp-based memory whatsoever, and instead use three tmp variables like the old code. Additionally, I used the +D and +d modifiers for the arguments.

dderjoel · 2023-05-29T04:15:47Z

The currently failing tests don't seem to be related to the field arithmetic. Instead, they seem to fail at the self test when calling ./ctime.
As far as I can see, it fails gracefully after the secp256k1_selftest_passes returns 0, because the required CPU flags are not available. This however, does not seem to be caused by missing cpuflags, but rather because of running CPUID in valgrind.

https://cirrus-ci.com/task/4534411765481472?logs=test#L237

==3831== Process terminating with default action of signal 6 (SIGABRT): dumping core
==3831== at 0x49B2CE1: raise (raise.c:51)
==3831== by 0x499C536: abort (abort.c:79)
==3831== by 0x484B2E3: secp256k1_default_error_callback_fn (util.h:72)
==3831== by 0x4852C95: secp256k1_callback_call (util.h:60)
==3831== by 0x4852C95: secp256k1_selftest (secp256k1.c:87)
==3831== by 0x4852C95: secp256k1_selftest (secp256k1.c:85)
==3831== by 0x4852D9F: secp256k1_context_preallocated_create (secp256k1.c:121)
==3831== by 0x4852E37: secp256k1_context_create (secp256k1.c:143)
==3831== by 0x1091F2: main (ctime_tests.c:45)

dderjoel · 2023-05-29T05:30:31Z

Nice, so I've changed the selftest such that it will not check the CPU flags, if it is running in valgrind. I've used SECP256K1_CHECKMEM_RUNNING for this, which I believe is also used in ./ctime_tests.c to early abort if it is not running in valgrind.

dderjoel · 2023-05-29T08:04:38Z

Looks good to me now actually, but there are a couple points that I should point out:

CPUID in the self test is not checked when using valgrind.
The spilling / unspilling of callee-saved registers is removed from the assembly implementations and with
the spilling of three intermediate values is now handled by CC (using %q0..%q2). This, means that the code as it stands there, is not formally verified, because of the two changed just give. (It was not 100% before either, because we converted the verified Intel assembly to att syntax (by assembling though nasm to objdumping it again as att)).
Using --with-asm=auto or not specifying that at all will now default to --with-asm=no. If --with-asm=x86-64 is explicitly set, the CryptOpt generated version is used. The self check checks for presence of BMI2 and ADX flags and errors gracefully if not.

src/third_party/field_5x52_asm_impl_cryptopt.h

dderjoel · 2023-06-07T22:38:13Z

Is there anything I can help with for this PR? What would the next steps be?

dderjoel · 2023-06-17T15:32:09Z

Hey, we’d love to mention this at an upcoming conference (acknowledging your support; top event in heuristic optimization), but for that it would have to be merged by 28 June. Of course, we understand that you need to be convinced of the usefulness of the verified and optimized... and of course we understand that you have your own workflows and deadlines, too. Still, if there is anything we can do to push this further, please do let us know.

jonasnick · 2023-06-20T14:51:47Z

@dderjoel A common strategy to find reviewers is to convince them that the benefits of the PR are worth the review time. The PR should be as simple as possible to review. Squash the commits into a single or more logical commits. Provide some context and guidance on the best way to review the changes. It seems to me that reviewing this is difficult.

However, keep in mind that, as per the plan discussed here, integrating the fiat-crypto C output code is currently a higher priority.

real-or-random

Indentation should be 4 spaces in C files. (Sometimes only 2 are used here.)

I agree with @jonasnick, it will help to have clean and well separated commits.

And yes, at least according to the plan we made, this is blocked on fiat-crypto C code. No matter if it's blocked or not, it's not realistic to get this merged by 28th. But I feel it's certainly fair to say that the code is under review and planned to be integrated (if this helps).

src/selftest.h

real-or-random · 2023-06-20T22:04:23Z

src/third_party/field_5x52_asm_impl_cryptopt.h

+secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a,
+                       const uint64_t *SECP256K1_RESTRICT b) {


This can go on one line. (We don't have a fixed value for max length of lines, but they tend to be longer than shorter.)

I know you took the SECP256K1_RESTRICT from the old code, but I think it can be dropped. It's rather meaningless here. (What is the compiler supposed to optimize here?! It's just asm.) Or am I overlooking something?

(What is the compiler supposed to optimize here?! It's just asm.)

Now that I think about it again, I suggest leaving it in. It's there now, and whether to remove it is probably a separate discussion.

But in general, can you confirm that the code here is intended to stick to the rules laid out in field.h?

/* (...) * r and a may point to the same object, but neither can be equal to b. (...) */ static void secp256k1_fe_mul(secp256k1_fe *r, const secp256k1_fe *a, const secp256k1_fe *b);

Neither the Optimizer, nor the checker enforces that property. But manually checking it at the moment shows that it is the case. Maybe it's possible to check that on the fiat side. I'm tracking that in issue 0xADE1A1DE/CryptOpt#167 for the cryptopt side.

src/third_party/field_5x52_asm_impl_cryptopt.h

dderjoel · 2023-06-21T14:54:51Z

Thanks for the feedback. I kept the commits in there to have the history and thought we may squash them on merge.
I've rebased everything now from the current master, and renamed the commits, while incorporating the ideas from you all.

dderjoel · 2023-06-27T22:27:55Z

Do not merge until tested with gcc -O3 -march=skylake and clang -march=skylake
See mit-plv/fiat-crypto#1560 (comment)

dderjoel · 2023-06-28T08:46:30Z

Problem seems indeed to be rbp. See this comment. I'm working on generating code without using rbp, tracked by this issue. I'll try to get this sorted by next week. Then get some code and update the PR.

dderjoel · 2023-07-07T00:45:19Z

Hi,
CryptOpt has now two new features, one is to not use rbp and the other one is to value the potential memory overlap of the first two parameters, as @real-or-random said earlier. Although this property is not yet formally checked (tracked here), manually checking it shows that it is present. It also yields very similar (a bit faster actually) performance results, see below.

I've also rebased the branch on the current master, and changed the note in the Readme, that the assembly is no longer the hand optimized version, but the CryptOpt-generated one.

Those numbers are with clang -O3 -mtune=native -march=native on my 10 different machines. (Making it comparable with the third table in this comment, showing that the fiat_cryptopt column is a tad better.

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	15.7082	15.2509	15.6863	14.9906	14.548	13.9724
ecmult_const	29.4523	28.0577	29.2846	27.2908	27.6586	26.3218
ecmult_1p	23.3684	22.0285	23.1136	21.5132	21.9431	21.08
ecmult_0p_g	17.0308	15.9766	16.5105	15.3775	16.4462	15.7157
ecmult_1p_g	13.6926	12.9288	13.5101	12.6087	12.8326	12.3297
field_half	0.00518338	0.00520039	0.00509803	0.00509937	0.00501453	0.005152
field_normalize	0.00479336	0.00478621	0.00480812	0.00474333	0.00473504	0.00474445
field_normalize_weak	0.00284125	0.00284032	0.00284684	0.0028214	0.00281811	0.00281636
field_sqr	0.0136875	0.014248	0.014884	0.0135847	0.0117539	0.0119451
field_mul	0.0170591	0.0168615	0.0179167	0.0167125	0.0147756	0.0143076
field_inverse	1.52243	1.52486	1.52639	1.5235	1.51161	1.64494
field_inverse_var	0.927991	0.923316	0.9266	0.916022	0.913502	0.926668
field_is_square_var	1.21713	1.22058	1.21377	1.20963	1.20327	1.22883
field_sqrt	3.83393	3.11757	3.34495	3.74102	3.31233	3.40706

dderjoel · 2023-08-08T00:45:43Z

@andres-erbsen , you're saying the C code didn't change. Is there anything that we'd believe would change in the JSON? I wonder if I would need to give it a try to generate myself and check. (Reason why I think there could be changes is that I am removing all static_casts when preprocessing, and I do not check the bounds. Also, I assume that e.g. for an adx, the last parameter is a single bit.)

andres-erbsen · 2023-08-08T02:51:32Z

I would guess there are no relevant changes. If so, I expect the fiat-crypto equivalence checker would pass with the assembly files you have and --inbounds-multiplier 8.

dderjoel · 2023-08-08T05:23:26Z

Nice, I've double checked:
I've gotten the latest binaries from the Artifact of https://github.com/mit-plv/fiat-crypto/actions/runs/5721610921
Then I've checked all the current files including the ones for this PR and they all pass. with
./dettman_multiplication --inline --static --use-value-barrier secp256k1_dettman 64 5 48 2 '2^256 - 4294968273' $m --no-primitives --no-wide-int --shiftr-avoid-uint1 --output /dev/null --output-asm /dev/null --hints-file $f --inbounds-multiplier 8

dderjoel commented May 25, 2023

View reviewed changes

src/third_party/field_5x52_asm_impl_cryptopt.h Outdated Show resolved Hide resolved

sipa reviewed May 26, 2023

View reviewed changes

src/third_party/field_5x52_asm_impl_cryptopt.h Outdated Show resolved Hide resolved

sipa reviewed May 26, 2023

View reviewed changes

src/third_party/field_5x52_asm_impl_cryptopt.h Outdated Show resolved Hide resolved

dderjoel marked this pull request as ready for review May 29, 2023 07:53

sipa reviewed May 31, 2023

View reviewed changes

src/third_party/field_5x52_asm_impl_cryptopt.h Outdated Show resolved Hide resolved

real-or-random added assurance performance labels Jun 11, 2023

real-or-random reviewed Jun 20, 2023

View reviewed changes

dderjoel force-pushed the only-asm branch 2 times, most recently from 0b1c7ad to 47ddc46 Compare June 21, 2023 14:49

dderjoel force-pushed the only-asm branch from 47ddc46 to c010c15 Compare June 21, 2023 15:25

dderjoel marked this pull request as draft June 27, 2023 22:28

dderjoel mentioned this pull request Jun 28, 2023

Alternative to uint128 mit-plv/fiat-crypto#1560

Open

dderjoel force-pushed the only-asm branch 2 times, most recently from f500408 to b160ddd Compare July 1, 2023 04:32

dderjoel mentioned this pull request Jul 1, 2023

Feature requrest: enforce memory contraints 0xADE1A1DE/CryptOpt#167

Closed

3 tasks

dderjoel marked this pull request as ready for review July 1, 2023 08:30

khadem5677 approved these changes Jul 2, 2023

View reviewed changes

dderjoel added 3 commits July 7, 2023 09:40

build: not default to x64_asm. only use the asm if explicitly requested

e0494aa

added runtime check for bmi2 and adx

d38ab07

replace the asm implementation for mul/square inner

4742310

dderjoel force-pushed the only-asm branch from b160ddd to 4742310 Compare July 7, 2023 00:14

Merge branch 'bitcoin-core:master' into only-asm

3eaab69

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace ASM with CryptOpt generated #1329

Replace ASM with CryptOpt generated #1329

dderjoel commented May 25, 2023

dderjoel commented May 25, 2023

real-or-random commented May 25, 2023

real-or-random commented May 25, 2023

dderjoel commented May 29, 2023

dderjoel commented May 29, 2023

dderjoel commented May 29, 2023

dderjoel commented May 29, 2023

dderjoel commented Jun 7, 2023

dderjoel commented Jun 17, 2023

jonasnick commented Jun 20, 2023

real-or-random left a comment

real-or-random Jun 20, 2023

real-or-random Jun 29, 2023

dderjoel Jul 1, 2023

dderjoel commented Jun 21, 2023

dderjoel commented Jun 27, 2023

dderjoel commented Jun 28, 2023

dderjoel commented Jul 7, 2023

dderjoel commented Aug 8, 2023

andres-erbsen commented Aug 8, 2023 •

edited

Loading

dderjoel commented Aug 8, 2023

		secp256k1_fe_mul_inner(uint64_t r, const uint64_t a,
		const uint64_t *SECP256K1_RESTRICT b) {

Replace ASM with CryptOpt generated #1329

Are you sure you want to change the base?

Replace ASM with CryptOpt generated #1329

Conversation

dderjoel commented May 25, 2023

dderjoel commented May 25, 2023

real-or-random commented May 25, 2023

real-or-random commented May 25, 2023

dderjoel commented May 29, 2023

dderjoel commented May 29, 2023

dderjoel commented May 29, 2023

dderjoel commented May 29, 2023

dderjoel commented Jun 7, 2023

dderjoel commented Jun 17, 2023

jonasnick commented Jun 20, 2023

real-or-random left a comment

Choose a reason for hiding this comment

real-or-random Jun 20, 2023

Choose a reason for hiding this comment

real-or-random Jun 29, 2023

Choose a reason for hiding this comment

dderjoel Jul 1, 2023

Choose a reason for hiding this comment

dderjoel commented Jun 21, 2023

dderjoel commented Jun 27, 2023

dderjoel commented Jun 28, 2023

dderjoel commented Jul 7, 2023

dderjoel commented Aug 8, 2023

andres-erbsen commented Aug 8, 2023 • edited Loading

dderjoel commented Aug 8, 2023

andres-erbsen commented Aug 8, 2023 •

edited

Loading