Alternative to uint128 #1560

davidben · 2023-02-18T04:47:41Z

The 64-bit fiat-c files rely on uint128, which doesn't work in MSVC. Instead, MSVC has its own intrinsics. (Irritatingly, they're different between x86_64 and aarch64. There's _umul128 on x86_64 and then __umulh on aarch64.)

@andres-erbsen mentioned the --split-multiret exists which would get most of the way there, so if we could regenerate fiat-c with that, then we could probably, at least as a start, just patch out the resulting multiplication intrinsic and see how it works. And then perhaps, from there, figure out how to make the output do it automatically.

The text was updated successfully, but these errors were encountered:

davidben · 2023-02-18T04:48:17Z

(Some years ago, before Fiat outputted nice standalone files, Andres and I tried this and we instead crashed MSVC. But that was an older MSVC and Fiat's output has since changed, so I'm hopeful that it'll work now.)

andres-erbsen · 2023-02-24T19:28:39Z

Correction: code generated using --no-wide-int is on par with code generated without that flag on gcc iff addcarryx is replaced with an intrinsic and boringssl is built in release mode.

JasonGross · 2023-04-19T20:37:58Z

Assuming we want to match https://boringssl.googlesource.com/boringssl/+/aa31748bc84f0cd499e4b0337bf929b132aab1cc, we could add a --msvc flag that enables an adjustment to

fiat-crypto/src/Stringification/C.v

Lines 116 to 166 in 457cb5e

    
           Definition header 
        
                {language_naming_conventions : language_naming_conventions_opt} 
        
                {documentation_options : documentation_options_opt} 
        
                {package_namev : package_name_opt} 
        
                {class_namev : class_name_opt} 
        
                {output_options : output_options_opt} 
        
                (machine_wordsize : Z) (internal_static : bool) (static : bool) (prefix : string) (infos : ident_infos) 
        
                (typedef_map : list typedef_info) 
        
           : list string 
        
             := let bitwidths_used := bitwidths_used infos in 
        
                let value_barrier_bitwidths := value_barrier_bitwidths infos in 
        
                let FIAT_EXTENSION := (String.to_upper prefix ++ "FIAT_EXTENSION")%string in 
        
                ([""; 
        
                 "#include <stdint.h>"] 
        
                   ++ (if IntSet.mem _Bool bitwidths_used || IntSet.mem (int.signed_counterpart_of _Bool) bitwidths_used 
        
                       then ["typedef unsigned char " ++ int.type.to_string prefix _Bool ++ ";"; 
        
                               "typedef signed char " ++ int.type.to_string prefix (int.signed_counterpart_of _Bool) ++ ";"]%string 
        
                       else []) 
        
                   ++ (let gnuc_defines 
        
                           := (if IntSet.mem uint128 bitwidths_used || IntSet.mem int128 bitwidths_used 
        
                               then [(FIAT_EXTENSION, "__extension__")] 
        
                               else []) 
        
                                ++ [(FIAT_INLINE prefix, "__inline__")] 
        
                       in 
        
                       ["#if defined(__GNUC__) || defined(__clang__)"] 
        
                         ++ List.map (fun '(MACRO, primitive) => "#  define " ++ MACRO ++ " " ++ primitive)%string gnuc_defines 
        
                         ++ ["#else"] 
        
                         ++ List.map (fun '(MACRO, primitive) => "#  define " ++ MACRO)%string gnuc_defines 
        
                         ++ ["#endif"]) 
        
                   ++ (if IntSet.mem uint128 bitwidths_used || IntSet.mem int128 bitwidths_used 
        
                       then [""; 
        
                            FIAT_EXTENSION ++ " typedef signed __int128 " ++ int.type.to_string prefix int128 ++ ";"; 
        
                            FIAT_EXTENSION ++ " typedef unsigned __int128 " ++ int.type.to_string prefix uint128 ++ ";"]%string 
        
                       else []) 
        
                   ++ (if skip_typedefs 
        
                       then [] 
        
                       else List.flat_map 
        
                              (fun td_name => 
        
                                 match List.find (fun '(name, _, _, _) => (td_name =? name)%string) typedef_map with 
        
                                 | Some td_info => [""] ++ make_typedef prefix td_info 
        
                                 | None => ["#error ""Could not find typedef info for '" ++ td_name ++ "'"";"]%string 
        
                                 end%list) 
        
                              (typedefs_used infos)) 
        
                   ++ ["" 
        
                       ; "#if (-1 & 3) != 3" 
        
                       ; "#error ""This code only works on a two's complement system""" 
        
                       ; "#endif"] 
        
                   ++ (List.flat_map 
        
                         (value_barrier_func internal_static prefix) 
        
                         (IntSet.elements value_barrier_bitwidths)) 
        
                   ++ [""])%list.

to include

#include <intrin.h>
#if defined(_M_X64)
#include <immintrin.h>
#endif

and emits the relevant definitions for the primitives (presumably this is easiest to do if I add a new primitive ident for #if)

JasonGross · 2023-04-20T00:22:46Z

Hm, but it's not entirely clear how this should be structured. Maybe we should add support for each backend to describe a collection of "here's the string body of this function", to be used if the users request --primitives-via-intrinsics or something?

davidben · 2023-04-20T14:09:57Z

Also not wedded to the current the structure if something else be more convenient on the fiat side.

The main nuisance is that, when Andres and I tested it, generating MSVC-style code[*] for u128-capable targets produced dramatically worse code for Clang and GCC. It seems compilers are not very good at switching the u64 pairs back into a u128, and having them separate impedes some kind of analysis. :-(

Though it occurs to me we never tried teaching MSVC to fit the u128 APIs. That is, on MSVC, we could define u128 as a struct { uint64_t lo, hi; } and then implementing that with its intrinsics. Question is how good MSVC is at promoting structs to registers. Maybe something to try. Also it's a little subtle because the abstraction would need to capture that we never truly compute u128 x u128, only u64 x u64 -> u128.

[*] That is, we tried implementing the "intrinsics" using u128 and then splitting the u128s into pairs of u64s to use a single abstraction. Naively, one would think this works because a u128 needs to be split into u64s by the compiler anyway!

real-or-random · 2023-05-23T08:40:59Z

That is, on MSVC, we could define u128 as a struct { uint64_t lo, hi; } and then implementing that with its intrinsics.

This is what we do in libsecp256k1. See files https://github.com/bitcoin-core/secp256k1/blob/master/src/int128_struct_impl.h, https://github.com/bitcoin-core/secp256k1/blob/master/src/int128_struct.h and https://github.com/bitcoin-core/secp256k1/blob/master/src/int128.h.

This uint128 struct implementation was formally proven correct in VST by @roconnor-blockstream (at least when it was initially added), see bitcoin-core/secp256k1#1000 (comment). However, we don't have the proofs in our repo, and we don't check them continuously or adapt them to code changes. Perhaps @roconnor-blockstream maintains an up-to-date version?

Edit: Maybe adding code that uses a generic interface such as in https://github.com/bitcoin-core/secp256k1/blob/master/src/int128.h will be a good first step, even without including a (formally verified) implementation. As I understand, it would not really worsen the guarantees of fiat-crypto, because the conversion to strings is anyway not covered by the proofs. This would help us move forward with the integration in libsecp256k1.

andres-erbsen · 2023-05-23T11:18:48Z

Running fiat-crypto with --no-wide-int (like here) makes code without uint128. However, this code is slower than code that uses uint128 on some compilers. Further, the way to tease the compiler to get decently fast code seems to be a rabbithole, see here for an incomplete solution. Following that example, we could have a three-level system:

assembly for x86-64 with ADX,BMI,BMI2
uint128-based code as currently generated for compilers that support it
code with fiat-crypto --no-wide-int (uint128 inlined) for other compilers. This code would rely on a handwritten file defining operations on 64-bit integers in compiler- and architecture-specific ways
- using multi-return intrinsics such as *out1 = _umul128(arg1, arg2, out2) whenever available
- falling back to single-return intrinsics if needed *out2 = __umulh(arg1, arg2)
- falling back to polyfills implemented using uint64_t if absolutely necessary (WIP here).

I would be happy to assist with all required fiat-crypto adaptations and proofs of polyfills, but I would very much appreciate if somebody else did the job of hunting down appropriate incantations to get fast full addition, subtraction, multiplication, and cmov for 64-bit (and 32-bit?) numbers for different targets. I propose we'd then

check a huge file with ifdefs into fiat-crypto (or perhaps multiiple files ifdef-included would be better?)
have non-uint128-based implementations generated by fiat-crypto #include this file
copy-paste all three files to libsecp256k1 and other users of fiat-crypto-generated code

andres-erbsen · 2023-05-23T11:45:16Z

Here's a mashup of fiat-crypto secpk256k1 C code with uint128 only used to implement full 64-bit arithmetic: https://gist.github.com/andres-erbsen/7dd3ce006f7361840225f081bcdd9ad7 This should be a good starting point for experimentation with intrinsics and such.

davidben · 2023-05-23T12:58:35Z

That is, on MSVC, we could define u128 as a struct { uint64_t lo, hi; } and then implementing that with its intrinsics.

Ah that could also be worth trying out. We know that Clang and GCC generate worse code when MSVC's free-floating u64s are implemented in terms of u128, but we didn't try the converse. If u128 is implemented using a struct in just MSVC, presumably that won't impact GCC/Clang and hopefully MSVC can promote the struct to registers. Then we could do it in one file.

andres-erbsen · 2023-05-23T16:19:09Z

Indeed, polyfilling u128 sounds good as well. In case we do both, --no-wide-int would truly optional for templates that don't otherwise need intrinsics, whereas saturated arithmetic would still need intrinsics / other polyfills.

real-or-random · 2023-05-23T16:31:22Z

If u128 is implemented using a struct in just MSVC

Restricting this to MSVC is what we did, and it works rather well. __int128 is the "intrinsic" way on gcc/clang, so there's need to use something else for these compilers.

This should be a good starting point for experimentation with intrinsics and such.

At least the docs are good: https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170 I think we could help here in general, but I don't have the time right now.

real-or-random · 2023-05-24T09:11:31Z

Here's a mashup of fiat-crypto secpk256k1 C code with uint128 only used to implement full 64-bit arithmetic: gist.github.com/andres-erbsen/7dd3ce006f7361840225f081bcdd9ad7 This should be a good starting point for experimentation with intrinsics and such.

I think these are all relevant intrinsics on x64 MSVC for this code:

_addcarry_u64, or _addcarryx_u64 on ADX
_subborrow_u64
__umul_h with _umul128, or _mulx_u64 on BMI2

If the goal is polyfilling u128, these are interesting, too:

__shiftleft128,
__shiftright128

And maybe these are interesting in general, I don't know:

_shlx_u64 on BMI2
_shrx_u64 on BMI2

Indeed, polyfilling u128 sounds good as well.

Would it be feasible to generate a mashup of secp256k1 C that uses functions like u128_mul, u128_add instead of operators on the native type? Then we could experiment with implementing it, using intrinsics and polyfills. I think our existing u128 code is a good starting point. Currently, it has only the ops we need in our code. For example, it doesn't have addition because our code is written to use accum_mul only. But the missing ops should be easy to add.

JasonGross · 2023-05-24T13:20:45Z

Would it be feasible to generate a mashup of secp256k1 C that uses functions like u128_mul, u128_add instead of operators on the native type?

This should be pretty easy on the Fiat Crypto codegen side: add an option --explicit-wide-int-functions that enables an extra rewriting pass that rewrites double-width operations to Z.add_modulo, etc, and add these functions to the collection of primitive functions that get searched for / processed / printed.

davidben · 2023-05-24T13:31:46Z

One detail to keep in mind: the natural mul function for u128 is not u128 x u128 -> u128, but u64 x u64 -> u128. That's the operation you get from the CPU and what MSVC gives you intrinsics for. (Indeed we're currently relying on GCC and Clang to recognize the lower bound on the casted inputs to generate fast and constant-time code.)

Addition is a bit more flexible, but if there are a few variants where the inputs have tighter bounds, it might help the compiler along slightly, depending on how smart MSVC is. That's probably a place to iterate on.

Also, since u128 will be abstracted, the casts will also need to go through functions.

dderjoel · 2023-06-21T15:12:42Z

Hi,

This should be pretty easy on the Fiat Crypto codegen side: add an option --explicit-wide-int-functions that enables an extra rewriting pass that rewrites double-width operations to Z.add_modulo, etc, and add these functions to the collection of primitive functions that get searched for / processed / printed.

I'd just kindly wanted to bump this and ask what the status of the rewrite pass --explcit-wide-int-functions would be. Jason, did you want to take care of this?

I feel, if we want to advance on bitcoin-core/secp256k1#1319 (using Fiat-Crypto generated C code in Bitcoin-core), we would need this.

Thanks!

JasonGross · 2023-06-21T15:39:35Z

No, I haven't had a chance to work on this, and don't expect to have time soon. @andres-erbsen are you working on this?

andres-erbsen · 2023-06-21T19:51:27Z

No, and I don't understand your suggestion. What does Z.add_modulo have to do with it?

andres-erbsen · 2023-06-21T21:43:15Z

I sketched up a bedrock2-style version that passes uint128 by address at https://godbolt.org/z/qe5veE4dE and it seems that clang allocates a needless stack frame in that style. Passing uint128 by pointer (or probably by value) does not have that issue, but that is not straightforward to support in Bedrock2. Thus the choice we make here potentially has implications for the goal of replacing the unverified C backend with an (improved) Bedrock2 backend.

JasonGross · 2023-06-21T23:42:18Z

No, and I don't understand your suggestion. What does Z.add_modulo have to do with it?

Probably Z.add_modulo is wrong, and I should have written Z.add_carry_full or whatever

andres-erbsen · 2023-06-22T15:47:04Z

I am further leaning against designing fiat-crypto around uint128 after learning more about it. Yes, it's currently 10% faster than our next best alternative on clang, but it's also nonstandard without a stable ABI and would require legwork to support in a more rigorous Bedrock2-based C backend. @dderjoel would you be willing to benchmark and investigate how well the approach based on mulx-style intrinsics works for secp256k1?

Here's a mashup of fiat-crypto secpk256k1 C code with uint128 only used to implement full 64-bit arithmetic: https://gist.github.com/andres-erbsen/7dd3ce006f7361840225f081bcdd9ad7 This should be a good starting point for experimentation with intrinsics and such.

dderjoel · 2023-06-27T22:12:33Z

It is actually part of the cryptopt generated code. It does not segfault, when I do not specify the -march= flag.

git clone https://github.com/dderjoel/secp256k1
cd secp256k1
git checkout only-asm
./autogen.sh
CC=clang CFLAGS='-march=skylake' ./configure --with-asm=x86_64
make clean bench_ecmult
./bench_ecmult

It segfaults at the ret sandwiched in two vzeroupper:

   0x555555585c52 <secp256k1_ecmult_odd_multiples_table+5298>:  pop    r15
   0x555555585c54 <secp256k1_ecmult_odd_multiples_table+5300>:  pop    rbp
   0x555555585c55 <secp256k1_ecmult_odd_multiples_table+5301>:  vzeroupper
=> 0x555555585c58 <secp256k1_ecmult_odd_multiples_table+5304>:  ret
   0x555555585c59 <secp256k1_ecmult_odd_multiples_table+5305>:  vzeroupper

I believe it is because I use rbp in the asm, because of two reasons:

Using CC=gcc CFLAGS='-march=skylake -O3 -mtune=skylake' ./configure --with-asm=x86_64, it complains that I use bp in the assembly:

In file included from src/secp256k1.c:27,
                 from src/bench_ecmult.c:8:
src/field_impl.h: In function ‘secp256k1_fe_sqrt’:
src/field_impl.h:158:1: error: bp cannot be used in ‘asm’ here
  158 | }
      | ^
make: *** [Makefile:1214: src/bench_ecmult-bench_ecmult.o] Error 1

Using only gcc -O2 -march=skylake -mtune=skylake or gcc -O3 -mtune=skylake it neither complains nor segfaults.

I have seen earlier (cannot reproduce it anymore) that the generated code does not respect rbp being clobbered. That is here: In line 24 rbp is used for some value, but clang replaces q0 with some [rbp + 0x8] or so... (I'm not 100% sure if it was q0 or q1 or q2)

real-or-random · 2023-06-27T22:19:32Z

Oh, you mean our benchmark segfaults, not clang itself! That could be as well be a bug on our side...

dderjoel · 2023-06-27T22:26:12Z

Oh, you mean our benchmark segfaults, not clang itself! That could be as well be a bug on our side...

Just updated the comment. That also means that we should not merge the code at the moment. There is no tests in the CI using -march=skylake with clang, is there? Or a test with gcc -O3 -march=skylake ?

andres-erbsen · 2023-06-28T01:31:38Z

Yep I think that gcc warning is onto something. Perhaps the =&m temporaries get encoded as rbp-relative references, which are bogus when rbp is temporarily repurposed for arithmetic use? I imagine that'd be a requirement for the conflicting use case of rsp-modifying asm. Concretely, for example here, imagine %q0 being ~~-0x40(%%rsp)~~ -0x40(%%rbp).

Also does libsecp256k1 support building with -fno-omit-frame-pointer?

dderjoel · 2023-06-28T05:29:58Z

Concretely, for example here, imagine %q0 being -0x40(%%rsp).

that should be okay, no? it's wrong it its based off rbp e.g. %q0 being -0x40(%%rbp). Or am I misunderstanding?

real-or-random · 2023-06-28T08:17:52Z

Yep, I think this is llvm/llvm-project#58528.

When I compile with -march=skylake -O0 -g and run this in gdb, then command disas tells me that the crash is in

 => 0x000055555556104c <+156>:	mov    %rdi,-0x38(%rbp)

which corresponds to the line

 "mov    %%rdi,%q0\n"

in the source. (This line: https://github.com/dderjoel/secp256k1/blob/c010c156424d5c06f9b6d16a839e04bbea5b31c1/src/third_party/field_5x52_asm_impl.h#L35 )

Look at the "addresses" of the variables here:

Program received signal SIGSEGV, Segmentation fault.
0x000055555556104c in secp256k1_fe_mul_inner (r=<error reading variable: Cannot access memory at address 0xffffffffffffffb0>, 
    a=<error reading variable: Cannot access memory at address 0xffffffffffffffa8>, 
    b=<error reading variable: Cannot access memory at address 0xffffffffffffffa0>) at src/third_party/field_5x52_asm_impl.h:13

Indeed, it seems that using rbp for arithmetic is not possible here, even if one lists it as clobbered.

(Happy to move this discussion to the PR, it's a bit off-topic here.)

dderjoel · 2023-06-28T09:00:00Z

Okay, thanks for investigating. (extended answer moved to PR)

But let me quickly come back on topic of uint128's. Based on Andres'

All in all I am leaning towards saying that uint128 is not worth supporting.

I feel we want something like the mashup, as the default and if one wants, --wide-int to emit code with uint128s? Do we want me to run again to get more performance numbers, especially with clang -O3 -march=native (I'll fix my scripts to exclude the assembly for now then.)?

(Sorry for getting off topic again)
For the bitcoin-core/secp256k1#1319, I think we would then want code like the one in the current benchmark, but generated entirely from the fiat pipeline, right?

andres-erbsen · 2023-06-28T12:08:17Z

Yeah that sounds good, we can keep uint128 behind a flag in cases its already implemented, but off by default. The Bedrock2 backend never used uint128 outside its mulx function. In the glorious future where the Bedrock2-based backend becomes the default, we may consider dropping the wholly unverified C backend.

I would appreciate performance numbers, but no rush. I'm out for a much-needed vacation starting this weekend so probably I won't look at anything until two weeks from now.

If I was doing bitcoin-core/secp256k1#1319 I would rerun fiat-crypto with the command at the top of the benchmark file (here) and diff the output with the benchmark file, then flag any differences for manual review (I hope it's just the ifdefed functions). It's not like we can prove anything useful about intrinsics anyway.

real-or-random · 2023-06-28T14:51:08Z

All in all I am leaning towards saying that uint128 is not worth supporting.

Yeah that sounds good, we can keep uint128 behind a flag in cases its already implemented, but off by default.

Hm, I see. I can't speak for the other contributors, but I think this makes fiat-crypto the C output much less attractive for libsecp256k1. The reasons are:

We have a formally verified polyfill for uint128_t using a struct (for MSVC), and we use uint128 in other places in our code. (Namely, in the scalar field implementation. This could in principle also be replaced by fiat-crypto output in the long term, but that needs to be considered separately.)
We care about performance on GCC, and the benchmarks show that we easily lose > 10% for EC multiplication.

I think one possible way forward for us is to use fiat-crypto's C output (with uint128) and rewrite it manually to use our generic uint128 interface. Since the backend is not formally verified, this would not forego any formal guarantees. The result of the rewriting should be equivalent to our code. In fact, if it turns out to be equivalent, then we don't even need to replace our code. (That sounds a bit silly or even sad then, but it still would give us a lot of confidence that our code implements a correct algorithm.)

andres-erbsen · 2023-06-28T15:43:03Z

Thank you for the feedback. I am thinking whether there's something we can do along the lines of what you suggested but less manual. The Coq code that prints the C you would be rewriting is here and if only it had types there it could just be changed to print uint128 operations as function calls. For multiplication it would probably suffice to check whether one argument has a cast to uint128, and maybe a similar recursive check would work for uint128 add, but I am not sure. If this sounds like it would do the thing for you, I am happy to give it a try.

On what architectures do you care about performance with GCC?

real-or-random · 2023-06-28T16:16:00Z

Thank you for the feedback. I am thinking whether there's something we can do along the lines of what you suggested but less manual. The Coq code that prints the C you would be rewriting is here and if only it had types there it could just be changed to print uint128 operations as function calls. For multiplication it would probably suffice to check whether one argument has a cast to uint128, and maybe a similar recursive check would work for uint128 add, but I am not sure. If this sounds like it would do the thing for you, I am happy to give it a try.

Okay, yes, this is what I brougt up earlier, and this would, of course, be optimal for us.

On what architectures do you care about performance with GCC?

This is probably not a very focused answer, but I think the story here is that aim for decent performance on GCC and clang. Official Bitcoin Core builds use GCC for Linux and Windows, and clang for macOS, which is something we keep in mind in libsecp256k1, but we really try to offer a portable library that works well more or less everywhere, and don't want to force our users to use a specific compiler. Architectures where we really care about performance are x86_64 and maybe ARM64. Hardware wallets are also interesting, but the landscape there is a bit more complicated (some are ARM32 for example.)

andres-erbsen · 2023-06-28T22:15:25Z

Ok I started a prototype to print generic uint128 code at andres-erbsen@3626fec and kicked off a compile. This exact change is not enough, but if it does the right thing for the handled case, more of the same should be straightforward. I forget what the expected compilation time for secp256k1 in fiat-crypto is, but seconds wasn't enough. (The point in comment you link to still misses me, but alas, I think I now understand what you would like to happen here.)

The performance answer is along the lines of what I was hoping for, thanks. Just to clarify: is performance with GCC on x86-64 important or not given that there is also an assembly implementation?

dderjoel · 2023-07-01T08:30:38Z

I'm back with more numbers. I've fixed the segfault in fiat_cryptopt and could run all the benchmarks across my machines again.
I'll be comparing here {gcc,clang} {-O3,-O2} -march=native -mtune=native; and the default compilations {gcc,clang}, which should result in -O2, with a more portable binary.
Spoiler: With clang, fiat_c_narrow_int is actually a tad faster than the current c implementation.

clang (default)

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	14.4	14.8685	15.509	14.4466	14.2407	13.1947
ecmult_const	28.6379	28.9282	30.4886	28.0935	28.1623	26.1167
ecmult_1p	22.6069	22.4785	23.6166	21.6802	22.1214	20.6979
ecmult_0p_g	16.4805	16.2687	17.058	15.6537	15.9589	15.0235
ecmult_1p_g	13.2241	13.1847	13.802	12.7043	12.893	12.2081
field_half	0.00245578	0.0023995	0.00228864	0.00237679	0.0024416	0.00238009
field_normalize	0.00694667	0.006916	0.00698636	0.00692382	0.0069511	0.00692881
field_normalize_weak	0.0028189	0.0028261	0.00281602	0.00278509	0.00279863	0.00279448
field_sqr	0.0135209	0.0127022	0.0138531	0.0128885	0.0122077	0.0120565
field_mul	0.0168353	0.0164905	0.0167452	0.0157152	0.0155041	0.0144297
field_inverse	1.43893	1.42658	1.43585	1.42449	1.43356	1.61813
field_inverse_var	0.95799	0.951122	0.952589	0.949808	0.953455	0.960287
field_is_square_var	1.25882	1.25678	1.25474	1.24586	1.25356	1.26345
field_sqrt	3.73555	3.4951	3.78312	3.51957	3.40714	3.33102

clang -O2 -march=native -mtune=native

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	15.635	15.5033	16.0575	15.0982	14.5795	14.0249
ecmult_const	29.3735	28.4725	29.9029	27.3628	27.5755	26.5836
ecmult_1p	23.3621	22.3728	23.4829	21.4641	22.0046	21.3766
ecmult_0p_g	17.0755	16.1995	16.9391	15.384	16.6517	15.6403
ecmult_1p_g	13.6394	13.0701	13.6493	12.6048	12.8295	12.477
field_half	0.00518533	0.00509934	0.00501634	0.00509551	0.00509858	0.00517301
field_normalize	0.00478045	0.004775	0.00477195	0.00474356	0.00473628	0.00472487
field_normalize_weak	0.00283434	0.00282166	0.00284423	0.00280772	0.00283141	0.00282065
field_sqr	0.0136087	0.0142888	0.0147477	0.0134873	0.0117804	0.0120515
field_mul	0.0169551	0.0169624	0.0176455	0.0165582	0.0147279	0.0144852
field_inverse	1.55314	1.52646	1.51292	1.50915	1.51043	1.64765
field_inverse_var	0.923946	0.922127	0.923942	0.916541	0.916457	0.927074
field_is_square_var	1.21189	1.22217	1.21449	1.21058	1.20822	1.22949
field_sqrt	3.81432	3.85748	4.02828	3.72946	3.3191	3.35218

clang -O3 -march=native -mtune=native

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	15.5047	15.2393	15.731	14.99	14.5696	14.1213
ecmult_const	29.3673	28.0945	29.2459	27.2205	27.6556	26.6799
ecmult_1p	23.3116	22.1069	23.0035	21.4319	22.085	21.3718
ecmult_0p_g	17.012	15.8731	16.55	15.304	16.2868	15.6925
ecmult_1p_g	13.6059	13.0403	13.5035	12.587	12.8417	12.4987
field_half	0.00522527	0.00511302	0.00503948	0.00498669	0.00505752	0.00513952
field_normalize	0.00475579	0.00475721	0.00475133	0.00471258	0.00473416	0.00479262
field_normalize_weak	0.00282931	0.00284143	0.00280667	0.00279738	0.00282548	0.0028323
field_sqr	0.0136073	0.014192	0.0146883	0.0134972	0.01179	0.012144
field_mul	0.0169033	0.0169013	0.0177466	0.0165226	0.0146773	0.0144688
field_inverse	1.51547	1.52616	1.51278	1.50301	1.50377	1.65712
field_inverse_var	0.917597	0.920323	0.912464	0.910969	0.915148	0.932548
field_is_square_var	1.21427	1.21174	1.20388	1.19847	1.19944	1.23266
field_sqrt	3.82321	3.10106	3.30623	3.68093	3.30375	3.39914

gcc (default)

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	15.8998	15.0468	15.8134	15.0307	16.8996	14.7032
ecmult_const	30.1216	28.1942	30.3036	28.2912	32.3142	27.2834
ecmult_1p	23.8669	21.9609	23.6295	22.2062	25.2611	21.267
ecmult_0p_g	17.0692	15.8632	16.9321	15.8502	17.8889	15.1963
ecmult_1p_g	13.9171	12.8349	13.7314	12.8807	14.7508	12.4131
field_half	0.00260147	0.00239082	0.00246679	0.0025392	0.00246015	0.00242672
field_normalize	0.00745444	0.00746962	0.00734761	0.00730362	0.00734994	0.00747002
field_normalize_weak	0.00297391	0.0029576	0.0029316	0.00292127	0.00290907	0.00294705
field_sqr	0.0140496	0.0126285	0.0137609	0.0124719	0.0155721	0.0122619
field_mul	0.0170276	0.0143555	0.0156142	0.0146069	0.0185516	0.0147299
field_inverse	1.40303	1.40325	1.39395	1.38897	1.38745	1.62885
field_inverse_var	0.925005	0.917206	0.911152	0.905561	0.908435	0.918524
field_is_square_var	1.2015	1.20861	1.19965	1.19366	1.19798	1.20488
field_sqrt	3.87766	3.57096	3.83902	3.44814	4.24594	3.42435

gcc -O2 -march=native -mtune=native

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	16.721	16.2147	17.1447	16.1082	17.4143	15.2807
ecmult_const	31.2689	30.1828	32.2569	29.9705	32.9838	27.977
ecmult_1p	24.591	23.4126	25.1881	23.2472	25.6367	21.8179
ecmult_0p_g	17.4537	16.817	17.8592	16.6341	18.4228	15.6129
ecmult_1p_g	14.3136	13.6362	14.5647	13.5935	15.0641	12.8132
field_half	0.00264791	0.00252462	0.00238327	0.00236645	0.00232041	0.00241018
field_normalize	0.00810289	0.00792896	0.00805239	0.00794354	0.00782179	0.00812656
field_normalize_weak	0.00299978	0.00292424	0.00292019	0.00294033	0.00288756	0.00299396
field_sqr	0.0142147	0.0139958	0.0154915	0.013651	0.0169165	0.0122637
field_mul	0.0173362	0.0159168	0.0171377	0.016962	0.0190466	0.0147364
field_inverse	1.47347	1.44537	1.447	1.44393	1.42127	1.69394
field_inverse_var	0.929955	0.905406	0.90717	0.906503	0.893355	0.930748
field_is_square_var	1.2179	1.20598	1.20458	1.20179	1.18497	1.21568
field_sqrt	3.94708	3.95934	4.33316	3.78405	4.60921	3.468

gcc -O3 -march=native -mtune=native

implementation	default_asm	default_c	default_c52	fiat_c	fiat_c_narrow_int	fiat_cryptopt
ecmult_gen	16.7701	16.1389	16.8432	16.0805	17.706	15.2464
ecmult_const	30.9871	29.0426	31.3515	29.4948	32.6941	27.7546
ecmult_1p	24.7098	23.1923	24.5407	23.0531	25.5556	21.5714
ecmult_0p_g	17.757	16.7931	17.6227	16.5752	18.3798	15.6301
ecmult_1p_g	14.3999	13.6311	14.2793	13.4965	14.913	12.5407
field_half	0.00255814	0.00250539	0.00235872	0.00242583	0.00232508	0.00246923
field_normalize	0.00709768	0.00690524	0.00689225	0.00685556	0.0068617	0.00708038
field_normalize_weak	0.00301215	0.0029099	0.00291935	0.00287837	0.00288088	0.00299388
field_sqr	0.0140762	0.012191	0.0142837	0.0168347	0.0147925	0.0123209
field_mul	0.017737	0.0141453	0.0154513	0.0162763	0.0193284	0.0148816
field_inverse	1.45047	1.44391	1.44917	1.41797	1.41224	1.66926
field_inverse_var	0.896581	0.876363	0.873395	0.863774	0.866151	0.906099
field_is_square_var	1.25105	1.2159	1.21284	1.19329	1.20306	1.24462
field_sqrt	3.91907	3.51509	3.791	4.35562	4.15031	3.43451

real-or-random · 2023-07-03T15:20:46Z

Just to clarify: is performance with GCC on x86-64 important or not given that there is also an assembly implementation?

Good point. Yeah, if you ask me, I think it's reasonable to care a bit less about the performance of the C code on x86-64 given that we'll have assembly there.

real-or-random · 2023-07-03T15:25:33Z

Ok I started a prototype to print generic uint128 code at andres-erbsen@3626fec and kicked off a compile. This exact change is not enough, but if it does the right thing for the handled case, more of the same should be straightforward.

Yep, looks great. As far as I can understand the diff, this is what I had in mind.

dderjoel · 2023-07-25T09:26:03Z

Hi everyone, just a quick follow up question on what the status is here.
If I recall correctly, Andres should be back, and he has created some changes to print uint128 code. What was the result of that? And how would we proceed? Can I help with anything? (I'll have disruption in checking on my arsenal of machines mid September, If we'd want to have some benchmarks for that code)

andres-erbsen · 2023-07-25T11:34:17Z

I am back and I should work on it more. That build never finished, so it'll be a fun debugging session.

andres-erbsen · 2023-07-25T17:12:40Z

master...andres-erbsen:fiat-crypto:uint128-extern generates C code whose compilation only errors on implicit declaration of function 'u128_mul_u64_u64' and so on. No proofs, no testing. I wouldn't be too surprised if it works anyway.

dderjoel · 2023-07-26T04:57:23Z

Alright, cool. If I remember correctly, we wanted to use this implementation for field mul in combination with the already verified primitives (u128 \gets u64*u64, etc), right?

I'm trying to incorporate that, and then see what works and what does not.
The first issue I see is that bitcoin's mul is secp256k1_u128_mul(secp256k1_u128* r, uint64_t a, uint64_t b), and in Andres' version it is fiat_secp256k1_dettman_uint128 r = u64_mul_u64_u64(uint64_t a, uint64_t b); Could we change that to match the bitcoin's one?

I thought I could for now just use search and replace to get the result symbol in there as the first argument, but that is a bit too tricky with all those implicit values in eg. u128_add_u128_u128(&x4, u128_add_u128_u128(u128_mul_u64_u64((arg1[0]), (arg2[3])), u128_add_u128_u128(u128_mul_u64_u64((arg1[1]), (arg2[2])), u128_add_u128_u128(u128_mul_u64_u64((arg1[2]), (arg2[1])), u128_mul_u64_u64((arg1[3]), (arg2[0]))))), u128_mul_u64_u64(x3, UINT64_C(0x1000003d10)));

Then I see that in Bitcoin's version we only have the rshift as an in-place function. Is there any uint128 copy primitive?
(I could think of getting hi and low with
uint64_t secp256k1_u128_to_u64(const secp256k1_uint128 *a);
uint64_t secp256k1_u128_hi_u64(const secp256k1_uint128 *a);
and then set a new u128 with
void secp256k1_u128_load(secp256k1_uint128 *r, uint64_t hi, uint64_t lo);
but that seems a bit verbose, and I don't know what the compiler will do.
)

Also I do not see an u64_and_u128_u64(uint64_t *r, u128 a, u64 b).
would

void u64_and_u128_u64(uint64_t *r, const secp256k1_uint128 *a, u64 b){
    uint64_t t1 = secp256k1_u128_to_u64(a);
    *r = t1 & b;
}

work?

real-or-random · 2023-07-26T11:47:21Z

Then I see that in Bitcoin's version we only have the rshift as an in-place function. Is there any uint128 copy primitive?

You can use the assignment operator to copy:

secp256k1_uint128 copy = orig;

Also I do not see an u64_and_u128_u64(uint64_t *r, u128 a, u64 b). would

void u64_and_u128_u64(uint64_t *r, const secp256k1_uint128 *a, u64 b){
    uint64_t t1 = secp256k1_u128_to_u64(a);
    *r = t1 & b;
}

work?

looks correct to me

andres-erbsen · 2023-07-26T16:08:16Z

How about the following?

secp256k1_u128 u128_mul_u64_u64(uint64_t a, uint64_t b) {
  secp256k1_u128 r;
  secp256k1_u128_mul(&r, a, b);
  return r;
}

roconnor-blockstream · 2023-07-26T16:14:32Z

AFAICT, libsecp256k1 never returns structures from functions.

I'm not entirely sure if this is deliberate or not, but I can add, FWIW, that VST is unable to reason about functions that return structures or take structures as arguments by copy or copying structures by assignment, none of which happen in libsecp256k1 that I am aware of.

VST being https://vst.cs.princeton.edu/, the tool I've been using to reason about C code.

andres-erbsen · 2023-07-26T17:04:12Z

My (fallible) recollection is that the VST limitation originates from a similar limitation CompCert used to have. Given how simple these wrappers are, I think it is fine to not prove anything about them for now. Is there another reason not to pass structs by value?

roconnor-blockstream · 2023-07-26T17:07:26Z

The derailing of VST would be more than just the wrappers, Any function that calls these wrappers also couldn't be reasoned about.

That said, libsecp256k1 didn't conform to this restriction for the sake of VST; it was like that before I even started. Nor do I necessarily expect that libsecp256k1 maintain these restrictions just for the sake of VST, though it might be worth being at least aware of.

andres-erbsen · 2023-07-26T17:40:55Z

Yeah, I don't think you'd want to use that wrapper anywhere outside fiat-crypto-generated code. Generating struct-by-pointer from the fiat-crypto C backend seems like it could be a rabbithole, so I'd rather not look into that unless there is a specific reason. As discussed above, we already have uint64s-by-pointer code and now the struct-by-value branch here. Adding struct-by-pointer support would be easier (and have proofs) in the bedrock2-based backend, but that's not mature enough yet for me to want to propose its use here.

dderjoel · 2023-07-26T23:27:06Z

okay,
I've got some polyfills

inline secp256k1_uint128 u128_mul_u64_u64(uint64_t a, uint64_t b) {
  secp256k1_uint128 r;
  secp256k1_u128_mul(&r, a, b);
  return r;
}
inline uint64_t u64_shr_u128(secp256k1_uint128 a, unsigned int n) {
  secp256k1_uint128 r = a;
  secp256k1_u128_rshift(&r, n);
  return secp256k1_u128_to_u64(&r);
}

inline uint64_t u64_and_u128_u64(secp256k1_uint128 a, uint64_t b) {
  return secp256k1_u128_to_u64(&a) & b;
}

inline secp256k1_uint128 u128_add_u64_u128(uint64_t a, secp256k1_uint128 b) {
  secp256k1_uint128 r = b;
  secp256k1_u128_accum_u64(&r, a);
  return r;
}

inline secp256k1_uint128 u128_add_u128_u128(secp256k1_uint128 a, secp256k1_uint128 b) {
  secp256k1_uint128 r = a;
  // adding low b
  secp256k1_u128_accum_u64(&r, b);

  // adding high b
  // ???

  return r;
}

But I am missing something to add two u128: u128_add_u128_u128. I can only see the secp256k1_u128_accum_u64 to add the low 64 bits. Am I missing something ?

roconnor-blockstream · 2023-07-26T23:42:48Z

u128_add_u128_u128 is never preformed in the current libsecp256k1 implementation.

Edit: I stand corrected 👇

andres-erbsen · 2023-07-27T00:35:28Z

Here's how libsecp256k1 adds uint128

dderjoel · 2023-07-27T01:44:34Z

I've done it in https://github.com/dderjoel/secp256k1/blob/only-c/src/field_5x52_int128_impl.h#L60-L82

And updated the PR bitcoin-core/secp256k1#1319
Even all the test run smoothly.

This comment was marked as outdated.

Sign in to view

JasonGross mentioned this issue Apr 19, 2023

CI C tests fail #1592

Closed

real-or-random mentioned this issue May 23, 2023

Replace the 64-bit C field implementation by fiat-crypto output bitcoin-core/secp256k1#1319

Open

This comment was marked as outdated.

Sign in to view

davidben closed this as completed May 24, 2023

This comment was marked as outdated.

Sign in to view

davidben reopened this May 24, 2023

andres-erbsen changed the title ~~fiat-c 64-bit files don't build on MSVC~~ Alternative to uint128 Jun 22, 2023

This comment was marked as duplicate.

Sign in to view

dderjoel mentioned this issue Jun 27, 2023

Replace ASM with CryptOpt generated bitcoin-core/secp256k1#1329

Open

Alternative to uint128 #1560

Alternative to uint128 #1560

Comments

davidben commented Feb 18, 2023

davidben commented Feb 18, 2023 • edited Loading

This comment was marked as outdated.

This comment was marked as outdated.

andres-erbsen commented Feb 24, 2023 • edited Loading

JasonGross commented Apr 19, 2023

JasonGross commented Apr 20, 2023

davidben commented Apr 20, 2023 • edited Loading

real-or-random commented May 23, 2023 • edited Loading

andres-erbsen commented May 23, 2023 • edited Loading

andres-erbsen commented May 23, 2023

davidben commented May 23, 2023 • edited Loading

andres-erbsen commented May 23, 2023

real-or-random commented May 23, 2023

real-or-random commented May 24, 2023 • edited Loading

JasonGross commented May 24, 2023 • edited Loading

This comment was marked as outdated.

This comment was marked as outdated.

davidben commented May 24, 2023 • edited Loading

dderjoel commented Jun 21, 2023

JasonGross commented Jun 21, 2023

andres-erbsen commented Jun 21, 2023

andres-erbsen commented Jun 21, 2023 • edited Loading

JasonGross commented Jun 21, 2023

andres-erbsen commented Jun 22, 2023

This comment was marked as duplicate.

dderjoel commented Jun 27, 2023 • edited Loading

real-or-random commented Jun 27, 2023

dderjoel commented Jun 27, 2023

andres-erbsen commented Jun 28, 2023 • edited Loading

dderjoel commented Jun 28, 2023

real-or-random commented Jun 28, 2023

dderjoel commented Jun 28, 2023

andres-erbsen commented Jun 28, 2023 • edited Loading

real-or-random commented Jun 28, 2023

andres-erbsen commented Jun 28, 2023

real-or-random commented Jun 28, 2023

andres-erbsen commented Jun 28, 2023 • edited Loading

dderjoel commented Jul 1, 2023

real-or-random commented Jul 3, 2023

real-or-random commented Jul 3, 2023

dderjoel commented Jul 25, 2023

andres-erbsen commented Jul 25, 2023

andres-erbsen commented Jul 25, 2023

dderjoel commented Jul 26, 2023

real-or-random commented Jul 26, 2023

andres-erbsen commented Jul 26, 2023

roconnor-blockstream commented Jul 26, 2023 • edited Loading

andres-erbsen commented Jul 26, 2023

roconnor-blockstream commented Jul 26, 2023

andres-erbsen commented Jul 26, 2023

dderjoel commented Jul 26, 2023

roconnor-blockstream commented Jul 26, 2023 • edited Loading

andres-erbsen commented Jul 27, 2023 • edited Loading

dderjoel commented Jul 27, 2023

davidben commented Feb 18, 2023 •

edited

Loading

andres-erbsen commented Feb 24, 2023 •

edited

Loading

davidben commented Apr 20, 2023 •

edited

Loading

real-or-random commented May 23, 2023 •

edited

Loading

andres-erbsen commented May 23, 2023 •

edited

Loading

davidben commented May 23, 2023 •

edited

Loading

real-or-random commented May 24, 2023 •

edited

Loading

JasonGross commented May 24, 2023 •

edited

Loading

davidben commented May 24, 2023 •

edited

Loading

andres-erbsen commented Jun 21, 2023 •

edited

Loading

dderjoel commented Jun 27, 2023 •

edited

Loading

andres-erbsen commented Jun 28, 2023 •

edited

Loading

andres-erbsen commented Jun 28, 2023 •

edited

Loading

andres-erbsen commented Jun 28, 2023 •

edited

Loading

roconnor-blockstream commented Jul 26, 2023 •

edited

Loading

roconnor-blockstream commented Jul 26, 2023 •

edited

Loading

andres-erbsen commented Jul 27, 2023 •

edited

Loading