Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove wasteful safer Inet4 lookups #276

Closed
wants to merge 1 commit into from

Conversation

franz1981
Copy link
Contributor

No description provided.

@franz1981 franz1981 force-pushed the main_no_concat_indy branch from e0f3b3c to a2c7235 Compare January 15, 2024 12:48
@franz1981
Copy link
Contributor Author

franz1981 commented Jan 15, 2024

@radcortez can you check if the ones happening in clinit are the one you saw in the flamegraphs?

here

image

i can see many, not just the 3 ones I've fixed.

@franz1981
Copy link
Contributor Author

Ah no, the flamegraph is per line of code, so the fix here should take care of it already

@franz1981 franz1981 force-pushed the main_no_concat_indy branch from a2c7235 to 3a6d6da Compare January 15, 2024 13:14
@radcortez
Copy link
Member

Here is the flamegraph with the current behaviour:
alloc-999.html.zip

And with the change:
alloc-999-sr-common-276.html.zip

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 15, 2024

Thanks @radcortez

so

image

The overall saving as "allocation bamdwidth" is ~217KB, due to removing the indy allocation, which "can" translate in more work to the JIT as shown by quarkusio/quarkus#36204.

If you want to give this a shot you could profile with the -e malloc --total argument which should (see https://github.com/async-profiler/async-profiler/blob/8910a7a46241abc7cb5b80ba6717c3eab603aae6/src/perfEvents_linux.cpp#L194) report the size of native allocations to see if something evident has changed around the mentioned hot parts involved in JIT compiler work (see quarkusio/quarkus#36204 (comment)).
This will give the overall footprint saving, apart from the 217 KB reported here (which is still a just once cost).

The overall heap allocation pressure reduction between the 2 versions is ~0.8% which is tiny amount, but still measurable; with the native one you can get the global picture of the improvement, if you wish.

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 16, 2024

The diff as given doesn't seem to do anything to remove an indy. Also as I said on the WF common change, I think that it's better to use new String(bytes, StandardCharsets.ISO_8859_1). @franz1981 referenced a bug (JDK-8295496) as reason why that wasn't done but the bug seemed to relate to a different ctor, relating to copyOfRange? Maybe it was the wrong #. The code path for the ctor I give above should be nearly identical to the deprecated one used in the PR.

@radcortez
Copy link
Member

Yes, I confirm an improvement in allocs RSS for native as well with this PR.

It is somehow significant, but I only noticed it because it was in my analysis path, and it wasn't showing up before on previous versions. We should move forward, but we need to clarify @dmlloyd suggestion first.

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 22, 2024

How are you measuring, and are you comparing against main? I believe that we are talking about a difference of three byte[4] plus six byte[7 <= n <= 15] plus three String instances allocated out of TLAB during class initialization. I would be astonished if that showed up on RSS reports or CPU time reports compared to our (thousands of) other class initializations.

I'm not arguing against the change per se, but if we're justifying any additional complexity by a measurable performance improvement, then the measurements had better show a real-world improvement that is greater than the margin of error.

@franz1981
Copy link
Contributor Author

@dmlloyd

@franz1981 referenced a bug (JDK-8295496) as reason why that wasn't done but the bug seemed to relate to a different ctor, relating to copyOfRange? Maybe it was the wrong #

Nope, sadly the JDK issue is correct, just Claes probaby didn't reported in the issue itself the full analysis which brought me to prepare the reproducer for him, but I will later attach my comment to the PR which explain something more about it.

In the original benchmark reported in the issue, you can find https://gist.github.com/franz1981/bf67ed8f328ed112a524c1150833c718#file-asciicopy-java-L73 which is indeed simplifying the life of the JIT during the generation of the array's copy stub, by having separate call-sites which:

  • allow the JIT to propagate the information that we always pass an array which length is what we use as String's parameter: given that this information is constant and propagated in that version of the method, it would constant-fold some (but not all! that's why the JDK issue is still opened!) the other checks
  • this information about array's length propagate till the stub generation which would skip these bound checks, making it as minimal as possible (although never as good the indy which is using an hidden String constuctor and not performing any - sigh - vectorized copy)

Which translate in the conclusion that having a properly sized array payoff in the already slower (if compared to indy) String creation, at 2 level:

  • the String checks itself
  • the array's copy one

Related using the String constructor which suggest the encoding type, instead, I can quote what Claes said

I also included String(byte[], int, int, ISO_8859_1) which disappointingly lags slightly behind the deprecated String(byte[], int, int, int).

Although on the PR at openjdk/jdk#12453 (comment) I cannot see any mention about it in the main PR comment, but my previous tests report that one as the worst method to use; I can re-verify it, in case.

The changes at https://bugs.openjdk.org/browse/JDK-8301958 are instead providing a "better" copy method, using the same trick I've suggested, but it will improve just the copy, and not the inherent checks in String itself, which will still be performed (and that's why https://bugs.openjdk.org/browse/JDK-8295496 is still opened, despite the copy improvement!).

Said that, these are facts which should be weighted with the perceived (it's personal, but vs the original version is clearly worsen now!) increase of code complexity vs the actual gain: I've no problem to change the PR or close it and let anyone to propose a different change which still improve the RSS and native usage as this one; the runtime performance is sadly already reduced because of not using the indy, but probably is a small drop in the ocean, if is not the hottest path!

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 22, 2024

I think the loss of the indy is not a problem for the same reason that the non-deprecated ctor usage is not a problem: it is indeed a drop in the ocean. We do not create many of these address instances in this way, normally. Certainly not on any hot paths. The purpose of these methods is to construct addresses for which a RDNS lookup is never performed, and the code that exists does that pretty well.

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%? If not, then I don't think the complexity is warranted (in addition to the message "we provide these methods to construct an address, but do not use them because they are too slow").

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 22, 2024

Addendum: In retrospect, the best approach would have been to create these as static methods with return a dynamic constant that lazily constructs each address. But, the only way we could possibly move to that approach (leaving out the necessary bytecode processing, which is not trivial) would be to deprecate the fields, which would not fix the problem until they are actually removed. So, I'd say it's too late for that.

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 22, 2024

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%?

I can share what I've learnt at #276 (comment): the difference exists in quickstart uses cases, which, although are not representative of any "real" usage IMO, are still used by some users (eg some experts which can share RSS random numbers on their social accounts) - which can affect us, somehow.
Maybe @radcortez got some different type of tests, but I believe that ~217KB is still small but surprisingly high cost for something like that (and I didn't checked the RSS native footprint yet) and if we can avoid users to complain about it in public, spending even more cycles, later on, it's an (half) win.

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 22, 2024

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%?

I can share what I've learnt at #276 (comment): the difference exists in quickstart uses cases, which, although are not representative of any "real" usage IMO, are still used by some users (eg some experts which can share RSS random numbers on their social accounts) - which can affect us, somehow. Maybe @radcortez got some different type of tests, but I believe that ~217KB is still small but surprisingly high cost for something like that (and I didn't checked the RSS native footprint yet) and if we can avoid users to complain about it in public, spending even more cycles, later on, it's an (half) win.

Where do you get ~217KB from? Upstream already has the indy-removing change AFAICT... this patch only seems to reuse byte[]. Did I miss something?

@radcortez
Copy link
Member

Sorry, to clarify, the measurement is with Quarkus main, and with a replace of Inet from Wildfly with this one in the PR from SR Commons. I know that Wildfly will soon delegate to SR Commons, but I believe that one is not released yet.

How are you measuring, and are you comparing against main

I didn't use any fancy measurement; a plain ps -e -o pid, RSS,args shows the difference between both versions. I believe the increased RSS came from a new Wildfly Commons version built with a more recent JDK.

For reference, this is how it looks like in main with no changes (and with WF Commons 1.7.0):

Screenshot-2024-01-12-at-12 25 25

And this is what it looks like with Quarkus 3.5.0, before the WF Common (1.5.4) update:

Screenshot-2024-01-12-at-12 25 36

And with the PR change:

Screenshot 2024-01-22 at 17 40 16

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 22, 2024

This however all has no bearing on this PR. WF Common is clearly the main cost here.

To maximize the measurement of this change, I'd recommend building wildfly-common from main (this will be 2.0.0.Final-SNAPSHOT), and then retest the quickstart without and with this PR to see if it makes an impact. The reason for that is since wildfly/wildfly-common#100, WildFly Common will redirect to SmallRye Common for this code, so all of the relevant code paths for this class will come through one route. Then it will be possible to test just this PR in isolation to see if there is indeed a relevant cost savings.

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 22, 2024

@dmlloyd

Where do you get ~217KB from?

It was from the allocation flamegraphs which @radcortez shared in #276 (comment)

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 22, 2024

@dmlloyd let me apologize, my brain was still stuck at #275 which was already merged and for some reason I was thinking this PR to be that one instead 😂

I have probably misread my own comment at #276 (comment) in which I was providing the full analysis of this and the other PR, altogether, compared to without.

This one is just saving to allocate few bytes (and Strings), because we already know the addresses, in string literal form; said that I have no problem to close it, in case it looks weird or useless...

dmlloyd added a commit to dmlloyd/smallrye-common that referenced this pull request Jan 22, 2024
Also, use the new API to construct the global IP address constants, to avoid unnecessary `String` construction.

Supercedes smallrye#276.
@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 22, 2024

In that case WDYT of #279 as an API-friendly alternative?

@radcortez
Copy link
Member

Here are some additional flamegraphs:

With Wildfly Commons 2.0.0-SNAPSHOT, build with J17:

Screenshot 2024-01-23 at 15 04 06

With SR Commons 2.1.2, and using SR Commons Inet directly in InetSocketAddressConverter:

Screenshot 2024-01-23 at 15 04 34

@franz1981
Copy link
Contributor Author

Yeah @radcortez in both I see that the strong concat indy is very relevant

@dmlloyd
Copy link
Collaborator

dmlloyd commented Jan 23, 2024

Yeah we all agree about the string concat issue. Now we just need to all agree that string concat is not the subject of this PR nor the subject of this discussion. 😉

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 23, 2024

I see @dmlloyd that @radcortez has been stucked with me on the previous pr ihih

dmlloyd added a commit to dmlloyd/smallrye-common that referenced this pull request Jan 23, 2024
Also, use the new API to construct the global IP address constants, to avoid unnecessary `String` construction.

Supercedes smallrye#276.
@dmlloyd
Copy link
Collaborator

dmlloyd commented Apr 3, 2024

@franz1981 do we still want to pursue this further?

@franz1981
Copy link
Contributor Author

We can ask @radcortez if dmlloyd@2274755 (which I like) superseded and fixed this same issue; if it does it, we can safely close it!

@radcortez
Copy link
Member

radcortez commented Apr 3, 2024

Correct.

@radcortez radcortez closed this Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants