Message size Best practice ( 1 send tagged + 1 get vs 2 send tagged) #6528

Blopeur · 2021-03-19T17:24:39Z

Blopeur
Mar 19, 2021

Hi, we are currently using UCX to accelerate a Big Data platform. Part of the requirement is that we need to support both TCP and RDMA transports (in Java).

However, we are struggling a bit to get the best performance across message sizes, and to decide when to select RMA vs bcopy.

The question boils down to this: when is the following flow preferred:

Server sends a tagged message containing "buffer address" + "key" + "size"
Client issues an endpoint getNonBlocking using the info from the server's message

vs. this one:

Server sends two tagged messages: the first contains "buffer size" + a tag, the second contains the data
Client reads the first message and then issues a subsequent recvTaggedNonBlocking using the buffer size and tag

What is the data size cutoff point at which the first flow is preferable to the second (1MB, 4MB, 16MB, 4k, 64k, 128k, 512k, ...)? Also, does the transport type (TCP vs IB) affect the decision?

Note that we assume we have a pre-allocated buffer cache so that we can reuse rkeys, reduce registration, etc.
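To make the first flow concrete, here is a minimal sketch of the control message the server would send ("buffer address" + "key" + "size"), written in plain Java with no UCX dependency. The class name RemoteBufferDescriptor and the wire layout (address, size, rkey length, rkey bytes, little-endian) are assumptions made for illustration only; nothing in UCX prescribes them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical control message for flow 1: where the data lives and how to reach it. */
final class RemoteBufferDescriptor {
    final long address;   // remote virtual address of the registered buffer
    final long size;      // number of bytes the client should fetch
    final byte[] rkey;    // packed remote key, treated here as opaque bytes

    RemoteBufferDescriptor(long address, long size, byte[] rkey) {
        this.address = address;
        this.size = size;
        this.rkey = rkey.clone();
    }

    /** Assumed layout: address (8) | size (8) | rkey length (4) | rkey bytes. */
    ByteBuffer encode() {
        ByteBuffer buf = ByteBuffer.allocate(8 + 8 + 4 + rkey.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(address).putLong(size).putInt(rkey.length).put(rkey);
        buf.flip(); // make the buffer readable from position 0
        return buf;
    }

    /** Parses a message produced by encode(); the caller's buffer position is not moved. */
    static RemoteBufferDescriptor decode(ByteBuffer in) {
        ByteBuffer buf = in.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        long address = buf.getLong();
        long size = buf.getLong();
        int rkeyLen = buf.getInt();
        if (rkeyLen < 0 || rkeyLen > buf.remaining()) {
            throw new IllegalArgumentException("bad rkey length: " + rkeyLen);
        }
        byte[] rkey = new byte[rkeyLen];
        buf.get(rkey);
        return new RemoteBufferDescriptor(address, size, rkey);
    }

    public static void main(String[] args) {
        RemoteBufferDescriptor sent =
                new RemoteBufferDescriptor(0x1000L, 4096L, new byte[] {1, 2, 3});
        RemoteBufferDescriptor got = decode(sent.encode());
        System.out.println("address=" + got.address + " size=" + got.size
                + " rkeyLen=" + got.rkey.length);
    }
}
```

In the second flow the control message would shrink to just the size and a tag, since the data itself follows in a second tagged message.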

Answered by petro-rudenko

petro-rudenko · 2021-03-19T21:39:58Z

petro-rudenko
Mar 19, 2021
Collaborator

From our experience with the SparkUCX project, the best performance comes when you can exchange (address + rkey) beforehand and use the one-sided RDMA API on the critical data path, because it doesn't require active progress on the other side. Of course, that won't work with TCP. So if you can't support an initial metadata exchange, the tagged or active message API would work better, because:

The same code works for both TCP and RDMA transports
UCX decides internally which protocol is best (short send for tiny messages, bcopy for medium ones, and for large ones the same RDMA transfer, with the rkey exchange done by UCX).
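The rule of thumb above can be written down as a tiny decision function. This is only a sketch of the advice as stated in this answer: the names DataPathChooser, Path, and choose are hypothetical, and the rule deliberately ignores message size because the answer gives no cutoff.

```java
/** Hypothetical helper that encodes the advice above; not part of UCX or JUCX. */
final class DataPathChooser {
    enum Path { ONE_SIDED_GET, TAGGED_OR_ACTIVE_MESSAGE }

    /**
     * @param rdmaTransport               true if the connection runs over an RDMA-capable transport
     * @param metadataExchangedBeforehand true if (address + rkey) were exchanged ahead of the data path
     */
    static Path choose(boolean rdmaTransport, boolean metadataExchangedBeforehand) {
        // One-sided access is only the better choice when both conditions hold;
        // otherwise leave protocol selection to UCX via tagged or active messages.
        return (rdmaTransport && metadataExchangedBeforehand)
                ? Path.ONE_SIDED_GET
                : Path.TAGGED_OR_ACTIVE_MESSAGE;
    }

    public static void main(String[] args) {
        System.out.println("RDMA, metadata exchanged: " + choose(true, true));
        System.out.println("TCP: " + choose(false, true));
    }
}
```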

12 replies

shamisp Mar 22, 2021
Maintainer

A memory copy using the CMA mechanism requires a system call, and therefore it is not very effective for medium-size messages. BCOPY was largely designed to support medium-size messages, and therefore other shared memory mechanisms are used instead of BCOPY.

Blopeur Mar 22, 2021
Author

Adding UCX_ZCOPY_THRESH=0 didn't really change anything.

Result from ucx_info:

#
# UCP endpoint
#
#               peer: <no debug data>
#                 lane[0]:  2:self/memory.0 md[2]           -> md[2]/self     am am_bw#0
#                 lane[1]:  6:cma/memory.0 md[5]            -> md[5]/cma      rma_bw#0
#
#                tag_send: 0..<egr/short>..8185..<egr/bcopy>..8192..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..8185..<egr/bcopy>..8192..<rndv>..(inf)
#
#                  rma_bw: mds rndv_rkey_size 9
#
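The tag_send lines in that output read as size ranges: eager short up to the first number, eager bcopy up to the second, rendezvous beyond it. A small sketch of that lookup, with the two thresholds passed in as parameters, might look as follows. The class and method names are hypothetical, and treating each threshold as an inclusive upper bound is an assumption, so the example avoids the exact boundary values.

```java
/** Hypothetical lookup that mirrors how the tag_send ranges above are read. */
final class TagSendProtocol {
    /**
     * @param size      message size in bytes
     * @param shortMax  largest size served by egr/short (8185 in the output above)
     * @param bcopyMax  largest size served by egr/bcopy (8192 for tag_send, 262144 for tag_send_nbr)
     */
    static String protocolFor(long size, long shortMax, long bcopyMax) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (size <= shortMax) {   // ASSUMPTION: boundary treated as inclusive
            return "egr/short";
        }
        if (size <= bcopyMax) {   // ASSUMPTION: boundary treated as inclusive
            return "egr/bcopy";
        }
        return "rndv";
    }

    public static void main(String[] args) {
        long[] sizes = {100, 8190, 100_000, 1_000_000};
        for (long s : sizes) {
            System.out.println(s + " bytes: tag_send=" + protocolFor(s, 8185, 8192)
                    + ", tag_send_nbr=" + protocolFor(s, 8185, 262144));
        }
    }
}
```

With these numbers, a 100,000-byte message falls in the rendezvous range for tag_send but still in the bcopy range for tag_send_nbr.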

Blopeur Mar 22, 2021
Author

Oops, my bad: upgrading to UCX 1.10.0 seems to have fixed that.

Note there is something strange I noticed when upgrading.
I had to change the UCX connection establishment because it was throwing a worker.progress++ == 0 error.

no more TCP ..

petro-rudenko Mar 23, 2021
Collaborator

Can you please send the logs where the exception occurs? Thanks.

Blopeur Mar 23, 2021
Author

I can do even better: describe where and how it happens and how I fixed it. It looks like 1.9 has a potential issue that can trigger a hang during connection establishment.

Original flow, on the client side:

post a tagged send to the server (setup for exchanging the worker address)
post a tagged receive from the server on the worker
call worker.progressRequest()
In the callback, we close the endpoint, create a new endpoint with the new worker address received, and start a thread that loops on "worker.progress() == 0; worker.waitForEvents()"

In 1.9.0, no problem.
In 1.10, we crash with the worker.progress++ == 0 error.

Changing the flow to:

post the tagged send
endpoint.progressRequest(sendRequest)
post the tagged receive
worker.progressRequest(rcvRequest)
if the request succeeds, get the worker address from the buffer and start the thread with the worker.progress loop.

I suspect the second flow is better, and that in 1.9 the worker progress counter was decremented before the callback finished. In 1.10 it is decremented after the callback finishes, which makes it safer.

It was just strange to experience it when upgrading.
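The ordering of the revised flow can be shown with a small simulation that does not use UCX at all. FakeWorker below is a hypothetical stand-in whose progress() completes one pending request per call; it only demonstrates that each request is driven to completion from the calling thread, outside any callback, before the progress thread would be started. Both requests are driven through the same stand-in object for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for a worker; NOT the JUCX API. */
final class FakeWorker {
    static final class Request {
        final String name;
        boolean completed;
        Request(String name) { this.name = name; }
    }

    private final Queue<Request> pending = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    /** Queues a request, playing the role of posting a tagged send or receive. */
    Request post(String name) {
        Request r = new Request(name);
        pending.add(r);
        log.add("post " + name);
        return r;
    }

    /** Completes at most one pending request and returns how many were completed. */
    int progress() {
        Request r = pending.poll();
        if (r == null) {
            return 0;
        }
        r.completed = true;
        log.add("done " + r.name);
        return 1;
    }

    /** Drives progress from the calling thread until the given request completes. */
    void progressRequest(Request r) {
        while (!r.completed) {
            if (progress() == 0) {
                throw new IllegalStateException("request cannot complete: nothing pending");
            }
        }
    }
}

final class HandshakeSketch {
    /** Revised flow: finish send, then finish receive, and only then start the progress thread. */
    static List<String> run() {
        FakeWorker worker = new FakeWorker();
        FakeWorker.Request send = worker.post("send");
        worker.progressRequest(send);
        FakeWorker.Request recv = worker.post("recv");
        worker.progressRequest(recv);
        // At this point no request callback is executing, so it is safe to hand
        // the worker over to a dedicated progress thread (only logged here).
        worker.log.add("start progress thread");
        return worker.log;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```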
