-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/IB/MLX5/DC: Access dcis by index, ASAN: relocating dcis buffer in poll_tx #10270
base: master
Are you sure you want to change the base?
Conversation
@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp) | |||
op->handler(op, resp); | |||
} | |||
|
|||
static UCS_F_ALWAYS_INLINE void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a particular reason to change them to macro? in general seems more readable/safer to use functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes because we want to access the txqp by dci index inside the loop to avoid use-after-free in case the dcis array resize during one of the completions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we could do that while still keeping them as inline functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not without code duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok but I still do not see why we need macros. They only use their input parameters, I did not see ## constructs, and did not find different usage with different types of arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An inline function won't be able to do actual code generation that accesses the DCI array by index at every iteration of the "for each outstanding" loop, the root cause is that the loop is calling for completions callbacks which can trigger an array resize in the middle of the loop making the iterator pointing at the old buffer, using a macro we can enforce access through the ucs_array API every single iteration so even if the array was relocated we will access the new buffer, the input parameters here are an actual expression ucs_dc_mlx5_iface_dci(iface, dci)->txwq
so we are not passing a pointer to txwq but passing the entire access to the most updated buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks now I see it. ucs_queue_for_each_extract
makes 3 lookups of txwq per iteration, and log could be misleading, is passing a callback / putting for each
loop outside or having outstanding
at constant memory position not feasible for perf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use a callback to fetch/dequeue next queue element would not have perf impact
ucs_array_buffer_free(old_buffer_p); | ||
#endif | ||
} | ||
|
||
static UCS_F_ALWAYS_INLINE unsigned | ||
uct_dc_mlx5_poll_tx(uct_dc_mlx5_iface_t *iface, int poll_flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it enough to fix only code in uct_dc_mlx5_poll_tx
?
I see that uct_dc_mlx5_iface_resize_and_fill_dcis
may be called directly/indirectly from:
- init functions, hopefully ok:
uct_dc_mlx5_ep_basic_init
,uct_dc_mlx5_ep_basic_init
,uct_dc_mlx5_iface_init_dcis_array
uct_dc_mlx5_iface_dci_can_alloc_or_create
->uct_dc_mlx5_iface_dci_get
, which is called from many places:
UCT_DC_MLX5_CHECK_DCI_RES
uct_dc_mlx5_ep_fc_pure_grant_send
uct_dc_mlx5_ep_fc_hard_req_send
UCT_DC_MLX5_CHECK_RES
- MANY references!UCT_DC_CHECK_RES_PTR
Basically this function may be called indirectly from dozens of different flows, and it's hard to track all of them.. And each workflow caches DCI internals (txqp, txwq) into local variables, so I'm not sure we can easily guarantee that it always works reliably. Even if we fix it now, it is very error-prone and it's quite hard to find the root cause
One option to make it 100% safe is to allocate TX structures on heap:
typedef struct uct_dc_dci {
uct_rc_txqp_t *txqp; /* DCI qp */
uct_ib_mlx5_txwq_t *txwq; /* DCI mlx5 wq */
By doing this all the other problems are solved:
- DCI object becomes "primitive" in the sense we don't even need advanced copy logic.
- It's safe to cache those fields from any workflow, no need to access by index
I think it has almost zero overhead in terms of performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iyastreb it is indeed error-prone, though your suggestion would introduce a performance penalty on fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reflected a bit more on this, and I think the argument of performance penalty is questionable.
Because with heap allocation we resolve the tx object once (ok, potentially with a cache miss) and then use it everywhere down the flow. With index approach we resolve tx (presumably) faster, but do this resolution by index multiple times.. At the end it might be even slower, depending on how many resolution by index we have in the workflow.
Personally I believe that this penalty is negligible, and we should prefer safety and clean design over micro-optimisations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@roiedanino Any comments on that?
src/uct/ib/mlx5/dc/dc_mlx5.c
Outdated
size_t buffer_size = sizeof(uct_dc_dci_t) * | ||
ucs_array_capacity(&iface->tx.dcis); | ||
size_t num_dcis = ucs_array_length(&iface->tx.dcis); | ||
void *old_buffer_p = iface->tx.dcis.buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use uct_dc_dci_t *
instead of void *
and below
src/uct/ib/mlx5/dc/dc_mlx5.c
Outdated
size_t buffer_size = sizeof(uct_dc_dci_t) * | ||
ucs_array_capacity(&iface->tx.dcis); | ||
size_t num_dcis = ucs_array_length(&iface->tx.dcis); | ||
void *old_buffer_p = iface->tx.dcis.buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use ucs_array_extract_buffer()
or ucs_array_begin()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or could this whole malloc+copy procedure be bundled to some ucs_array_ specific macro, returning the old_buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we have ucs_array_grow
but it checks whether we need to increase capacity and if not it won't allocate a new buffer
src/uct/ib/mlx5/dc/dc_mlx5.c
Outdated
hw_ci = ntohs(cqe->wqe_counter); | ||
|
||
ucs_trace_poll("dc iface %p tx_cqe: dci[%d] txqp %p hw_ci %d", | ||
iface, dci_index, txqp, hw_ci); | ||
iface, dci_index, &uct_dc_mlx5_iface_dci(iface, dci_index)->txqp , hw_ci); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra space
src/uct/ib/mlx5/dc/dc_mlx5.c
Outdated
@@ -267,14 +270,41 @@ static void uct_dc_mlx5_iface_progress_enable(uct_iface_h tl_iface, unsigned fla | |||
uct_base_iface_progress_enable_cb(&iface->super.super, iface->progress, flags); | |||
} | |||
|
|||
static void uct_dc_mlx5_cleanup_asan_old_dcis_buffer(uct_dc_mlx5_iface_t *iface) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe uct_dc_mlx5_asan_cleanup..
to have same name prefix as below?
src/uct/ib/mlx5/dc/dc_mlx5.h
Outdated
@@ -360,6 +367,11 @@ struct uct_dc_mlx5_iface { | |||
uint16_t flags; | |||
|
|||
uct_ud_mlx5_iface_common_t ud_common; | |||
|
|||
#ifdef __SANITIZE_ADDRESS__ | |||
void * old_dcis_buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use specific dci entry type?
@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp) | |||
op->handler(op, resp); | |||
} | |||
|
|||
static UCS_F_ALWAYS_INLINE void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok but I still do not see why we need macros. They only use their input parameters, I did not see ## constructs, and did not find different usage with different types of arguments.
…ixes, clang-format
src/uct/ib/rc/base/rc_ep.h
Outdated
do { \ | ||
uct_rc_iface_send_op_t *op; \ | ||
\ | ||
ucs_trace_poll("txqp %p complete ops up to sn %d", _txqp, _sn); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_txqp
could be misleading if pointer can change each iteration...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better than nothing? or maybe I should put it inside the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes probably use it in the loop only, or move it outside in rc and dc
@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp) | |||
op->handler(op, resp); | |||
} | |||
|
|||
static UCS_F_ALWAYS_INLINE void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks now I see it. ucs_queue_for_each_extract
makes 3 lookups of txwq per iteration, and log could be misleading, is passing a callback / putting for each
loop outside or having outstanding
at constant memory position not feasible for perf?
@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp) | |||
op->handler(op, resp); | |||
} | |||
|
|||
static UCS_F_ALWAYS_INLINE void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use a callback to fetch/dequeue next queue element would not have perf impact
src/uct/ib/mlx5/dc/dc_mlx5.h
Outdated
@@ -61,6 +61,13 @@ struct ibv_ravh { | |||
(_self)->flags |= UCT_DC_MLX5_IFACE_FLAG_##_flag_name##_FULL_HANDSHAKE; \ | |||
} | |||
|
|||
#ifdef __SANITIZE_ADDRESS__ | |||
#define UCT_DC_MLX5_ASAN_RELOCATE_DCIS_BUFFER(_iface) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: the macro is BUFFER
, while the func is array
. Is that intentional?
src/uct/ib/mlx5/dc/dc_mlx5.c
Outdated
size_t buffer_size = sizeof(uct_dc_dci_t) * | ||
ucs_array_capacity(&iface->tx.dcis); | ||
size_t num_dcis = ucs_array_length(&iface->tx.dcis); | ||
uct_dc_dci_t *old_buffer_p = ucs_array_begin(&iface->tx.dcis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems both old
and new
are buffer
, why the 1st with _p
and the second w/o?
…ning issue in random policy (might be asan false positive)
… poisening tx_waitq
…o maximum capacity
What
Why ?
Suggesting a future proof fix for the issues caused by dynamic resizing of dcis array (use after free caused by using an old pointer after resize).