UCT/IB/MLX5/DC: Access dcis by index, ASAN: relocating dcis buffer in poll_tx #10270

roiedanino · 2024-10-31T17:29:44Z

What

Access DCIs by index instead of by pointer.
In poll_tx: relocate the dcis array so that ASAN catches any access made through a stale pointer.
Set the default dcis array capacity to 1.

Why?

Suggesting a future-proof fix for the issues caused by dynamic resizing of the dcis array (use-after-free caused by using an old pointer after a resize).
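The failure mode described above can be sketched in isolation. This is a minimal illustration, not the UCX code: `demo_dci_t`, `demo_dci_array_t` and `demo_dci_array_append` are hypothetical stand-ins for the real DCI entry and for `ucs_array`. It shows the safe pattern, where only the index is kept and the entry is looked up again after the array may have grown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a DCI entry; not the real uct_dc_dci_t. */
typedef struct {
    int outstanding;
} demo_dci_t;

/* Hypothetical growable array; the real code uses ucs_array. */
typedef struct {
    demo_dci_t *buffer;
    size_t      length;
    size_t      capacity;
} demo_dci_array_t;

/* Append one entry, doubling the capacity when full. realloc() may move
 * the buffer, which invalidates every pointer taken into it earlier. */
static int demo_dci_array_append(demo_dci_array_t *array, int outstanding)
{
    if (array->length == array->capacity) {
        size_t new_capacity    = (array->capacity == 0) ?
                                 1 : (array->capacity * 2);
        demo_dci_t *new_buffer = realloc(array->buffer,
                                         new_capacity * sizeof(demo_dci_t));
        if (new_buffer == NULL) {
            return -1;
        }
        array->buffer   = new_buffer;
        array->capacity = new_capacity;
    }
    array->buffer[array->length++].outstanding = outstanding;
    return 0;
}

/* Keep the index, not the pointer, and resolve the entry again after any
 * call that may grow the array. Returns the value stored in entry 0. */
static int demo_access_by_index(void)
{
    demo_dci_array_t array = {NULL, 0, 0};
    size_t dci_index       = 0;
    int result, i;

    if (demo_dci_array_append(&array, 7) != 0) {
        return -1;
    }
    /* Caching &array.buffer[0] here would be the unsafe variant. */
    for (i = 0; i < 64; ++i) {
        if (demo_dci_array_append(&array, i) != 0) {
            free(array.buffer);
            return -1;
        }
    }
    result = array.buffer[dci_index].outstanding;
    free(array.buffer);
    return result;
}
```

A pointer cached before the loop could still refer to the original allocation after the appends, which is the use-after-free this PR is about.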

src/ucs/debug/debug.c

src/uct/ib/mlx5/dc/dc_mlx5.c

tvegas1 · 2024-11-01T08:36:26Z

src/uct/ib/rc/base/rc_ep.h

@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp)
    op->handler(op, resp);
 }

-static UCS_F_ALWAYS_INLINE void


Is there a particular reason to change them to macros? In general, functions seem more readable and safer.

Yes, because we want to access the txqp by DCI index inside the loop, to avoid a use-after-free in case the dcis array is resized during one of the completions.

But we could do that while still keeping them as inline functions?

Not without code duplication.

OK, but I still do not see why we need macros. They only use their input parameters, I did not see ## constructs, and I did not find different usages with different types of arguments.

An inline function cannot generate code that accesses the DCI array by index at every iteration of the "for each outstanding" loop. The root cause is that the loop invokes completion callbacks, which can trigger an array resize in the middle of the loop and leave the iterator pointing at the old buffer. With a macro we can enforce access through the ucs_array API on every single iteration, so even if the array was relocated we access the new buffer. The input parameter here is an actual expression, uct_dc_mlx5_iface_dci(iface, dci)->txwq, so we are not passing a pointer to txwq but the entire access path to the most up-to-date buffer.
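The difference can be shown with a small sketch. All names here (`demo_iface_t`, `demo_iface_dci`, `DEMO_COMPLETE_ALL`) are hypothetical and only mimic the shape of the real macros. The macro argument is an expression that is evaluated again on every use, so a relocation done by the callback is picked up on the next access.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types; they only mimic the shape of the real structures. */
typedef struct {
    int pending;
} demo_dci_t;

typedef struct {
    demo_dci_t *dcis;
    size_t      capacity;
} demo_iface_t;

/* Resolve a DCI by index against the current buffer. */
static demo_dci_t *demo_iface_dci(demo_iface_t *iface, size_t index)
{
    return &iface->dcis[index];
}

/* Simulated completion callback that relocates the array. The new buffer
 * is allocated before the old one is freed, so the address always changes. */
static void demo_grow_cb(demo_iface_t *iface)
{
    size_t new_capacity = iface->capacity * 2;
    demo_dci_t *new_buf = calloc(new_capacity, sizeof(demo_dci_t));

    if (new_buf == NULL) {
        return; /* keep the old buffer on allocation failure */
    }
    memcpy(new_buf, iface->dcis, iface->capacity * sizeof(demo_dci_t));
    free(iface->dcis);
    iface->dcis     = new_buf;
    iface->capacity = new_capacity;
}

/* _dci_expr is substituted textually, so the lookup is repeated on every
 * use instead of being evaluated once at the call site. */
#define DEMO_COMPLETE_ALL(_iface, _dci_expr) \
    do { \
        while ((_dci_expr)->pending > 0) { \
            (_dci_expr)->pending--; \
            demo_grow_cb(_iface); \
        } \
    } while (0)

/* Returns the number of operations still pending after the loop. */
static int demo_macro_loop(void)
{
    demo_iface_t iface;
    int left;

    iface.capacity = 1;
    iface.dcis     = calloc(1, sizeof(demo_dci_t));
    if (iface.dcis == NULL) {
        return -1;
    }
    iface.dcis[0].pending = 3;
    DEMO_COMPLETE_ALL(&iface, demo_iface_dci(&iface, 0));
    left = iface.dcis[0].pending;
    free(iface.dcis);
    return left;
}
```

An inline function taking `demo_dci_t *` would receive the pointer value computed once at the call site and would keep using the freed buffer after the first callback.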

Thanks, now I see it. ucs_queue_for_each_extract makes 3 lookups of txwq per iteration, and the log could be misleading. Is passing a callback, putting the for-each loop outside, or keeping outstanding at a constant memory position not feasible for performance reasons?

Maybe using a callback to fetch/dequeue the next queue element would not have a perf impact.
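One way the callback idea could look is sketched below, assuming a resolver callback that returns the current address of the queue owner. The names (`demo_txqp_t`, `demo_ctx_t`, `demo_complete_all`) are hypothetical and this is not what the PR implements; whether the compiler inlines the indirect call on the fast path is not shown here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration only. */
typedef struct {
    int pending;
} demo_txqp_t;

typedef struct {
    demo_txqp_t *txqps; /* relocatable buffer */
    size_t       count;
    size_t       index; /* which entry the loop works on */
} demo_ctx_t;

typedef demo_txqp_t *(*demo_resolve_cb_t)(void *arg);
typedef void (*demo_complete_cb_t)(void *arg);

/* Inline loop that asks the resolver for the entry on every access. */
static inline int demo_complete_all(demo_resolve_cb_t resolve,
                                    demo_complete_cb_t complete, void *arg)
{
    int completed = 0;

    while (resolve(arg)->pending > 0) {
        resolve(arg)->pending--;
        complete(arg); /* may relocate the buffer */
        completed++;
    }
    return completed;
}

static demo_txqp_t *demo_resolve(void *arg)
{
    demo_ctx_t *ctx = arg;

    return &ctx->txqps[ctx->index];
}

/* Simulated completion that moves the buffer to a new allocation. */
static void demo_relocate(void *arg)
{
    demo_ctx_t *ctx      = arg;
    demo_txqp_t *new_buf = malloc(ctx->count * sizeof(demo_txqp_t));

    if (new_buf == NULL) {
        return;
    }
    memcpy(new_buf, ctx->txqps, ctx->count * sizeof(demo_txqp_t));
    free(ctx->txqps);
    ctx->txqps = new_buf;
}

/* Returns 1 if all 4 operations completed and none are left pending. */
static int demo_callback_loop(void)
{
    demo_ctx_t ctx;
    int completed, left;

    ctx.count = 2;
    ctx.index = 1;
    ctx.txqps = calloc(ctx.count, sizeof(demo_txqp_t));
    if (ctx.txqps == NULL) {
        return -1;
    }
    ctx.txqps[1].pending = 4;
    completed = demo_complete_all(demo_resolve, demo_relocate, &ctx);
    left      = ctx.txqps[1].pending;
    free(ctx.txqps);
    return (completed == 4) && (left == 0);
}
```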

iyastreb · 2024-11-01T10:35:24Z

src/uct/ib/mlx5/dc/dc_mlx5.c

+    ucs_array_buffer_free(old_buffer_p);
+#endif
+}
+
 static UCS_F_ALWAYS_INLINE unsigned
 uct_dc_mlx5_poll_tx(uct_dc_mlx5_iface_t *iface, int poll_flags)


Is it enough to fix only code in uct_dc_mlx5_poll_tx?
I see that uct_dc_mlx5_iface_resize_and_fill_dcis may be called directly/indirectly from:

init functions, hopefully ok: uct_dc_mlx5_ep_basic_init, uct_dc_mlx5_ep_basic_init, uct_dc_mlx5_iface_init_dcis_array

uct_dc_mlx5_iface_dci_can_alloc_or_create -> uct_dc_mlx5_iface_dci_get, which is called from many places:

UCT_DC_MLX5_CHECK_DCI_RES
uct_dc_mlx5_ep_fc_pure_grant_send
uct_dc_mlx5_ep_fc_hard_req_send
UCT_DC_MLX5_CHECK_RES - MANY references!

UCT_DC_CHECK_RES_PTR

Basically, this function may be called indirectly from dozens of different flows, and it's hard to track all of them. Each workflow caches DCI internals (txqp, txwq) in local variables, so I'm not sure we can easily guarantee that it always works reliably. Even if we fix it now, it remains very error-prone, and root causes are quite hard to find.

One option to make it 100% safe is to allocate the TX structures on the heap:

typedef struct uct_dc_dci {
    uct_rc_txqp_t      *txqp; /* DCI qp */
    uct_ib_mlx5_txwq_t *txwq; /* DCI mlx5 wq */

By doing this all the other problems are solved:

DCI object becomes "primitive" in the sense we don't even need advanced copy logic.

It's safe to cache those fields from any workflow, no need to access by index.
I think it has almost zero overhead in terms of performance.
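The property this proposal relies on can be sketched as follows. It is a simplified model with hypothetical names (`demo_txqp_t`, `demo_dci_t`), not the real UCX structures: the array entry only holds a pointer, so relocating the array moves the pointer value but not the object it refers to.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical TX object that lives on the heap at a fixed address. */
typedef struct {
    int available;
} demo_txqp_t;

/* Hypothetical DCI entry that holds only a pointer to the TX object. */
typedef struct {
    demo_txqp_t *txqp;
} demo_dci_t;

/* Returns 1 if a pointer cached before the array relocation is still the
 * object referenced by the relocated array and is still usable. */
static int demo_cached_pointer_survives_resize(void)
{
    size_t capacity  = 1;
    demo_dci_t *dcis = malloc(capacity * sizeof(demo_dci_t));
    demo_dci_t *bigger;
    demo_txqp_t *cached;
    int ok;

    if (dcis == NULL) {
        return -1;
    }
    dcis[0].txqp = malloc(sizeof(demo_txqp_t));
    if (dcis[0].txqp == NULL) {
        free(dcis);
        return -1;
    }
    dcis[0].txqp->available = 16;
    cached                  = dcis[0].txqp; /* cached by some workflow */

    /* Relocate the array of entries to a new, larger buffer. */
    bigger = malloc(2 * capacity * sizeof(demo_dci_t));
    if (bigger == NULL) {
        free(cached);
        free(dcis);
        return -1;
    }
    memcpy(bigger, dcis, capacity * sizeof(demo_dci_t));
    bigger[1].txqp = NULL;
    free(dcis);
    dcis = bigger;

    cached->available--; /* still valid: the TX object did not move */
    ok = (dcis[0].txqp == cached) && (cached->available == 15);

    free(cached);
    free(dcis);
    return ok;
}
```

The cost is one extra pointer dereference per lookup, which is the fast-path penalty debated in the replies.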

@iyastreb it is indeed error-prone, though your suggestion would introduce a performance penalty on the fast path.

I reflected a bit more on this, and I think the argument of performance penalty is questionable.

Because with heap allocation we resolve the TX object once (OK, potentially with a cache miss) and then use it everywhere down the flow. With the index approach we resolve TX (presumably) faster, but repeat this resolution by index multiple times. In the end it might be even slower, depending on how many resolutions by index we have in the workflow.

Personally I believe that this penalty is negligible, and we should prefer safety and clean design over micro-optimisations.

@roiedanino Any comments on that?

tvegas1 · 2024-11-12T07:27:54Z

src/uct/ib/mlx5/dc/dc_mlx5.c

+    size_t buffer_size = sizeof(uct_dc_dci_t) *
+                         ucs_array_capacity(&iface->tx.dcis);
+    size_t num_dcis    = ucs_array_length(&iface->tx.dcis);
+    void *old_buffer_p = iface->tx.dcis.buffer;


use uct_dc_dci_t * instead of void * and below

tvegas1 · 2024-11-12T07:32:58Z

src/uct/ib/mlx5/dc/dc_mlx5.c

+    size_t buffer_size = sizeof(uct_dc_dci_t) *
+                         ucs_array_capacity(&iface->tx.dcis);
+    size_t num_dcis    = ucs_array_length(&iface->tx.dcis);
+    void *old_buffer_p = iface->tx.dcis.buffer;


maybe use ucs_array_extract_buffer() or ucs_array_begin()

or could this whole malloc+copy procedure be bundled to some ucs_array_ specific macro, returning the old_buffer?

Yes, we have ucs_array_grow, but it checks whether we need to increase capacity, and if not it won't allocate a new buffer.
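A helper of the kind asked about could look roughly like this. It is a sketch with a hypothetical name and signature (`demo_relocate_buffer`), not an existing `ucs_array` macro: it always allocates a new buffer of the same capacity, copies the used elements, and returns the old buffer so that the caller decides when to free or poison it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: unconditionally move a buffer to a fresh allocation.
 * Returns the old buffer on success, or NULL if allocation failed (in which
 * case *buffer_p is left unchanged). Assumes capacity > 0. */
static void *demo_relocate_buffer(void **buffer_p, size_t elem_size,
                                  size_t capacity, size_t length)
{
    void *old_buffer = *buffer_p;
    void *new_buffer = malloc(elem_size * capacity);

    if (new_buffer == NULL) {
        return NULL;
    }
    memcpy(new_buffer, old_buffer, elem_size * length);
    *buffer_p = new_buffer;
    return old_buffer;
}

/* Returns 1 if the data moved to a different address and stayed intact. */
static int demo_relocation_moves_data(void)
{
    int *values = malloc(4 * sizeof(int));
    void *buffer, *old;
    int ok;

    if (values == NULL) {
        return -1;
    }
    values[0] = 10;
    values[1] = 20;
    buffer    = values;

    old = demo_relocate_buffer(&buffer, sizeof(int), 4, 2);
    if (old == NULL) {
        free(values);
        return -1;
    }
    ok = (buffer != old) && (((int*)buffer)[0] == 10) &&
         (((int*)buffer)[1] == 20);

    /* Freeing the old buffer is what lets a build with AddressSanitizer
     * report any later access through a pointer into it. */
    free(old);
    free(buffer);
    return ok;
}
```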

tvegas1 · 2024-11-12T07:35:30Z

src/uct/ib/mlx5/dc/dc_mlx5.c

    hw_ci     = ntohs(cqe->wqe_counter);

    ucs_trace_poll("dc iface %p tx_cqe: dci[%d] txqp %p hw_ci %d",
-                   iface, dci_index, txqp, hw_ci);
+                   iface, dci_index, &uct_dc_mlx5_iface_dci(iface, dci_index)->txqp , hw_ci);


remove extra space

tvegas1 · 2024-11-12T07:43:10Z

src/uct/ib/mlx5/dc/dc_mlx5.c

@@ -267,14 +270,41 @@ static void uct_dc_mlx5_iface_progress_enable(uct_iface_h tl_iface, unsigned fla
    uct_base_iface_progress_enable_cb(&iface->super.super, iface->progress, flags);
 }

+static void uct_dc_mlx5_cleanup_asan_old_dcis_buffer(uct_dc_mlx5_iface_t *iface)


maybe uct_dc_mlx5_asan_cleanup.. to have same name prefix as below?

tvegas1 · 2024-11-12T07:44:39Z

src/uct/ib/mlx5/dc/dc_mlx5.h

@@ -360,6 +367,11 @@ struct uct_dc_mlx5_iface {
    uint16_t                         flags;

    uct_ud_mlx5_iface_common_t       ud_common;
+
+#ifdef __SANITIZE_ADDRESS__
+    void *                           old_dcis_buffer;


use specific dci entry type?

src/uct/ib/mlx5/rc/rc_mlx5.h

tvegas1 · 2024-11-12T11:47:57Z

src/uct/ib/rc/base/rc_ep.h

+    do { \
+        uct_rc_iface_send_op_t *op; \
+        \
+        ucs_trace_poll("txqp %p complete ops up to sn %d", _txqp, _sn); \


_txqp could be misleading if pointer can change each iteration...

better than nothing? or maybe I should put it inside the loop?

yes probably use it in the loop only, or move it outside in rc and dc

gleon99 · 2024-11-18T09:30:05Z

src/uct/ib/mlx5/dc/dc_mlx5.h

@@ -61,6 +61,13 @@ struct ibv_ravh {
        (_self)->flags |= UCT_DC_MLX5_IFACE_FLAG_##_flag_name##_FULL_HANDSHAKE; \
    }

+#ifdef __SANITIZE_ADDRESS__
+#define UCT_DC_MLX5_ASAN_RELOCATE_DCIS_BUFFER(_iface) \


Minor: the macro is BUFFER, while the func is array. Is that intentional?

gleon99 · 2024-11-18T09:30:56Z

src/uct/ib/mlx5/dc/dc_mlx5.c

+    size_t buffer_size         = sizeof(uct_dc_dci_t) *
+                                 ucs_array_capacity(&iface->tx.dcis);
+    size_t num_dcis            = ucs_array_length(&iface->tx.dcis);
+    uct_dc_dci_t *old_buffer_p = ucs_array_begin(&iface->tx.dcis);


It seems both old and new are buffers, so why is the first named with _p and the second without?

UCT/IB/MLX5/DC: ASAN - relocating dcis buffer in poll_tx

6056842

roiedanino added the Bugfix label Oct 31, 2024

roiedanino requested a review from yosefe October 31, 2024 17:29

roiedanino self-assigned this Oct 31, 2024

tvegas1 reviewed Nov 1, 2024

iyastreb reviewed Nov 1, 2024

UCT/IB/MLX5/DC: poisening old dcis buffer

99198d4

tvegas1 reviewed Nov 12, 2024

UCT/IB/MLX5/DC: added ( ) in macro, renaming functions, compilation f…

001ea47

…ixes, clang-format

tvegas1 reviewed Nov 12, 2024

UCT/IB/MLX5/DC: old_dcis_buffer: void* -> uct_dc_dci_t*

7141c12

roiedanino added the WIP-DNM Work in progress / Do not review label Nov 17, 2024

gleon99 reviewed Nov 18, 2024

roiedanino added 4 commits November 18, 2024 16:26

UCT/IB/MLX5/DC: fixed CR comments, still need to fix old_buffer poise…

cf5087d

…ning issue in random policy (might be asan false positive)

UCT/IB/MLX5/DC: can't relocate DCIs buffer in shared policies without…

4fde0e9

… poisening tx_waitq

UCT/IB/MLX5/DC: WIP - need to move arbiter heads as well

d871344

UCT/IB/MLX5/DC: at random policy don't relocate and initialize dcis t…

cb8dd1e

…o maximum capacity

roiedanino removed the WIP-DNM Work in progress / Do not review label Nov 24, 2024

UCT/IB/MLX5/DC: config dcis_init_capacity should be unsigned long to …

9fe11b6

…support auto
