Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/IB/MLX5/DC: Access dcis by index, ASAN: relocating dcis buffer in poll_tx #10270

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

roiedanino
Copy link
Contributor

What

  • Access DCIs by index instead of by pointer
  • in poll_tx: relocate dcis array so asan would catch if memory violation is done
  • Set default dcis array capacity to 1.

Why ?

Suggesting a future proof fix for the issues caused by dynamic resizing of dcis array (use after free caused by using an old pointer after resize).

@roiedanino roiedanino self-assigned this Oct 31, 2024
src/ucs/debug/debug.c Outdated Show resolved Hide resolved
src/ucs/debug/debug.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/dc/dc_mlx5.c Outdated Show resolved Hide resolved
@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp)
op->handler(op, resp);
}

static UCS_F_ALWAYS_INLINE void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a particular reason to change them to macro? in general seems more readable/safer to use functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes because we want to access the txqp by dci index inside the loop to avoid use-after-free in case the dcis array resize during one of the completions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we could do that while still keeping them as inline functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without code duplication

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but I still do not see why we need macros. They only use their input parameters, I did not see ## constructs, and did not find different usage with different types of arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An inline function won't be able to do actual code generation that accesses the DCI array by index at every iteration of the "for each outstanding" loop, the root cause is that the loop is calling for completions callbacks which can trigger an array resize in the middle of the loop making the iterator pointing at the old buffer, using a macro we can enforce access through the ucs_array API every single iteration so even if the array was relocated we will access the new buffer, the input parameters here are an actual expression ucs_dc_mlx5_iface_dci(iface, dci)->txwq so we are not passing a pointer to txwq but passing the entire access to the most updated buffer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks now I see it. ucs_queue_for_each_extract makes 3 lookups of txwq per iteration, and log could be misleading, is passing a callback / putting for each loop outside or having outstanding at constant memory position not feasible for perf?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use a callback to fetch/dequeue next queue element would not have perf impact

ucs_array_buffer_free(old_buffer_p);
#endif
}

static UCS_F_ALWAYS_INLINE unsigned
uct_dc_mlx5_poll_tx(uct_dc_mlx5_iface_t *iface, int poll_flags)
Copy link
Contributor

@iyastreb iyastreb Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it enough to fix only code in uct_dc_mlx5_poll_tx?
I see that uct_dc_mlx5_iface_resize_and_fill_dcis may be called directly/indirectly from:

  • init functions, hopefully ok: uct_dc_mlx5_ep_basic_init, uct_dc_mlx5_ep_basic_init, uct_dc_mlx5_iface_init_dcis_array
  • uct_dc_mlx5_iface_dci_can_alloc_or_create -> uct_dc_mlx5_iface_dci_get, which is called from many places:
  1. UCT_DC_MLX5_CHECK_DCI_RES
    uct_dc_mlx5_ep_fc_pure_grant_send
    uct_dc_mlx5_ep_fc_hard_req_send
    UCT_DC_MLX5_CHECK_RES - MANY references!
  2. UCT_DC_CHECK_RES_PTR

Basically this function may be called indirectly from dozens of different flows, and it's hard to track all of them.. And each workflow caches DCI internals (txqp, txwq) into local variables, so I'm not sure we can easily guarantee that it always works reliably. Even if we fix it now, it is very error-prone and it's quite hard to find the root cause

One option to make it 100% safe is to allocate TX structures on heap:

typedef struct uct_dc_dci {
    uct_rc_txqp_t                *txqp; /* DCI qp */
    uct_ib_mlx5_txwq_t           *txwq; /* DCI mlx5 wq */

By doing this all the other problems are solved:

  1. DCI object becomes "primitive" in the sense we don't even need advanced copy logic.
  2. It's safe to cache those fields from any workflow, no need to access by index
    I think it has almost zero overhead in terms of performance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iyastreb it is indeed error-prone, though your suggestion would introduce a performance penalty on fast path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reflected a bit more on this, and I think the argument of performance penalty is questionable.

Because with heap allocation we resolve the tx object once (ok, potentially with a cache miss) and then use it everywhere down the flow. With index approach we resolve tx (presumably) faster, but do this resolution by index multiple times.. At the end it might be even slower, depending on how many resolution by index we have in the workflow.

Personally I believe that this penalty is negligible, and we should prefer safety and clean design over micro-optimisations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roiedanino Any comments on that?

size_t buffer_size = sizeof(uct_dc_dci_t) *
ucs_array_capacity(&iface->tx.dcis);
size_t num_dcis = ucs_array_length(&iface->tx.dcis);
void *old_buffer_p = iface->tx.dcis.buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use uct_dc_dci_t * instead of void * and below

size_t buffer_size = sizeof(uct_dc_dci_t) *
ucs_array_capacity(&iface->tx.dcis);
size_t num_dcis = ucs_array_length(&iface->tx.dcis);
void *old_buffer_p = iface->tx.dcis.buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use ucs_array_extract_buffer() or ucs_array_begin()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or could this whole malloc+copy procedure be bundled to some ucs_array_ specific macro, returning the old_buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we have ucs_array_grow but it checks whether we need to increase capacity and if not it won't allocate a new buffer

hw_ci = ntohs(cqe->wqe_counter);

ucs_trace_poll("dc iface %p tx_cqe: dci[%d] txqp %p hw_ci %d",
iface, dci_index, txqp, hw_ci);
iface, dci_index, &uct_dc_mlx5_iface_dci(iface, dci_index)->txqp , hw_ci);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space

@@ -267,14 +270,41 @@ static void uct_dc_mlx5_iface_progress_enable(uct_iface_h tl_iface, unsigned fla
uct_base_iface_progress_enable_cb(&iface->super.super, iface->progress, flags);
}

static void uct_dc_mlx5_cleanup_asan_old_dcis_buffer(uct_dc_mlx5_iface_t *iface)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe uct_dc_mlx5_asan_cleanup.. to have same name prefix as below?

@@ -360,6 +367,11 @@ struct uct_dc_mlx5_iface {
uint16_t flags;

uct_ud_mlx5_iface_common_t ud_common;

#ifdef __SANITIZE_ADDRESS__
void * old_dcis_buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use specific dci entry type?

@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp)
op->handler(op, resp);
}

static UCS_F_ALWAYS_INLINE void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but I still do not see why we need macros. They only use their input parameters, I did not see ## constructs, and did not find different usage with different types of arguments.

src/uct/ib/mlx5/rc/rc_mlx5.h Outdated Show resolved Hide resolved
do { \
uct_rc_iface_send_op_t *op; \
\
ucs_trace_poll("txqp %p complete ops up to sn %d", _txqp, _sn); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_txqp could be misleading if pointer can change each iteration...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better than nothing? or maybe I should put it inside the loop?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes probably use it in the loop only, or move it outside in rc and dc

@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp)
op->handler(op, resp);
}

static UCS_F_ALWAYS_INLINE void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks now I see it. ucs_queue_for_each_extract makes 3 lookups of txwq per iteration, and log could be misleading, is passing a callback / putting for each loop outside or having outstanding at constant memory position not feasible for perf?

@@ -441,31 +470,6 @@ uct_rc_txqp_completion_op(uct_rc_iface_send_op_t *op, const void *resp)
op->handler(op, resp);
}

static UCS_F_ALWAYS_INLINE void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use a callback to fetch/dequeue next queue element would not have perf impact

@roiedanino roiedanino added the WIP-DNM Work in progress / Do not review label Nov 17, 2024
@@ -61,6 +61,13 @@ struct ibv_ravh {
(_self)->flags |= UCT_DC_MLX5_IFACE_FLAG_##_flag_name##_FULL_HANDSHAKE; \
}

#ifdef __SANITIZE_ADDRESS__
#define UCT_DC_MLX5_ASAN_RELOCATE_DCIS_BUFFER(_iface) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: the macro is BUFFER, while the func is array. Is that intentional?

size_t buffer_size = sizeof(uct_dc_dci_t) *
ucs_array_capacity(&iface->tx.dcis);
size_t num_dcis = ucs_array_length(&iface->tx.dcis);
uct_dc_dci_t *old_buffer_p = ucs_array_begin(&iface->tx.dcis);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems both old and new are buffer, why the 1st with _p and the second w/o?

@roiedanino roiedanino removed the WIP-DNM Work in progress / Do not review label Nov 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants