Expand NUMA-aware resource allocation feature to finer scope #2007

Open
kyujin-cho opened this issue Apr 8, 2024 · 0 comments

Main idea

Follow-up of #491.
It would be great to extend our NUMA-aware resource allocation feature to a finer scope, for example by taking each accelerator's parent PCIe switch into account.

(Agent) Expansion of AffinityMap

Currently, the device distance calculation is based only on the PCI-reported NUMA node index.

We could extend this distance calculation logic to consider additional information from the PCIe bus address. For example, we could use the bus number to further distinguish PCIe switches.

First, we need to check if we can determine the existence and layout of PCIe switches from the bus addresses.
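
To make the idea concrete, here is a minimal sketch of a bus-aware distance function. It is not the existing AffinityMap code: `PCIAddress`, `parse_pci_address`, `device_distance`, and the 0 to 3 distance scale are hypothetical names and values chosen for illustration. It also assumes that two devices sharing the same PCI domain and bus number are "closer" than two devices that only share a NUMA node, which is exactly the assumption the check above still has to validate.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Matches the usual "domain:bus:device.function" form, e.g. "0000:3b:00.0".
_BDF_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


@dataclass(frozen=True)
class PCIAddress:
    """Hypothetical helper type holding the parsed parts of a PCI address."""

    domain: int
    bus: int
    device: int
    function: int


def parse_pci_address(addr: str) -> PCIAddress:
    """Split a PCI bus address string into its numeric components."""
    m = _BDF_RE.match(addr.strip())
    if m is None:
        raise ValueError(f"not a PCI bus address: {addr!r}")
    domain, bus, device, function = (int(g, 16) for g in m.groups())
    return PCIAddress(domain, bus, device, function)


def device_distance(addr_a: str, numa_a: int, addr_b: str, numa_b: int) -> int:
    """Return an illustrative distance between two devices (smaller is closer).

    0: same PCI function
    1: same NUMA node and same domain/bus (ASSUMED to mean "same switch")
    2: same NUMA node, different bus
    3: different NUMA nodes
    """
    a = parse_pci_address(addr_a)
    b = parse_pci_address(addr_b)
    if a == b:
        return 0
    if numa_a != numa_b:
        return 3
    if (a.domain, a.bus) == (b.domain, b.bus):
        return 1
    return 2


print(device_distance("0000:3b:00.0", 0, "0000:3b:01.0", 0))
print(device_distance("0000:3b:00.0", 0, "0000:af:00.0", 1))
```

If equal bus numbers turn out not to identify a shared switch on real hardware, only the middle tier of this function would need a different predicate; the NUMA-only behavior is preserved as the outermost tier.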

(Manager) Hierarchical agent selection strategy for multi-node sessions

This will be a follow-up of #1394 and #1655.

It will include additional location metadata in agent heartbeats (e.g., rack number, rack group) and take that metadata into account when distributing the containers of a single cluster session.
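
A rough sketch of what such a hierarchical selection could look like is shown below. Everything in it is hypothetical: the `AgentLocation` type, its `rack_group` and `rack` fields, and the `select_agents` function are illustrative names, not part of the current manager or heartbeat schema. It also ignores resource capacity entirely and only demonstrates the "narrowest scope first" ordering: try to fit the whole session into one rack, then into one rack group, then fall back to any agents.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable


@dataclass(frozen=True)
class AgentLocation:
    """Hypothetical location metadata reported by an agent."""

    agent_id: str
    rack_group: str
    rack: str


def select_agents(agents: Iterable[AgentLocation], count: int) -> list[AgentLocation]:
    """Pick `count` agents from the narrowest location scope that can hold them.

    Returns an empty list when there are not enough agents at all.
    """
    pool = list(agents)
    if count <= 0 or count > len(pool):
        return []
    # Scopes ordered from narrowest to widest.
    scopes: list[Callable[[AgentLocation], Hashable]] = [
        lambda a: (a.rack_group, a.rack),  # same rack
        lambda a: a.rack_group,            # same rack group
        lambda a: None,                    # anywhere
    ]
    for scope_key in scopes:
        buckets: dict[Hashable, list[AgentLocation]] = defaultdict(list)
        for agent in pool:
            buckets[scope_key(agent)].append(agent)
        fitting = [b for b in buckets.values() if len(b) >= count]
        if fitting:
            # Prefer the smallest bucket that fits, to keep large ones free.
            chosen = min(fitting, key=len)
            return sorted(chosen, key=lambda a: a.agent_id)[:count]
    return []


agents = [
    AgentLocation("a1", "g1", "r1"),
    AgentLocation("a2", "g1", "r1"),
    AgentLocation("a3", "g1", "r2"),
    AgentLocation("a4", "g2", "r3"),
]
print([a.agent_id for a in select_agents(agents, 2)])
print([a.agent_id for a in select_agents(agents, 3)])
```

A real implementation would need to combine this ordering with the existing per-agent resource checks, so the location scope would act as a preference among agents that already satisfy the session's resource request.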

@kyujin-cho kyujin-cho added the type:feature Add new features label Apr 8, 2024
@kyujin-cho kyujin-cho added this to the 24.09 milestone Apr 8, 2024
@achimnol achimnol removed the type:feature Add new features label Oct 18, 2024