How does Karpenter deal with the new m7i-flex instance type? #4367
Comments
Can I join in and also ask when the "regular" m7i instances will be supported by Karpenter? |
@yuval-almog I'd assume that they'd be supported as soon as they were supported in EC2 Fleet and I'd expect that to be immediately where constraints are unbounded or general to instance family/generation. |
I'm also not sure exactly what Karpenter understands about PVs and the EBS CSI Driver but if that is part of the scheduling logic then there would need to be some code changes to support the new improved attachment limits for |
Discussed with the team. We think we need to make two changes.
|
After chatting with the EC2 team working on m7i-flex, we've decided to not treat this instance type as "burstable". We think this instance type will fit workloads for the majority of Karpenter users. There may be some edge cases where it's not the right fit, if you are running close to 100% node CPU utilization at all times, but I'd expect this to be a small number, so we'd like to generally treat this just like any other
Here's an excerpt from the m7i-flex docs page:
|
@bwagner5 I think you're right about My suggestion would be to add a label to differentiate between standard, flex ( |
I wonder if t, flex, and standard could be considered as QoS classes that share the same label. @jonathan-innis, we'd need to think quickly on this if we wanted to change anything for beta, since burstable is a Boolean and would need to evolve to an enum. Maybe best to do it additively and leave the old label in. |
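To make the "Boolean evolving to an enum" idea concrete, here is a minimal Python sketch, not Karpenter code. It assumes the class can be inferred from the instance family name alone (a family ending in `-flex` is flex, a family starting with `t` plus a digit is burstable). The label keys under `example.com/` are hypothetical placeholders, not labels Karpenter defines. The old Boolean is kept alongside the new enum, as the additive approach suggests.

```python
import re
from enum import Enum


class PerformanceClass(Enum):
    """Hypothetical shared 'QoS class' for an instance type."""
    STANDARD = "standard"
    FLEX = "flex"
    BURSTABLE = "burstable"


# Assumption: burstable families look like t2, t3, t3a, t4g ("t" then a digit).
# This deliberately does not match families such as trn1.
_BURSTABLE_FAMILY = re.compile(r"^t\d")


def family_of(instance_type: str) -> str:
    """Return the family part of a name, e.g. 'm7i-flex.large' -> 'm7i-flex'."""
    return instance_type.split(".", 1)[0]


def classify(instance_type: str) -> PerformanceClass:
    """Classify an instance type purely from its family name (an assumption)."""
    family = family_of(instance_type)
    if family.endswith("-flex"):
        return PerformanceClass.FLEX
    if _BURSTABLE_FAMILY.match(family):
        return PerformanceClass.BURSTABLE
    return PerformanceClass.STANDARD


def labels_for(instance_type: str) -> dict:
    """Emit the new enum label additively, keeping the old Boolean label."""
    cls = classify(instance_type)
    return {
        # Hypothetical label keys, used only for illustration.
        "example.com/performance-class": cls.value,
        "example.com/burstable": str(cls is PerformanceClass.BURSTABLE).lower(),
    }
```

With this shape, existing selectors on the Boolean keep working, while new selectors can match on the enum value.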
@ellistarn can/does Karpenter provision |
By default t3 are unlimited and t2 are not. |
@ellistarn did this get discussed before the beta changes? I think the principle of a QoS class for burstable instances makes sense and I think all |
I discovered that m7i-flex is running on EKS without specifying the t type in nodepool. It could have been a major issue if it occurred in production. Is there any update? |
First, M7i-flex instances are not T instances. Comparison vs. M7i: M7i instances are a great choice for all general-purpose workloads, especially for workloads that need the largest instance sizes or continuous high CPU usage, such as large application servers and databases, gaming servers, CPU-based machine learning (ML), and video streaming. https://aws.amazon.com/ec2/instance-types/m7i/ Curious to know if you saw any issue or if you are speculating. |
@saurabhmodh4 |
I agree with what Steve Hipwell said. Why is it cheaper than m7i? It's because there are trade-offs. The basic performance is provided at 40%, but it can burst up to 95% using credits. You mentioned that it is not a T type, but to me, it seems the concept is completely identical to T type, only the name has changed. Am I right? If there is something I am misunderstanding due to lack of information, please explain. If I am correct, isn't it natural that problems could arise when m7i and m7i-flex are mixed? Imagine an EC2 instance provisioned in an environment operating on the assumption of 100% usage, facing performance limitations when there are no credits. Pods running on m7i-flex could experience issues such as increased latency due to not being able to use the CPU as needed, while the overall CPU usage remains low. This means that services relying on pod autoscaling based on CPU usage could not count on autoscaling being triggered. This could cause disruptions in users' environments. Although the name change without maintaining backward compatibility is a bit surprising, it can be understood if it's for the sake of consistency with the same generation/architecture of EC2. However, making it indistinguishable from other m-type instances could be a serious problem. Also, the fact that it is useful for specific workloads is not a reasonable basis for preventing the Karpenter development team from distinguishing it, especially when there is a clear difference. |
Sorry, that is wrong. There are no credits in M7i-flex.
I don't think I got an explicit answer to my question from last time. |
@saurabhmodh4 the documentation specifically states that the
I'm also pretty sure that what you see in general EC2 instances is significantly different to what is seen in Kubernetes, where we're attempting to optimise bin packing and CPU utilisation. Flex looks to be an alternative approach to cost cutting which may work in limited Kubernetes contexts but is likely to reduce performance and introduce not only additional latency but also lack of repeatability. Can I also point out that both the network and EBS bandwidths are significantly reduced for flex instances. Both of these are significant to the operation of most non-trivial Kubernetes clusters. The more I think about this the more I think that flex instances shouldn't be treated as general purpose instances in the Kubernetes context. A sub 20% price performance benefit isn't going to make up for the engineering cost of guaranteeing that unexpected behaviour isn't caused by the compute platform. It's definitely not going to be worth it where performant networking or storage is involved. So ideally Karpenter makes flex instances explicit opt-in. Again I'll point out that at the bare minimum there needs to be a documented pattern for blocking current and future flex instances. |
Agree on the network and EBS baseline bandwidths and the association to non-trivial clusters that need high baseline network and EBS specs. But I bet you those are not the majority which is what I believe you are also implying. |
|
@jonathan-innis should we build a selector for amd/intel? I agree we should have a selector for flex. |
I'm supportive of a selector on CPU manufacturer, similar to how we have one for GPU manufacturer. Looks like that particular issue is tracked here: #3529. We'd definitely be receptive if anyone was wanting to pick that one up and implement it. |
@saurabhmodh4 is there any documentation you can point us to that shows how the new
How long is the instance able to hold the 95% figure? What factors determine whether I can scale my CPU usage on the instance, etc. For what it's worth I agree with the sentiment that this instance will probably be very useful for a large majority of users, but as a person looking to pack pods and use spot instances I was surprised to see this instance in my fleet. |
To chime in on this: I think we're generally supportive of adding additional selectors here (both on CPU manufacturer and on whether something is a flex instance type or not). Maybe some labels like In either case, this isn't something that the maintainers have bandwidth to pull in right now but we'd be supportive of a PR that added this feature in if one was opened. |
Hi guys, I opened a PR (#5769) that takes care of CPU manufacturer selection. I haven't seen an option to distinguish between flex and standard instances besides their name, unlike burstable ones that the EC2 API does expose. We can create a requirement ( |
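As a rough illustration of the name-based approach, the Python sketch below derives the flex families from a list of instance type names and builds a requirement that excludes them. It is only a sketch under assumptions: the input list is hypothetical sample data, the `-flex` suffix check is the only signal used, and the output is a plain dictionary in the key/operator/values shape of a NodePool requirement on `karpenter.k8s.aws/instance-family`, not a call into Karpenter itself.

```python
def flex_families(instance_types):
    """Return the sorted, de-duplicated families whose name ends in '-flex'."""
    families = {name.split(".", 1)[0] for name in instance_types}
    return sorted(f for f in families if f.endswith("-flex"))


def exclude_flex_requirement(instance_types):
    """Build a requirement-shaped dict that excludes every flex family found.

    Assumption: the caller supplies the current list of instance type names,
    so the values must be regenerated whenever new flex families appear.
    """
    return {
        "key": "karpenter.k8s.aws/instance-family",
        "operator": "NotIn",
        "values": flex_families(instance_types),
    }


if __name__ == "__main__":
    # Hypothetical sample names, used only to show the shape of the result.
    sample = ["m7i.large", "m7i-flex.large", "m7i-flex.xlarge", "t3.micro"]
    print(exclude_flex_requirement(sample))
```

Because the exclusion list is static once generated, it would not block future flex families on its own, which is part of why a dedicated label was being discussed.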
@bwagner5 did we come up with a proposed label for this? |
How do we feel about |
@jonathan-innis -- re: |
@tehranian Do you have specific bandwidth requirements or you just want network optimized in general? |
@jonathan-innis I think we just want a general selector for "network-optimized": I imagine if we had a specific bandwidth requirement, one could use a I'd probably |
@tehranian Do you mind opening a separate issue requesting |
Filed #6122 for a boolean field for network-optimized instances 👍 |
Given that the new m7i-flex instance type is now available and breaks the current pattern of burstable instances being in a non-m family, how does Karpenter currently treat these instances and allow them to be filtered out? If this isn't a default pattern what plans are there to improve this? My "two cents" is that the concept of burstable instances is very similar to the spot instance concept and should probably align with that.