-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for containers running on AWS ECS #65
Conversation
judoscale/core/config.py
Outdated
@@ -71,6 +75,18 @@ def for_render(cls, env: Mapping): | |||
api_base_url = f"https://adapter.judoscale.com/api/{service_id}" | |||
return cls(runtime_container, api_base_url, env) | |||
|
|||
@classmethod | |||
def for_ecs(cls, env: Mapping): | |||
container_metadata = requests.get(env["ECS_CONTAINER_METADATA_URI_V4"]).json() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this endpoint have any authentication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find any documentation about this endpoint being auth-protected. This endpoint is a private IP supplied automatically by ECS.
There are some differences in what data is returned between running ECS-on-Fargate running ECS-on-EC2. I suspect that most folks migrating from Heroku will be on Fargate, though. But I have no data to back this up, just a gut feel.
judoscale/core/config.py
Outdated
|
||
# All ECS containers have to be manually assigned as web instances, | ||
# because ECS does not distinguish between web and worker instances. | ||
runtime_container = RuntimeContainer(service_name, instance, "web") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the only place we use this is here: https://github.com/judoscale/judoscale-python/blob/aws-ecs-support/judoscale/core/metrics_collectors.py#L33-L34
Do you agree?
And I'm really confused about its purpose there. We would only call should_collect
if we're trying to add a web metric, would would only happen on a web instance.
Does any of this ring a bell for you? I feel like I might need to dig a bit deeper on this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure I agree. What is confusing about its purpose? The Ruby adapter seems to have an identical implementation.
For context, the actual app instance (Flask, Django, etc) is created on dyno/container types other than web
, too. E.g in the Flask RQ sample app, the app instance initialises the request middleware (judoscale.init_app(app)
) as well as the RQ extension (judoscale_rq(app.redis)
) — https://github.com/judoscale/judoscale-python/blob/aws-ecs-support/sample-apps/flask_rq_sample/app/app.py#L26-L27.
The run-worker.py
script creates an app instance in order to run the worker, and would be running both the web and queue metrics collectors.
A "web" dyno/container would report metrics for both web request latency and queue latency. But a "worker" dyno/container would only report queue latency.
But ECS as a platform does not distinguish between "web" and "worker" services — it does not assign any special names to services and tasks. I hardcoded "web" so that both request and queue latency is reported on all instances. Perhaps it'd be clearer if the last argument to the RuntimeContainer
constructor was optional and tweak .is_web_instance
to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Ruby adapter seems to have an identical implementation.
I agree, and I'm equally confused in the Ruby adapter. I'm not questioning anything in this PR—it's more of a "why are we doing things this way in general" kind of question. And I know I'm the one who put that logic in there to begin with, so I don't necessarily expect you to have the answer.
A "web" dyno/container would report metrics for both web request latency and queue latency. But a "worker" dyno/container would only report queue latency.
This is the piece I'm confused about. We have "web" containers skip the reporting for web request latency, but if it's not a web container, there won't be any web requests to report anyway. So why have that logic at all?
In other words, would anything behave differently if we removed the is_web_instance
method (and its counterpart in the Ruby adapter)? And if we removed that, we'd have no need to differentiate "web" vs "worker" containers. We could remove the service_type
property from RuntimeContainer
and simplify a bit of our code.
That would be for a separate PR, and I apologize for the digression here. It just felt relevant since we're having to make arbitrary decisions about the service type here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have "web" containers skip the reporting for web request latency, but if it's not a web container, there won't be any web requests to report anyway. So why have that logic at all?
Did you mean "worker" instead of "web" here?
I think the logic exists to minimise sending redundant metrics and decrease unnecessary noise. This Python issue describes the logic and reasoning in more detail https://github.com/judoscale/judoscale-python-issues/issues/1.
I think we could simplify this code for sure. The "cost" is making redundant requests to the Judoscale adapter API. If that's acceptable then I think we could remove .is_web_instance
and have metrics reported on all instance types, provided that the adapter is enabled (there is an endpoint to POST metrics to).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "cost" is making redundant requests to the Judoscale adapter API.
That was the intention, but I don't think it achieves that goal in practice. A worker container won't receive web requests, so it won't have web metrics to report regardless of the is_web_instance
logic. (Sorry I've been struggling to put this into coherent words.)
I'll quit beating this horse in this particular PR and open up a new issue for it. (#66)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to merge, but I want to have a full working end-to-end PoC before we cut a release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I dove into the Ruby adapter today, I realized that we don't need to fetch the metadata. The container ID (instance
) is part of the URL, and we don't actually care about the service_name
.
For the Ruby adapter, I actually removed the service_name
from our RuntimeContainer
class. It's now just a string wrapper with a single method for redundant instances. Note that I also removed the web/worker service_type
distinction in a previous PR.
The PR for adding ECS support to the Ruby adapter ended up being super tiny after those changes.
Does this make sense?
judoscale/core/config.py
Outdated
@@ -55,6 +57,13 @@ def for_render(cls, env: Mapping): | |||
api_base_url = f"https://adapter.judoscale.com/api/{service_id}" | |||
return cls(runtime_container, api_base_url, env) | |||
|
|||
@classmethod | |||
def for_ecs(cls, env: Mapping): | |||
instance = env["ECS_CONTAINER_METADATA_URI_V4"].split("/")[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Ruby adapter, I used ECS_CONTAINER_METADATA_URI
instead of ECS_CONTAINER_METADATA_URI_V4
, thinking that the former would be more reliable. The docs say the latter env var is inject "beginning with Fargate platform version 1.4.0", so I'm concerned that it'll be missing for some customers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout! 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
No description provided.