Add support for containers running on AWS ECS #65

karls · 2023-07-12T11:37:12Z

No description provided.

adamlogic · 2023-07-12T15:57:11Z

judoscale/core/config.py

@@ -71,6 +75,18 @@ def for_render(cls, env: Mapping):
        api_base_url = f"https://adapter.judoscale.com/api/{service_id}"
        return cls(runtime_container, api_base_url, env)

+    @classmethod
+    def for_ecs(cls, env: Mapping):
+        container_metadata = requests.get(env["ECS_CONTAINER_METADATA_URI_V4"]).json()


Does this endpoint have any authentication?

I can't find any documentation about this endpoint being auth-protected. This endpoint is a private IP supplied automatically by ECS.

There are some differences in what data is returned between running ECS-on-Fargate running ECS-on-EC2. I suspect that most folks migrating from Heroku will be on Fargate, though. But I have no data to back this up, just a gut feel.

adamlogic · 2023-07-12T16:00:51Z

judoscale/core/config.py

+
+        # All ECS containers have to be manually assigned as web instances,
+        # because ECS does not distinguish between web and worker instances.
+        runtime_container = RuntimeContainer(service_name, instance, "web")


Seems like the only place we use this is here: https://github.com/judoscale/judoscale-python/blob/aws-ecs-support/judoscale/core/metrics_collectors.py#L33-L34

Do you agree?

And I'm really confused about its purpose there. We would only call should_collect if we're trying to add a web metric, would would only happen on a web instance.

Does any of this ring a bell for you? I feel like I might need to dig a bit deeper on this logic.

I'm not 100% sure I agree. What is confusing about its purpose? The Ruby adapter seems to have an identical implementation.

For context, the actual app instance (Flask, Django, etc) is created on dyno/container types other than web, too. E.g in the Flask RQ sample app, the app instance initialises the request middleware (judoscale.init_app(app)) as well as the RQ extension (judoscale_rq(app.redis)) — https://github.com/judoscale/judoscale-python/blob/aws-ecs-support/sample-apps/flask_rq_sample/app/app.py#L26-L27.

The run-worker.py script creates an app instance in order to run the worker, and would be running both the web and queue metrics collectors.

A "web" dyno/container would report metrics for both web request latency and queue latency. But a "worker" dyno/container would only report queue latency.

But ECS as a platform does not distinguish between "web" and "worker" services — it does not assign any special names to services and tasks. I hardcoded "web" so that both request and queue latency is reported on all instances. Perhaps it'd be clearer if the last argument to the RuntimeContainer constructor was optional and tweak .is_web_instance to handle this?

The Ruby adapter seems to have an identical implementation.

I agree, and I'm equally confused in the Ruby adapter. I'm not questioning anything in this PR—it's more of a "why are we doing things this way in general" kind of question. And I know I'm the one who put that logic in there to begin with, so I don't necessarily expect you to have the answer.

A "web" dyno/container would report metrics for both web request latency and queue latency. But a "worker" dyno/container would only report queue latency.

This is the piece I'm confused about. We have "web" containers skip the reporting for web request latency, but if it's not a web container, there won't be any web requests to report anyway. So why have that logic at all?

In other words, would anything behave differently if we removed the is_web_instance method (and its counterpart in the Ruby adapter)? And if we removed that, we'd have no need to differentiate "web" vs "worker" containers. We could remove the service_type property from RuntimeContainer and simplify a bit of our code.

That would be for a separate PR, and I apologize for the digression here. It just felt relevant since we're having to make arbitrary decisions about the service type here.

We have "web" containers skip the reporting for web request latency, but if it's not a web container, there won't be any web requests to report anyway. So why have that logic at all?

Did you mean "worker" instead of "web" here?

I think the logic exists to minimise sending redundant metrics and decrease unnecessary noise. This Python issue describes the logic and reasoning in more detail https://github.com/judoscale/judoscale-python-issues/issues/1.

I think we could simplify this code for sure. The "cost" is making redundant requests to the Judoscale adapter API. If that's acceptable then I think we could remove .is_web_instance and have metrics reported on all instance types, provided that the adapter is enabled (there is an endpoint to POST metrics to).

The "cost" is making redundant requests to the Judoscale adapter API.

That was the intention, but I don't think it achieves that goal in practice. A worker container won't receive web requests, so it won't have web metrics to report regardless of the is_web_instance logic. (Sorry I've been struggling to put this into coherent words.)

I'll quit beating this horse in this particular PR and open up a new issue for it. (#66)

adamlogic

I think this is good to merge, but I want to have a full working end-to-end PoC before we cut a release.

adamlogic

As I dove into the Ruby adapter today, I realized that we don't need to fetch the metadata. The container ID (instance) is part of the URL, and we don't actually care about the service_name.

For the Ruby adapter, I actually removed the service_name from our RuntimeContainer class. It's now just a string wrapper with a single method for redundant instances. Note that I also removed the web/worker service_type distinction in a previous PR.

The PR for adding ECS support to the Ruby adapter ended up being super tiny after those changes.

Does this make sense?

adamlogic · 2023-07-17T12:48:55Z

judoscale/core/config.py

@@ -55,6 +57,13 @@ def for_render(cls, env: Mapping):
        api_base_url = f"https://adapter.judoscale.com/api/{service_id}"
        return cls(runtime_container, api_base_url, env)

+    @classmethod
+    def for_ecs(cls, env: Mapping):
+        instance = env["ECS_CONTAINER_METADATA_URI_V4"].split("/")[-1]


For the Ruby adapter, I used ECS_CONTAINER_METADATA_URI instead of ECS_CONTAINER_METADATA_URI_V4, thinking that the former would be more reliable. The docs say the latter env var is inject "beginning with Fargate platform version 1.4.0", so I'm concerned that it'll be missing for some customers.

Good shout! 👍🏻

adamlogic

🚀

adamlogic reviewed Jul 12, 2023

View reviewed changes

adamlogic approved these changes Jul 14, 2023

View reviewed changes

adamlogic requested changes Jul 14, 2023

View reviewed changes

karls force-pushed the aws-ecs-support branch from 7f22aa6 to cda84c2 Compare July 17, 2023 12:37

karls changed the base branch from main to simplify-runtime-container July 17, 2023 12:37

adamlogic reviewed Jul 17, 2023

View reviewed changes

adamlogic approved these changes Jul 17, 2023

View reviewed changes

Base automatically changed from simplify-runtime-container to main July 18, 2023 10:59

karls added 4 commits July 18, 2023 14:00

Add configuration entrypoint for ECS

c1a434b

Instantiate RuntimeContainer based on container metadata URI

1c15200

Remove unneeded deps

cca5591

Use metadata V3 URI

a3bf95e

karls force-pushed the aws-ecs-support branch from 1951eb4 to a3bf95e Compare July 18, 2023 11:01

karls merged commit 3a6a356 into main Jul 18, 2023
4 checks passed

karls deleted the aws-ecs-support branch July 18, 2023 11:02

github-actions bot mentioned this pull request Jul 18, 2023

chore(main): release 1.5.0 #68

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for containers running on AWS ECS #65

Add support for containers running on AWS ECS #65

karls commented Jul 12, 2023

adamlogic Jul 12, 2023

karls Jul 13, 2023

adamlogic Jul 12, 2023

karls Jul 13, 2023

adamlogic Jul 13, 2023

karls Jul 14, 2023

adamlogic Jul 14, 2023 •

edited

Loading

adamlogic left a comment

adamlogic left a comment

adamlogic Jul 17, 2023

karls Jul 17, 2023

adamlogic left a comment

Add support for containers running on AWS ECS #65

Add support for containers running on AWS ECS #65

Conversation

karls commented Jul 12, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamlogic Jul 14, 2023 • edited Loading

Choose a reason for hiding this comment

adamlogic left a comment

Choose a reason for hiding this comment

adamlogic left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamlogic left a comment

Choose a reason for hiding this comment

adamlogic Jul 14, 2023 •

edited

Loading