
Shibuya distributed mode #19

Open
5 of 7 tasks
iandyh opened this issue Dec 4, 2020 · 2 comments

iandyh commented Dec 4, 2020

A big one. Let me break it down into smaller tasks:

  • Local development environments. p0

  • The controller should have a leader, since only one process should do the GC/progress-check work. (Maybe we can move this logic into the worker as well?) p1

  • Allow Shibuya to be deployed in either central or distributed mode. optional

  • The current engine-metric-reading logic should be extracted and built as a standalone container. This is essentially the worker. p0

  • Communication between the controller and the worker. p0

  • Collect the metrics read by the workers. p0

  • Worker release steps

iandyh self-assigned this Dec 22, 2020
iandyh added the enhancement label Dec 22, 2020

iandyh commented Jan 27, 2022

c.resumeRunningPlans()           // resume metric reading for plans that were running before a restart
go c.streamToApi()               // stream raw metrics to the API (collected in heap memory today)
go c.readConnectedEngines()      // read from the connected engines
go c.checkRunningThenTerminate() // stop/GC a plan once its duration is reached
go c.fetchEngineMetrics()        // collect engine resource usage for the executor view
go c.cleanLocalStore()           // clean the local Prom data
go c.autoPurgeDeployments()      // GC idle engine deployments

Because of these goroutines, the controller is currently not stateless. This is the path we can follow:

[1] Move all of the stateful logic into the workers, if possible.
[2] If not, some kind of leader election might be required, so that only the leader does the stateful work and the others just handle API requests.

If we need [2], then we have to consider what happens around leader election, for example during a release or when the leader goes down.
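
If we end up with [2], here is a minimal sketch of what it could look like, assuming we keep running on Kubernetes and lean on client-go's leaderelection package (the lease name, namespace, and the startStatefulWork hook are placeholders, not actual Shibuya code):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runControllerWithLeaderElection starts the stateful goroutines only on the
// replica that currently holds the lease; the others keep serving API requests.
func runControllerWithLeaderElection(ctx context.Context, startStatefulWork func(context.Context)) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // the pod name is unique enough as an identity

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "shibuya-controller", Namespace: "shibuya"}, // placeholder names
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true, // give up the lease on shutdown to shorten handover during a release
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: startStatefulWork, // e.g. launch checkRunningThenTerminate etc.
			OnStoppedLeading: func() {
				// stop the stateful goroutines here; the process can keep handling API traffic
			},
		},
	})
}

ReleaseOnCancel plus a short RetryPeriod keeps the leaderless window small, which addresses the release/leader-crash concern above.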

Another challenge is that once we have replicas of the controller, Prom cannot get the metrics.

resumeRunningPlans

This is required so that we can continue reading the metrics after the controller process is restarted. [1]

streamToApi

This is for raw metrics streaming. Currently the metrics are collected in heap memory. We need the workers to report the metrics to a broker (Redis is a good candidate) and let the controller be the consumer. [1]
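
As a rough illustration of that broker idea, a sketch using Redis pub/sub through go-redis (the channel naming and function shapes are assumptions, not a settled design):

package metrics

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Worker side: publish raw engine metrics to Redis instead of keeping them
// in the controller's heap.
func PublishMetrics(ctx context.Context, rdb *redis.Client, collectionID string, payload []byte) error {
	return rdb.Publish(ctx, "shibuya:metrics:"+collectionID, payload).Err()
}

// Controller side: consume the metrics and forward them to the API stream.
// handle stands in for whatever streamToApi currently does with a metric line.
func ConsumeMetrics(ctx context.Context, rdb *redis.Client, collectionID string, handle func(string)) {
	sub := rdb.Subscribe(ctx, "shibuya:metrics:"+collectionID)
	defer sub.Close()
	for msg := range sub.Channel() {
		handle(msg.Payload)
	}
}

Plain pub/sub drops messages while no controller is subscribed; if we need replay after a controller restart (see resumeRunningPlans above), Redis Streams would be the better fit.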

checkRunningThenTerminate

We track the progress of each running plan and stop (GC) everything once its duration is reached. Currently we fetch all the running plans. It seems pretty difficult to move this logic into the worker. [2]
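
If this stays in the controller (leader-only under [2]), the loop itself is simple; this sketch uses a hypothetical RunningPlan type and fetch/terminate callbacks in place of our real models:

package controller

import (
	"context"
	"time"
)

// RunningPlan is a hypothetical stand-in for Shibuya's plan model.
type RunningPlan struct {
	ID        string
	StartedAt time.Time
	Duration  time.Duration
}

// checkRunningThenTerminate polls all running plans and GCs the ones
// whose configured duration has elapsed.
func checkRunningThenTerminate(ctx context.Context, fetchRunning func() []RunningPlan, terminate func(RunningPlan)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, p := range fetchRunning() {
				if time.Since(p.StartedAt) >= p.Duration {
					terminate(p) // stop the engines and clean up the plan's resources
				}
			}
		}
	}
}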

fetchEngineMetrics

This is for showing engine resource usage on the executor side. Currently we fetch all the engines with the GetDeployedCollection method. This is also difficult to move into the workers. [2]
We actually cannot keep this method in the controller either, because Prom cannot fetch the metrics once we scale up the controller.

cleanLocalStore

This is to clean Prom data. Easy to move. [1]

autoPurgeDeployments

This is the GC process that cleans up idle engines. We use GetDeployedCollection to fetch all the engines and then filter. [2]
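
The filtering itself amounts to something like the sketch below, assuming GetDeployedCollection can report when each engine deployment was last used (the types and the idle threshold are hypothetical):

package controller

import "time"

// EngineDeployment is a hypothetical stand-in for what GetDeployedCollection returns.
type EngineDeployment struct {
	CollectionID string
	LastUsedAt   time.Time
}

// purgeIdle tears down engine deployments that have been idle longer than maxIdle.
func purgeIdle(deployments []EngineDeployment, maxIdle time.Duration, purge func(EngineDeployment)) {
	for _, d := range deployments {
		if time.Since(d.LastUsedAt) > maxIdle {
			purge(d) // delete the idle engines from the cluster
		}
	}
}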


iandyh commented Jul 24, 2023

Before going into the details, there are also some items that need to be done:

  • Move some controller logic into a separate process. The process also requires its own runtime (Dockerfile); we want to unify the runtime with the other components.
  • Use Helm to deploy the whole package (controller, api, etc.) to the cluster.
