
Shibuya distributed mode #19

Open
5 of 7 tasks
iandyh opened this issue Dec 4, 2020 · 2 comments

iandyh commented Dec 4, 2020

A big one. Let me break it down into smaller tasks:

  • Local development environments. p0

  • The controller should have a leader, since only one process should do the GC/progress-check work. (Maybe we can move this logic into the worker as well?) p1

  • Allow Shibuya to be deployed in either central or distributed mode. optional

  • The current engine-metric-reading logic should be extracted and built as a standalone container. This is essentially the worker. p0

  • Communication between the controller and the worker. p0

  • Collect the metrics read by the workers. p0

  • Worker release steps

iandyh self-assigned this Dec 22, 2020
iandyh added the enhancement label Dec 22, 2020

iandyh commented Jan 27, 2022

c.resumeRunningPlans()           // resume metric reading for plans that were running before a restart
go c.streamToApi()               // stream raw metrics to the API (collected in heap memory today)
go c.readConnectedEngines()      // read from the connected engines
go c.checkRunningThenTerminate() // stop/GC a plan once its duration is reached
go c.fetchEngineMetrics()        // collect engine resource usage for the executor view
go c.cleanLocalStore()           // clean the local Prom data
go c.autoPurgeDeployments()      // GC idle engine deployments

Because of these goroutines, the controller is currently not stateless. This is the path we can follow:

[1] Move all of the stateful logic into the workers, if possible.
[2] If not, some kind of leader election might be required, so that only the leader does the stateful work and the others just handle API requests.

If we need [2], then we have to consider what happens around leader election, for example during a release or when the leader goes down.
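
If we end up with [2], here is a minimal sketch of what it could look like, assuming we keep running on Kubernetes and lean on client-go's leaderelection package (the lease name, namespace, and the startStatefulWork hook are placeholders, not actual Shibuya code):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runControllerWithLeaderElection starts the stateful goroutines only on the
// replica that currently holds the lease; the others keep serving API requests.
func runControllerWithLeaderElection(ctx context.Context, startStatefulWork func(context.Context)) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // the pod name is unique enough as an identity

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "shibuya-controller", Namespace: "shibuya"}, // placeholder names
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true, // give up the lease on shutdown to shorten handover during a release
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: startStatefulWork, // e.g. launch checkRunningThenTerminate etc.
			OnStoppedLeading: func() {
				// stop the stateful goroutines here; the process can keep handling API traffic
			},
		},
	})
}

ReleaseOnCancel plus a short RetryPeriod keeps the leaderless window small, which addresses the release/leader-crash concern above.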

Another challenge is that once we have replicas of the controller, Prom cannot get the metrics.

resumeRunningPlans

This is required so that we can continue reading the metrics after the controller process is restarted. [1]

streamToApi

This is for raw metrics streaming. Currently the metrics are collected in heap memory. We need the workers to report the metrics to a broker (Redis is a good candidate) and let the controller be the consumer. [1]
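
As a rough illustration of that broker idea, a sketch using Redis pub/sub through go-redis (the channel naming and function shapes are assumptions, not a settled design):

package metrics

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Worker side: publish raw engine metrics to Redis instead of keeping them
// in the controller's heap.
func PublishMetrics(ctx context.Context, rdb *redis.Client, collectionID string, payload []byte) error {
	return rdb.Publish(ctx, "shibuya:metrics:"+collectionID, payload).Err()
}

// Controller side: consume the metrics and forward them to the API stream.
// handle stands in for whatever streamToApi currently does with a metric line.
func ConsumeMetrics(ctx context.Context, rdb *redis.Client, collectionID string, handle func(string)) {
	sub := rdb.Subscribe(ctx, "shibuya:metrics:"+collectionID)
	defer sub.Close()
	for msg := range sub.Channel() {
		handle(msg.Payload)
	}
}

Plain pub/sub drops messages while no controller is subscribed; if we need replay after a controller restart (see resumeRunningPlans above), Redis Streams would be the better fit.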

checkRunningThenTerminate

We track the progress of each running plan and stop (GC) everything once its duration is reached. Currently we fetch all the running plans. It seems pretty difficult to move this logic into the worker. [2]
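
If this stays in the controller (leader-only under [2]), the loop itself is simple; this sketch uses a hypothetical RunningPlan type and fetch/terminate callbacks in place of our real models:

package controller

import (
	"context"
	"time"
)

// RunningPlan is a hypothetical stand-in for Shibuya's plan model.
type RunningPlan struct {
	ID        string
	StartedAt time.Time
	Duration  time.Duration
}

// checkRunningThenTerminate polls all running plans and GCs the ones
// whose configured duration has elapsed.
func checkRunningThenTerminate(ctx context.Context, fetchRunning func() []RunningPlan, terminate func(RunningPlan)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, p := range fetchRunning() {
				if time.Since(p.StartedAt) >= p.Duration {
					terminate(p) // stop the engines and clean up the plan's resources
				}
			}
		}
	}
}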

fetchEngineMetrics

This is for showing engine resource usage on the executor side. Currently we fetch all the engines with the GetDeployedCollection method. This is also difficult to move into the workers. [2]
We actually cannot keep this method in the controller either, because Prom cannot fetch the metrics once we scale up the controller.

cleanLocalStore

This is to clean Prom data. Easy to move. [1]

autoPurgeDeployments

This is the GC process that cleans up idle engines. We use GetDeployedCollection to fetch all the engines and then filter. [2]
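
The filtering itself amounts to something like the sketch below, assuming GetDeployedCollection can report when each engine deployment was last used (the types and the idle threshold are hypothetical):

package controller

import "time"

// EngineDeployment is a hypothetical stand-in for what GetDeployedCollection returns.
type EngineDeployment struct {
	CollectionID string
	LastUsedAt   time.Time
}

// purgeIdle tears down engine deployments that have been idle longer than maxIdle.
func purgeIdle(deployments []EngineDeployment, maxIdle time.Duration, purge func(EngineDeployment)) {
	for _, d := range deployments {
		if time.Since(d.LastUsedAt) > maxIdle {
			purge(d) // delete the idle engines from the cluster
		}
	}
}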


iandyh commented Jul 24, 2023

Before going into the details, there are also some items that need to be done:

  • Move some controller logic into a separate process. The process also requires its own runtime (Dockerfile); we want to unify the runtime with the other components.
  • Use Helm to deploy the whole package (controller, api, etc.) to the cluster.
