Distributed Deployment Configuration #64

mapl · 2020-12-30T02:10:10Z

Assume you have Gatus deployed in several security subnets (or zones) to monitor individual services, because a single Gatus instance cannot reach all of those services (administratively prohibited by firewall rules, etc.).

But you want one main Gatus instance which is capable of retrieving health information of services from all those other Gatus instances to display them in one unified Gatus Dashboard.

Can we start a short discussion on this?

PS. Thank you for this great project!

TwiN · 2020-12-30T02:42:18Z

Hello @mapl.

This is quite an edge case, so honestly, I'm not sure I'll be personally working on this unless more people show interest in this feature (based on the number of 👍 on the issue).

The easiest way to implement this that I can think of is by having a way to allow pushing data into Gatus, as opposed to the current behavior, which only allows retrieving data from Gatus.
By leveraging this as well as a configuration that allows specifying whether a Gatus instance is the primary instance or a secondary instance, the latter required to specify the endpoint of the primary instance, it would be possible to have a "global" dashboard and multiple Gatus instances configured independently.

Fortunately, the easiest way is also the most convenient one, because the other ones would likely involve persistence.

There's an even easier solution, but it assumes that the users accessing the dashboard have access to all "security subnets/zones", which I'm not sure is the case based on your explanation. The only work required would be to send a request from the Gatus dashboard frontend to each backend and merge the statuses.

mapl · 2021-01-23T22:01:12Z

I made a quick diagram about a simple distributed Gatus Deployment where the main Gatus Instance just pulls data from remote Gatus instances.

The Data from each remote Gatus instance is just embedded into the Main Gatus Dashboard.

sefaphlvn · 2021-01-24T09:38:42Z

I think this is a good idea. I would like it if there was

TwiN · 2021-01-24T10:15:10Z

That looks good.

@mapl What do you think would be the appropriate behavior when there are overlapping service names?

Also, how about we do the opposite: the remote Gatus instances push their data to the main Gatus instance?
I think that would allow a lot more flexibility, especially if, for instance, one of the remote Gatus instances is running in an environment completely inaccessible from the main Gatus instance (e.g. locally).

Of course, this would require a layer of security, but I built something oddly relevant to this specific use case: https://github.com/TwiN/g8

We'd also need to add something that periodically cleans up services that haven't been refreshed in a long time (i.e. in case a remote instance is taken offline, we don't want to keep the outdated service health checks on the dashboard forever).
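To illustrate the periodic cleanup described above, here is a minimal sketch. It assumes the main instance keeps a timestamp of the last refresh per service; `prune_stale`, the `last_seen` mapping and the one-hour `max_age` are hypothetical names and values, not part of Gatus.

```python
from datetime import datetime, timedelta, timezone


def prune_stale(last_seen, now, max_age):
    """Return only the services refreshed within max_age of now.

    last_seen maps a service name to the time its status was last refreshed
    (a hypothetical structure used for illustration only).
    """
    return {name: seen for name, seen in last_seen.items() if now - seen <= max_age}


now = datetime(2021, 1, 24, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "frontend": now - timedelta(minutes=2),  # refreshed recently
    "legacy-api": now - timedelta(hours=6),  # its remote instance went offline
}
fresh = prune_stale(last_seen, now, max_age=timedelta(hours=1))
print(sorted(fresh))
```

Running this on a timer would drop `legacy-api` from the dashboard while keeping `frontend`.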

mapl · 2021-01-24T23:22:17Z

Overlapping service names

Hmm... not sure. Is this actually supposed to be a problem? A UUID would solve most of the issues, I think.

Clean up services

A steady health status of the remote Gatus instance would be cool, like the last reached status, just to know if it's still alive.

There is, and always has been, an ongoing debate about pulling vs. pushing when it comes to monitoring.

For example, if you have 100 remote Gatus instances and each of those instances would push data to the main Gatus instance, the main instance is quickly subject to an overload of metric data.

I am wondering if it is enough to just proxy through data from the remote instances to the main instance, or is caching needed?

The goal should always be a design that is as simple as possible.

https://dave.cheney.net/2019/07/09/clear-is-better-than-clever

Security

An API Token is commonly used to restrict permissions. What's your opinion on this?

Anyway, a very good reference is Prometheus and why its devs decided to go with pulling rather than pushing.

However, that's not a one-way road, so it depends on the scenario. For the most part, pulling is the better choice.

https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/

https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push

TwiN · 2021-01-30T05:24:41Z

Fair enough, though if we're pulling, then there's no need for a token since the endpoint needs to be public for the dashboard to be shown.

Sounds good!

mapl · 2021-02-01T21:33:48Z

If you can ensure that the main Gatus instance can reach its remote instances via http(s) to fetch the JSON data, it should be fine.

What do you think of this config idea snippet to be deployed on the main instance? I thought an Api key could be handy but not required as you mentioned.

remote-services:
  - name: Gatus Remote Instance 1
    url: http://10.10.10.10/api/v1/statuses
    api-key: XXXXXXXXXXXXXX
    interval: 10s
  - name: Gatus Remote Instance 2
    url: http://10.10.10.20/api/v1/statuses
    api-key: XXXXXXXXXXXXXX
    interval: 5s
  - name: Gatus Remote Instance 3
    url: http://10.10.10.30/api/v1/statuses
    api-key: XXXXXXXXXXXXXX
    interval: 10s

TwiN · 2021-02-02T23:20:21Z

I was thinking something along the lines of

remote:
  instances:
    - name: "gatus-internal"
      url: "http://10.10.10.10/api/v1/statuses"
      interval: 30s

because in the future, we may have to add other parameters specific to the remote configuration, such as:

strategy: merge: merge all statuses fetched from the remote instances with existing, overlapping service names, or create new statuses if they don't already exist
strategy: prefix-service: Prefix each retrieved service name with the configured name of the remote instance
strategy: prefix-group: Prefix each retrieved group name with the configured name of the remote instance
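The three strategies above could behave roughly as in the sketch below. It is not an implementation from Gatus: the status shape (dicts with `group` and `name`), the `-` separator, and the choice that `merge` lets the remote entry replace a local one with the same group and name are all assumptions made for illustration.

```python
def apply_strategy(strategy, instance_name, remote_statuses, local_statuses):
    """Combine local statuses with statuses pulled from one remote instance.

    Each status is a dict with "group" and "name" keys (an assumed shape).
    """
    if strategy == "prefix-service":
        remote = [{**s, "name": f"{instance_name}-{s['name']}"} for s in remote_statuses]
        return local_statuses + remote
    if strategy == "prefix-group":
        remote = [{**s, "group": f"{instance_name}-{s['group']}"} for s in remote_statuses]
        return local_statuses + remote
    if strategy == "merge":
        # Assumption: a remote entry replaces a local one sharing (group, name);
        # entries that do not exist locally are added.
        merged = {(s["group"], s["name"]): s for s in local_statuses}
        merged.update({(s["group"], s["name"]): s for s in remote_statuses})
        return list(merged.values())
    raise ValueError(f"unknown strategy: {strategy}")


local = [{"group": "core", "name": "api", "healthy": True}]
remote = [
    {"group": "core", "name": "api", "healthy": False},
    {"group": "core", "name": "db", "healthy": True},
]
for strategy in ("merge", "prefix-service", "prefix-group"):
    combined = apply_strategy(strategy, "gatus-internal", remote, local)
    print(strategy, [(s["group"], s["name"]) for s in combined])
```

With the prefix strategies, overlapping names never collide; with `merge`, the result depends entirely on the collision rule that is chosen.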

Anyways, this isn't for right now, but I still think it's good to make the configuration as extensible as possible to prevent future breaking changes.

mapl · 2021-02-05T23:36:33Z

Absolutely, the config should be as extensible as possible and future proof where possible. Good point.
Your concern about possible service name clashes when the main and remote instances use the same names could be an actual issue if you put them in one flat view.
I think that if every remote instance is moved into its own, let's say, "container" box, then you could clearly see where each service originates.
So duplicate names wouldn't matter in this case, as they are displayed in their own container, similar to the nice grouping feature you added.
Additionally, you could also check the health state of the entire remote instance, for instance the last time it was reached, its latency, etc.

TwiN · 2021-03-02T04:48:23Z

I've been thinking a bit more about the implementation for this, and one issue I can think of is how to handle alerting.

Assuming that the purpose is to monitor internal applications within a network that isn't publicly accessible by other users, but is accessible by a "remote" Gatus instance, how would alerting for a "remote" Gatus instance be handled, assuming that the alerting configuration as well as the individual alerts for each service are not exposed through Gatus' main endpoint (/api/v1/statuses)?

I think it's safe to assume that each remote instance is expected to deal with its own alerts and that the only difference between the remote instance(s) and the main instance is that the main instance's dashboard must display the statuses from the remote instances as well.

All in all, I don't think this is a problem, but felt like specifying this through a comment would be worthwhile for the sake of traceability.

martincolladodev · 2021-03-12T09:11:45Z

Maybe it would be a good idea to have two different systems?
One implemented as a check-in system, and another pulling data from intranet networks.

Check-in system: this could give developers control to implement some logic in the check. For example, in our systems we use Healthchecks to run certain tests, and if everything goes well, we send an OK to the Healthchecks API endpoint. That way we know not only that the server responds and is alive, but also that the complementary logic passes, for example that our databases are receiving data or that our auth service is working properly.

Statping: https://github.com/statping/statping
Healthchecks: https://github.com/healthchecks/healthchecks

As for alerting in check-in mode, you need to set up an expected time by which the check-in must arrive; if it is late, alerting starts.
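The check-in expectation described above can be sketched as follows. This is only an illustration: `overdue_checkins`, the one-minute period and the 30-second grace window are hypothetical, and neither Gatus nor Healthchecks is claimed to work exactly this way.

```python
from datetime import datetime, timedelta, timezone


def overdue_checkins(last_checkin, expected_every, grace, now):
    """Return the names whose last check-in is older than expected_every + grace."""
    deadline = expected_every + grace
    return sorted(name for name, seen in last_checkin.items() if now - seen > deadline)


now = datetime(2021, 3, 12, 9, 0, tzinfo=timezone.utc)
last_checkin = {
    "auth": now - timedelta(seconds=30),      # checked in on time
    "db-ingest": now - timedelta(minutes=10),  # missed several check-ins
}
late = overdue_checkins(
    last_checkin,
    expected_every=timedelta(minutes=1),
    grace=timedelta(seconds=30),
    now=now,
)
print(late)
```

Anything returned by `overdue_checkins` would be a candidate for an alert.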

TwiN · 2021-06-15T22:42:58Z

Haven't really had the time to work on this, unfortunately, but #124 may make implementing this much easier if we were to leverage a global database.

Each individual Gatus instance would be in charge of alerting for its respective services, but the data would be retrieved from a global database that they all share (though this could be made configurable, in that the instance could choose to retrieve only the services it's monitoring, or all services present in the database).

There are obviously a few things that would need some thinking, like how to detect when a Gatus instance's configuration no longer has a given service (because it was deleted) so that we can automatically delete it.

Currently, since there's only one instance, there's no problem, but in a distributed setting, that won't work without a consensus of some sort, or maybe a table with a column to differentiate each individual Gatus instance as well as all the services registered under that instance could suffice?
e.g.

gatus-1 has service-a and service-b
gatus-2 has service-c and service-d
When gatus-2's configuration is modified to remove service-d, gatus-2 would update the table to remove service-d because according to the table, gatus-2 was previously assigned to service-d but service-d is no longer in the configuration
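The gatus-1/gatus-2 example above could be sketched like this. An in-memory dict stands in for the proposed database table, and `reconcile` is a hypothetical helper, not an existing Gatus function.

```python
def reconcile(ownership, instance, configured):
    """Update the ownership table for one instance and return the removed services.

    ownership maps an instance name to the set of services it registered.
    """
    previous = ownership.get(instance, set())
    removed = previous - set(configured)
    ownership[instance] = set(configured)
    return removed


ownership = {
    "gatus-1": {"service-a", "service-b"},
    "gatus-2": {"service-c", "service-d"},
}
# gatus-2's configuration no longer contains service-d.
removed = reconcile(ownership, "gatus-2", ["service-c"])
print(removed)  # {'service-d'}
```

Because each instance only touches its own row, gatus-1's services are never affected by a change to gatus-2's configuration.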

fatihbm · 2021-07-17T17:08:14Z

There are good ideas in here. I hope this feature is released as soon as possible. I'm looking forward to it 🙂

TwiN · 2021-08-11T00:26:36Z

FYI: with #136 merged, #124 is not that far off.

Note to self: Will probably need to add a parameter to control https://github.com/TwinProduction/gatus/blob/acb6757dc800b43b5a24e1fbe0ebf9f64b42df4f/storage/store/store.go#L25-L28

TwiN · 2021-09-05T01:29:33Z

Just to add on #64 (comment) and #64 (comment):

After giving it some thought, this is much easier than I initially anticipated.

The easiest, most barebone implementation I can think of is the following:

Implement Postgres as a storage solution #124 (easy because of Support SQLite as a storage medium #136)
Add config parameter distributed.enabled.
- When set to true, three things happen:
  - storage.type must be set to postgres, or the application fails to start
  - Gatus doesn't require that at least one service is configured. This is for the scenario where you have a "public" instance in the DMZ whose only role is to let users view the dashboard -- it does not monitor any services on its own, only pulls them from Postgres (which SHOULD be getting populated by the "private" Gatus instances).
  - DeleteAllServiceStatusesNotInKeys is never called. If a service is removed from one of the Gatus instances, it must be deleted from the database manually (keep in mind that this is a barebone implementation/MVP. Support for automatically cleaning up could be added later)

And that's it!
Even by a conservative estimate, this is less than a week of work. I just don't know when I'll find the time to work on it.
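For illustration, the three rules attached to `distributed.enabled` in the list above might translate into startup checks like the following. The dict layout, the `memory` default for `storage.type`, and the helper names `validate` and `should_delete_unknown_services` are assumptions for this sketch, not Gatus code.

```python
def validate(config):
    """Return a list of startup errors for the given config dict."""
    errors = []
    distributed = config.get("distributed", {}).get("enabled", False)
    storage_type = config.get("storage", {}).get("type", "memory")  # assumed default
    services = config.get("services", [])
    if distributed and storage_type != "postgres":
        errors.append("distributed.enabled requires storage.type to be postgres")
    if not distributed and not services:
        errors.append("at least one service must be configured")
    return errors


def should_delete_unknown_services(config):
    """In distributed mode, never delete statuses missing from the local config."""
    return not config.get("distributed", {}).get("enabled", False)


# A "public" instance that monitors nothing and only reads from Postgres.
public_instance = {"distributed": {"enabled": True}, "storage": {"type": "postgres"}}
print(validate(public_instance), should_delete_unknown_services(public_instance))
```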

Just thinking about the possibilities that this could bring hypes me up quite a bit.

Consider the following: You're provisioning a fleet of clusters in different Cloud providers/accounts, and each of them has its own Gatus instance -- all of which are configured to monitor their respective "private" cluster, but they're pushing the data into a global Postgres database. To wrap everything up, you have a single "global" Gatus instance which doesn't monitor anything, but it's publicly available, and it exposes the data from each individual cluster.

Anyways, I digress.

mapl · 2022-02-21T14:33:32Z

I am looking forward to the implementation of this feature

appleboy · 2022-07-28T07:31:26Z

I propose one solution: server and agent.

The agent makes a new request to the server.
The server responds with an endpoint that has not been handled yet.
The agent starts the watchdog to handle that endpoint.
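A minimal in-memory sketch of that handshake is shown below. The `Server` class and its `claim` method are hypothetical and leave out networking, agent failure and re-assignment.

```python
class Server:
    """Tracks which endpoints are still waiting for an agent."""

    def __init__(self, endpoints):
        self.unassigned = list(endpoints)
        self.assigned = {}

    def claim(self, agent_id):
        """Hand the next unhandled endpoint to the agent, or None if all are taken."""
        if not self.unassigned:
            return None
        endpoint = self.unassigned.pop(0)
        self.assigned[endpoint] = agent_id
        return endpoint


server = Server(["https://a.example.org", "https://b.example.org"])
print(server.claim("agent-1"))
print(server.claim("agent-2"))
print(server.claim("agent-3"))
```

The third agent gets nothing because every endpoint already has a watchdog.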

THIS IS AN EXPERIMENTAL FEATURE/IMPLEMENTATION, AND IT MAY BE REMOVED IN THE FUTURE. Note that for now, it will be an undocumented feature.

TwiN · 2022-07-29T00:28:33Z

So I just had a very simple idea for a temporary solution, and I just had to give it a shot (see #307).

I still think that the best approach to this problem will be by leveraging a shared database, but for now, this might do the trick for some of you.

Basically, all it does is retrieve the endpoint statuses from another remote host before returning the endpoint statuses. You may specify an endpoint prefix to prefix the name of all endpoints coming from a given remote instance with a string. It's an extremely shallow/lazy implementation, but to group endpoint statuses or bypass firewalls, this should do the trick.

remote:
  instances:
    - endpoint-prefix: "myremoteinstance-"
      url: "https://status.example.org/api/v1/endpoints/statuses"

Note that I haven't documented the feature yet, because it's experimental and it may be removed and/or updated.

Anyways, I'd love it if some of you could give it a try and let me know how it works.

TwiN · 2022-08-22T22:32:23Z

One of the issues with #307 is that clicking on an individual endpoint in the UI does not work. In other words, the page for viewing individual endpoints does not work if said endpoint comes from a remote Gatus instance.

I'm going to release this with v4.1.0, but I'm strongly considering getting rid of that implementation, unless somebody is actually using it and finds it helpful.

nzsambo · 2022-08-25T18:51:05Z

Thanks @TwiN for your efforts here, even if you're not happy with the implementation, it's a great start!

I would use this to monitor the same group of services from multiple locations, rather than just bypass security. I get that throwing an instance behind a NAT firewall is definitely great, even better is being able to see metrics from different perspectives of the Internet. My old smokeping installs highlight issues that are only evident from certain locations, perhaps caused by a particular ISP.

Monitoring the same set of targets from all offices of an organisation could also be an advantage.

The original problem this commit is trying to solve was secured subnets, and this implementation would require port forwarding or pinholes as you mentioned in a very early comment. On the whole push-pull argument my vote would definitely be push from the agents / remote clients. An api listening on 80/443 on the main instance with a simple static token I reckon. It gets around inbound firewalls.

I have been meaning to try configuring remote gatus instances to use the database on the main instance. I would carefully configure the groups and services on each node. It wouldn't be ideal but it might work for me.

Anyway, I'll try this commit and report back.

Cheers!

Sam

hypervtechnics · 2022-08-26T13:23:39Z

I also have the same or a very similar use case as @nzsambo and I also agree on this being a push mechanism.

I'd also recommend a static token per agent. The configuration should look something like this:

# Doesn't matter whether array with name of agent or dictionary with name as key ;)
agents:
  - name: internet-pov
    token: abc123
    # Would contain a list of endpoints the agent is allowed to receive the configuration for (may also be defined as a file on the agent) and to update the status of from its point of view. If empty, all are allowed:
    endpoints:
      - name: some-endpoint-defined-in-endpoints
        critical: true # If the agent is unavailable or reports the status for this endpoint as down/degraded this will mark the endpoint on the host as down instead of degraded

This would also require adding a new dimension to an endpoint check result to record the point of view (agent or host). Maybe an option could be added to define the distribution behaviour, with a scope (host: only the Gatus host is allowed; agent: only the Gatus agents are allowed, possibly just one agent depending on the multiplier; all) and a multiplier/point-of-view mode (single: only one instance, which is the default if the scope is host; all: all instances in scope)?

But please be aware of a version/api dependency from agent to host. The agent could be the regular gatus container in an agent mode where certain features (like Web UI) are disabled. Also the metrics sent to the host can be removed from the local storage.
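As an illustration of the per-agent static token proposed above, the host could authorize incoming results roughly like this. The `AGENTS` table mirrors the suggested YAML; `is_allowed` is a hypothetical helper, and none of this exists in Gatus.

```python
import hmac

AGENTS = {
    "internet-pov": {
        "token": "abc123",
        "endpoints": ["some-endpoint-defined-in-endpoints"],
    },
}


def is_allowed(agents, agent_name, token, endpoint):
    """Check the agent's static token and whether it may report on the endpoint."""
    agent = agents.get(agent_name)
    if agent is None:
        return False
    # compare_digest avoids leaking the token through timing differences.
    if not hmac.compare_digest(agent["token"].encode(), token.encode()):
        return False
    allowed = agent.get("endpoints", [])
    return not allowed or endpoint in allowed  # an empty list allows all endpoints


print(is_allowed(AGENTS, "internet-pov", "abc123", "some-endpoint-defined-in-endpoints"))
print(is_allowed(AGENTS, "internet-pov", "wrong", "some-endpoint-defined-in-endpoints"))
```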

praveen-livspace · 2022-09-16T09:43:59Z

This is super helpful @TwiN . It helps in separation of concerns and a single pane of glass for observability. I would suggest enhancing the feature instead.

xeruf · 2022-10-24T07:35:24Z

I don't think this is an edge-case if you think bigger: I am considering a HA setup, but separate from the kubernetes cluster gatus should monitor to prevent entanglement. The proposed server-agent solutions also don't solve that, unless you can distribute the server, at which point you arrive at a design similar to kubernetes: server instance as control plane and agents as nodes.

I want an instance of gatus running inside our servers in Frankfurt, on the same machine as the services to monitor, and another one in the server on our office (similar to previous comment #64 (comment)). Both should be visible in a single dashboard, but this dashboard needs to survive failure of any of the (two) nodes!
(I can add a third one if that's needed as typically in proper HA)

With HA like this, gatus would become a true enterprise-ready tool.

hypervtechnics · 2022-10-24T08:59:02Z

I think HA and a distributed deployment should be seen as two different things. Both have their challenges. How do you imagine a HA setup to work internally? Same database in the background? Syncing between the instances?

xeruf · 2022-10-30T21:56:04Z

Now that I revisit this, yes, both these concerns can be handled separately (but then also combined), and there is already quite an elaborate issue about HA: #176

However, I think the ideal solution would combine both approaches, as you might need multiple instances active simultaneously to query all services, which each should be highly available, so both issues should be considered together.

hypervtechnics · 2022-11-13T20:36:35Z

Some orientation from another project, which also contains some of the ideas I already mentioned: https://oss.oetiker.ch/smokeping/doc/smokeping_master_slave.en.html

adamreed90 · 2024-02-27T00:44:20Z

We have multiple locations, so a distributed testing agent is essential for us to ensure our services are available from all intranet locations and to see how they're performing. It's the inverse of most status pages (where one agent tests various locations); we have been looking for the opposite forever! Would love to see this!

TwiN · 2024-04-28T23:55:06Z

FYI, I implemented external endpoints not long ago and it can serve as a way to bypass connectivity challenges.

Long story short, rather than Gatus making the requests, you can now make the requests from within your own environments using whatever tool/application you have internally, and then push the results to Gatus.
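A sketch of what such a push could look like from a script follows. The URL shape (`/api/v1/endpoints/{key}/external?success=...`) and the bearer token header reflect my understanding of the Gatus README at the time of writing, so treat them as assumptions and verify them against your version; `build_push_request`, the host, the key and the token are placeholders.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_push_request(base_url, key, token, success, error=""):
    """Build (but do not send) a POST reporting one result to Gatus.

    key is the external endpoint's key as defined by the Gatus documentation;
    token is the token configured for that external endpoint.
    """
    query = {"success": "true" if success else "false"}
    if error:
        query["error"] = error
    path = f"/api/v1/endpoints/{quote(key, safe='')}/external"
    url = f"{base_url.rstrip('/')}{path}?{urlencode(query)}"
    return Request(url, method="POST", headers={"Authorization": f"Bearer {token}"})


req = build_push_request(
    "https://status.example.org", "core_internal-db", "my-token", success=True
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Since only outbound HTTPS from the private network is needed, this avoids opening inbound firewall rules toward the monitored environment.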

marrold · 2024-05-18T23:25:08Z

WARNING: This feature is a candidate for removal in future versions. Please comment on the issue above if you need this feature.

This feature seems useful to me, although the broken click on endpoint is unfortunate

joryirving · 2024-05-20T16:12:02Z

So I'm currently using external endpoints. I have one primary, externally available instance, as well as two local instances. It's polling those instances, and I'm wondering if it would be better to have those push instead.

Additionally, I wonder if pushing to the primary gatus would store the endpoints data in the DB, as primary is using psql, but the other two are storing in memory, so I lose all the data on container restart.

bradbendy · 2024-06-04T16:34:45Z

Not sure if I should be posting this here or in a new PR. I have 3 remote-instances right now, and everything seems to work fine. When a service on a remote-instance fails a check, it is reported, and all of that works well.

I added an endpoint on the master node to do an HTTP check against the API and alert me if that fails; I figured it's the easiest way to see if the remote-instance is alive and well. In my testing, if that endpoint fails OR ANY of the remote-instances go offline, they disappear from the GUI and only the local endpoints appear. I would expect that if one remote-instance is offline, the others would still continue to work, as nothing depends on it.

I looked at the disable-monitoring-lock but that does not change the behavior. The log shows "silently failed to retrieve endpoint statues" and never tries any others, just stops going to anything else in the config.

Is this the intended behavior that can't be changed, or does it look like a bug? I don't see how that would be useful, since you then have no idea what is down because every single endpoint disappears. Running the latest GitHub download as of June 3rd 2024.

Thanks!
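The behavior expected in the report above, where one unreachable remote instance does not hide the others, can be sketched as below. This is not Gatus code: `collect_statuses` and the injected `fetch` callable are hypothetical, and only the `url` and `endpoint-prefix` keys come from the remote configuration shown earlier in the thread.

```python
def collect_statuses(local_statuses, remotes, fetch):
    """Merge local statuses with those of every reachable remote instance.

    remotes is a list of dicts with "url" and "endpoint-prefix" keys; fetch is
    any callable that takes a URL and returns a list of status dicts.
    """
    merged = list(local_statuses)
    failed = []
    for remote in remotes:
        try:
            statuses = fetch(remote["url"])
        except Exception as exc:  # one unreachable instance must not hide the others
            failed.append((remote["url"], str(exc)))
            continue
        prefix = remote.get("endpoint-prefix", "")
        merged.extend({**s, "name": prefix + s["name"]} for s in statuses)
    return merged, failed


def fake_fetch(url):
    """Stand-in for an HTTP call: one instance is down, the other answers."""
    if "down" in url:
        raise ConnectionError("unreachable")
    return [{"name": "api"}]


remotes = [
    {"url": "https://down.example.org", "endpoint-prefix": "a-"},
    {"url": "https://up.example.org", "endpoint-prefix": "b-"},
]
merged, failed = collect_statuses([{"name": "local"}], remotes, fake_fetch)
print([s["name"] for s in merged], failed)
```

The key point is the `continue` after a failed fetch: the loop moves on to the next instance instead of returning early.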

muhlba91 · 2024-07-04T15:26:41Z

i have the same use case that there's one public instance but i have two secured networks where i want to display the service status on the public instance.
a pull as per the remote configuration doesn't work because the private instances are not reachable from outside the network and neither is external-endpoints helpful here, or at least i wasn't able to figure out how to configure a private gatus instance to push results to the external one.
i agree that some kind of server-agent model would be great where the agents are just pushing their results to the main/server instance.

TheAnachronism · 2024-08-24T00:40:31Z

> FYI, I implemented external endpoints not long ago and it can serve as a way to bypass connectivity challenges.
>
> Long story short, rather than Gatus making the requests, you can now make the requests from within your own environments using whatever tool/application you have internally, and then push the results to Gatus.

Can Gatus itself act as that internal tool? Basically, just have two instances running and one pushes its results to the other.

zubieta · 2024-10-24T11:04:17Z

I was trying to just push some statuses from one Gatus to another via external-endpoints and custom alerts. Unfortunately, it looks like it is not possible to trigger a request for every status check (since alerts currently only trigger on failures after some threshold). But I think it is worth checking if we could modify the alerting mechanism, since it might involve less extra implementation work.

TwiN · 2024-10-25T01:17:21Z

@zubieta that strategy never crossed my mind, but that's a very clever way to simulate remote instances without the pull mechanism

TwiN changed the title ~~RFC: Distributed Deployment Configuration~~ Distributed Deployment Configuration Jan 30, 2021

TwiN added the feature New feature or request label Jan 30, 2021

TwiN pinned this issue Feb 18, 2021

TwiN mentioned this issue Mar 12, 2021

Feature Request: client health update #95

Closed

TwiN mentioned this issue Jun 15, 2021

Scalability limits? #104

Closed

TwiN mentioned this issue Jul 12, 2021

Support SQLite as a storage medium #136

Closed

TwiN mentioned this issue Sep 17, 2021

High availability mode #176

Open

TwiN unpinned this issue Dec 12, 2021

TwiN pinned this issue Jul 28, 2022

TwiN added a commit that referenced this issue Jul 29, 2022

feat(remote): Implement lazy distributed feature (#64)

15e6107

THIS IS AN EXPERIMENTAL FEATURE/IMPLEMENTATION, AND IT MAY BE REMOVED IN THE FUTURE. Note that for now, it will be an undocumented feature.

TwiN mentioned this issue Jul 29, 2022

feat(remote): Implement lazy distributed feature #307

Merged

TwiN added a commit that referenced this issue Jul 29, 2022

feat(remote): Implement lazy distributed feature (#64)

1aa94a3

THIS IS AN EXPERIMENTAL FEATURE/IMPLEMENTATION, AND IT MAY BE REMOVED IN THE FUTURE. Note that for now, it will be an undocumented feature.

TwiN mentioned this issue Sep 13, 2022

Detail page of remote endpoints throw 404 #329

Open
