Concurrency, distributed computing, and time #1504

edwardalee · 2022-12-12T19:24:09Z

edwardalee
Dec 12, 2022
Maintainer

I am starting this discussion to collect some foundational background in concurrency, distributed computing, and time, as it relates to LF. I think this discussion could be useful as we start to seriously work on fault tolerance and mutations. Feel free to edit this directly.

Clock Synchronization and Mutations

Brewer has an interesting discussion of the role of TrueTime, the clock synchronization mechanism in Google Spanner. He says that Spanner chooses consistency over availability despite its use of something like our decentralized coordination, which chooses the opposite. This seems to be because it overlays Paxos and a two-phase commit protocol. He cites several key roles that TrueTime plays, but one important one that we have not yet exploited is the ability to create snapshots from which one can recover from failures.

Mutations have a potentially interesting relationship to schema changes in databases.
These can be facilitated using clock synchronization.
Brewer states about the clock synchronization in Spanner:

Snapshots are about the past, but you can also agree on the future. A feature of Spanner is that you can agree on the time in the future for a schema change. This allows you to stage the changes for the new schema so that you are able to serve both versions. Once you are ready, you can pick a time to switch to the new schema atomically at all replicas.
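To make the "agree on the future" idea concrete, here is a minimal sketch, assuming the replicas have synchronized clocks so that integer timestamps are comparable across them. The `Replica` class and its methods are hypothetical names for illustration and are not Spanner's API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A replica that stages a new schema and switches at an agreed timestamp."""
    active_schema: str
    staged: dict = field(default_factory=dict)  # switch_time -> schema name

    def stage(self, schema: str, switch_time: int) -> None:
        # Staging has no visible effect before switch_time.
        self.staged[switch_time] = schema

    def schema_at(self, timestamp: int) -> str:
        # Use the staged schema with the latest switch time <= timestamp.
        due = [t for t in self.staged if t <= timestamp]
        return self.staged[max(due)] if due else self.active_schema


replicas = [Replica("v1") for _ in range(3)]
for r in replicas:
    r.stage("v2", switch_time=100)

before = {r.schema_at(99) for r in replicas}   # all replicas still serve v1
after = {r.schema_at(100) for r in replicas}   # all replicas serve v2
```

Because every replica evaluates the same rule against the same timestamp, the switch is atomic with respect to time without any message exchange at the switch itself. A mutation scheduled at a future tag could plausibly work the same way.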

Essential Reading

The following papers are particularly useful background reading:

Schneider (1990): Tutorial on fault tolerance in distributed systems.
Fischer, Lynch, and Paterson (1985): FLP: only two of safety (consistency), liveness, and fault tolerance can be achieved.
Liskov (1991): Practical uses of synchronized clocks in distributed systems.

ChadliaJerad · 2023-01-24T14:26:20Z

ChadliaJerad
Jan 24, 2023
Collaborator

Based on previous discussions with @edwardalee, this proposal addresses the dynamic federated execution problem. The basic idea is to endow federates with the capability to join and leave a federation at runtime. To this end, federate reactors are classified into two categories: persistent and transient. The table below elaborates on the differences:

| | Persistent federate | Transient federate |
| --- | --- | --- |
| Start of execution | Mandatory to start | Not mandatory |
| Lifetime | Spans the entire LF app lifetime | Can join and leave at arbitrary times |
| Point of failure | Yes | No |

In this combination, the RTI has the properties of a persistent federate.

Illustrative example

The Software Defined Vehicle (SDV) is a target example, where vehicles (transient federates) can arrive at and leave an intersection. The intersection (part of the infrastructure) is a persistent federate reactor.

A very naive LF implementation can be:

Implementation wise

Backward compatibility can be guaranteed using the following syntax:
RTI -n <#PersistentFederates> -t <#TransientFederates>.
By default, a federate is persistent. Transient federates can be specified using annotations.
The persistent and transient properties apply only at the top level.
If a reactor is annotated as transient at the top level while the top-level reactor is not a federation, then the behavior defaults to the static one.
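The intended start-up rule can be sketched as a small model. This is only an illustration under two assumptions that the proposal does not spell out: the federation may start as soon as all persistent federates have joined, and `-t` is an upper bound on the number of transient federates connected at once. `RtiModel` is a hypothetical name, not the actual RTI code.

```python
from dataclasses import dataclass, field


@dataclass
class RtiModel:
    num_persistent: int   # value given with -n
    num_transient: int    # value given with -t (assumed to be an upper bound)
    persistent_joined: set = field(default_factory=set)
    transient_joined: set = field(default_factory=set)

    def join(self, fed_id: str, transient: bool = False) -> bool:
        """Return True if the join request is accepted."""
        joined = self.transient_joined if transient else self.persistent_joined
        limit = self.num_transient if transient else self.num_persistent
        if len(joined) >= limit:
            return False
        joined.add(fed_id)
        return True

    def leave(self, fed_id: str) -> None:
        # Only transient federates may leave while the federation runs.
        self.transient_joined.discard(fed_id)

    def can_start(self) -> bool:
        # Transient federates are not required for the federation to start.
        return len(self.persistent_joined) == self.num_persistent


rti = RtiModel(num_persistent=1, num_transient=2)
waiting = rti.can_start()              # no persistent federate yet
rti.join("intersection")
started = rti.can_start()              # starts without any vehicle
accepted = [rti.join(v, transient=True) for v in ("car1", "car2", "car3")]
rti.leave("car1")
late_join = rti.join("car3", transient=True)
```

With `-t 0` the model reduces to the current static behavior, which is the backward-compatibility argument made above.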

The current RTI execution pattern is sketched in the UML-like activity diagram below on the left. The diagram on the right is a first draft of the desired behavior. It will be updated as design decisions are adopted.

Challenges

There are at least two main challenges:

1. Keeping logical time consistent across reactors at all times.
2. Deciding what to do in the case of decentralized coordination.

To address challenge 1, we can start from the proposal in the UML-like sequence diagram below (derived from discussions with Edward). Please note that the diagram does not follow the exact semantics of UML!

Relationship to Mutation

This capability can be seen as an alternative to using Mutations in the way they were defined and used in the Accessors project.
AFAIK, in Lingua Franca, mutations are a particular case of reactions. This proposal suggests that Mutators be a particular case of reactors, with the following analogy:

Persistent federate reactor ⇒ Reactor
Transient federate reactor ⇒ Mutator

In this sense, mutators will enable updates to the inner topology of a reactor.

Possible mid-term future directions

The described mechanism can be further augmented to support fault tolerance using hot spares (learned from @hokeun). In such a scenario, a persistent federate will have a transient twin as a backup. The persistent federate will periodically (or when needed) send its logical time and important state values to its transient twin. Maybe alive messages as well?
Then, when the persistent federate fails, the transient one is upgraded to become persistent?... To be discussed.

There is possibly a nice opportunity to make Lingua Franca a framework that enables a deploy-forever strategy (as opposed to deploy-once). In fact, when replacing a transient federate with a new one, its new topology may cause zero-delay loops to appear. The Lingua Franca framework can make sure that no such undesirable situation happens.

Another add-on to the RTI infrastructure is to create messages that update the STP or deadline values based on the newly adopted topology of the transient federate (if it is a Mutator). The framework derives the new values, and when a transient federate is to be updated, it tells the RTI what new values to broadcast among the running reactors (before everybody starts)... To be discussed.


cmnrd · 2023-01-26T15:47:38Z

cmnrd
Jan 26, 2023
Maintainer

The proposal seems to discuss mostly technical aspects of how the RTI can interact with transient federates. I wonder how this translates to the semantics of LF and what implications it has for programs. For instance: What does it mean for a federate to join and leave? When are shutdown and startup of transient reactors invoked? How can a federate that joined leave again? What happens in case of network failure (is this considered leaving)? What happens to messages that are sent to a transient federate while it is not connected? Are the transient reactors expected to have persistent state, or are they restarted every time? If they are restarted, how can we deal with messages that arrive (logically) after a reconnect, but that were actually meant for a previous connection?

These questions are probably hard to answer at the moment, but I think we should try to get a clearer picture of the intended semantics and application requirements before thinking about concrete protocols.

1 reply

edwardalee Mar 6, 2023
Maintainer Author

With apologies for the delay, here is an attempt to write down what @ChadliaJerad and I discussed about the meaning a couple of months ago:

Meaning of Transient Federates

We propose that at any tag g, a transient federate is either present or absent. If it is absent, then, semantically, it produces no output events at g and does not react to inputs at g (such inputs need not be presented to it). If it is present at g, then it reacts as usual to inputs and can produce outputs.

The transition of a federate from absent to present or from present to absent is an event that triggers startup and shutdown reactions, respectively. Note that this is a bit similar to entering a mode with a reset transition, but also a bit different: it is more like a mode where all entering transitions are reset transitions. Hence, a federate that becomes present and then absent is not really the same entity as a federate with the same ID that enters later.
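A minimal sketch of this presence semantics, assuming tags are (time, microstep) pairs compared lexicographically. `TransientFederate`, its `count` state, and the choice to treat the federate as absent immediately after `resign` are illustrative assumptions, not part of the LF runtime.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

Tag = Tuple[int, int]  # (logical time, microstep), compared lexicographically


@dataclass
class TransientFederate:
    fed_id: str
    incarnation: int = 0             # counts joins; each join is a new entity
    joined_at: Optional[Tag] = None  # None while the federate is absent
    count: int = 0                   # example state, reset on every join
    events: list = field(default_factory=list)

    def join(self, g: Tag) -> None:
        # Absent -> present: behaves like a reset transition plus startup.
        self.incarnation += 1
        self.count = 0
        self.joined_at = g
        self.events.append(("startup", g))

    def resign(self, g: Tag) -> None:
        # Present -> absent: triggers shutdown.
        self.events.append(("shutdown", g))
        self.joined_at = None

    def react(self, g: Tag, value: int) -> Optional[int]:
        # An absent federate neither reacts nor produces output.
        if self.joined_at is None or g < self.joined_at:
            return None
        self.count += value
        return self.count


fed = TransientFederate("car")
out_absent = fed.react((0, 0), 5)   # absent: no reaction, no output
fed.join((1, 0))
out_first = fed.react((1, 0), 5)
fed.resign((2, 0))
fed.join((3, 0))
out_second = fed.react((3, 0), 7)   # state from the first incarnation is gone
```

The second incarnation starts from a clean state, which is what distinguishes it from the federate with the same ID that was present earlier.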

In centralized coordination, if a present federate B has an upstream absent federate A, then tag advancement at B to tag g depends on the possibility of A joining the federation before or at g. The mechanism @ChadliaJerad describes above ensures that, while A is absent, an RTI performing centralized coordination does not need to communicate with A to allow B to advance its tag. This is an essential requirement of any mechanism that tolerates transient federates.
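One way to satisfy this requirement can be sketched as follows. The sketch assumes a deliberately simplified grant condition (every present upstream federate has completed tag g) and assumes that the RTI picks the join tag of A to be strictly later than any tag already granted downstream of A. Both are assumptions made for illustration; `CentralizedRti` is a hypothetical model and not the RTI implementation.

```python
NEVER = (-1, 0)  # sentinel tag smaller than any real tag


class CentralizedRti:
    def __init__(self, upstream):
        self.upstream = upstream  # federate -> list of upstream federates
        self.present = set()      # federates currently in the federation
        self.completed = {}       # federate -> latest tag it has completed
        self.granted = {}         # federate -> latest tag granted to it

    def grant(self, fed, g):
        # Absent upstream federates are skipped: no communication is needed.
        ok = all(self.completed.get(u, NEVER) >= g
                 for u in self.upstream.get(fed, []) if u in self.present)
        if ok:
            self.granted[fed] = max(self.granted.get(fed, NEVER), g)
        return ok

    def join(self, fed):
        # Choose a join tag after everything already granted downstream,
        # so earlier grants remain valid.
        downstream = [d for d, ups in self.upstream.items() if fed in ups]
        latest = max((self.granted.get(d, NEVER) for d in downstream),
                     default=NEVER)
        join_tag = (0, 0) if latest == NEVER else (latest[0], latest[1] + 1)
        self.present.add(fed)
        self.completed[fed] = latest  # nothing was produced while absent
        return join_tag


rti = CentralizedRti({"A": [], "B": ["A"]})
rti.present.add("B")
granted_while_absent = rti.grant("B", (5, 0))  # A absent: B advances freely
join_tag = rti.join("A")                       # one microstep after (5, 0)
blocked = rti.grant("B", (6, 0))               # A present but not yet at (6, 0)
rti.completed["A"] = (6, 0)
unblocked = rti.grant("B", (6, 0))
```

Once A is present, B waits on it like on any other upstream federate, so the static behavior is unchanged for federations without transient federates.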

For decentralized coordination, the STA (safe-to-advance) offset of a federate B that is downstream of an absent federate A must take into account not just clock synchronization errors and network latency, but also the startup time of A. I think this can be mitigated by ensuring that the tag g at which A joins the federation is set sufficiently ahead of A's local physical clock to compensate for this time and by allowing A to perform startup functions ahead of physical time.
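As a rough worked example of the arithmetic, assuming the three contributions simply add up and that all times are in nanoseconds (the function names and the numbers are made up for illustration):

```python
def sta_offset(clock_sync_error_ns: int, network_latency_ns: int,
               startup_time_ns: int = 0) -> int:
    """STA needed at B if all delays must be absorbed by B."""
    return clock_sync_error_ns + network_latency_ns + startup_time_ns


def join_time(physical_now_ns: int, startup_time_ns: int,
              margin_ns: int = 0) -> int:
    """Logical time of the join tag, set ahead of A's physical clock."""
    return physical_now_ns + startup_time_ns + margin_ns


# Without mitigation, B absorbs A's startup time in its STA.
naive = sta_offset(1_000_000, 5_000_000, startup_time_ns=200_000_000)

# With mitigation, A joins at a tag far enough in the future, so B's STA
# only covers clock synchronization error and network latency.
mitigated = sta_offset(1_000_000, 5_000_000)
tag_time = join_time(physical_now_ns=10_000_000_000,
                     startup_time_ns=200_000_000)
```

The mitigation moves the startup cost from every downstream STA into a one-time lead on the join tag, which matters when startup takes much longer than network latency.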

Federates that resign from the federation at tag g present no difficulty for either centralized or decentralized coordination, although we may want a downstream federate to receive a notification of this fact. E.g., given an input x to a federate B, we might write:

    reaction(x) {=
        ... react to inputs from A ...
    =} resign {=
        ... react to the resignation of A ...
    =}

Given such a mechanism, it may be reasonable to allow any federate to resign at any arbitrary tag in response to some local event. If we allow this, then a "transient" federate is simply one that can also join after the federation has started.

A good test case for transient federates would be a Chord algorithm or, more simply, a leader election algorithm.

Failure of a Federate

If federate A is upstream of federate B and A fails (crashes), then it does not resign, and therefore, there is no well-defined tag g at which it becomes absent. I believe this should be treated as a fault condition (like an STP violation or a deadline violation). This could be handled something like this:

    reaction(x) {=
        ... react to inputs from A ...
    =} failure {=
        ... react to the detected failure of A ...
    =}

In centralized coordination, this would be relatively easy to implement. The RTI detects a failure when it attempts a socket communication with A and the function returns an error. Instead of exiting (the current behavior of the RTI), it should treat this as a failure, choose a logical time for that failure, and notify B. Alternatively, the failure could use the physical time at B, in which case the difference between resign and failure is like the difference between logical connections and physical connections. In the latter case, federate B must not rely on having received any "last words" from A.
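A minimal sketch of the first option, using stub links instead of real sockets so that it runs anywhere. The choice of the failure tag (one microstep after the RTI's current tag) and all names here are assumptions for illustration.

```python
class LiveLink:
    """Stand-in for a working socket to a federate."""
    def __init__(self):
        self.inbox = []

    def send(self, msg):
        self.inbox.append(msg)


class DeadLink:
    """Stand-in for a socket whose peer has crashed."""
    def send(self, msg):
        raise ConnectionResetError("peer closed the connection")


def rti_send(links, downstream, target, msg, current_tag):
    """Send msg to target; on error, notify downstream federates instead of exiting."""
    try:
        links[target].send(msg)
        return True
    except OSError:
        # Assumption: the failure is assigned the next microstep.
        failure_tag = (current_tag[0], current_tag[1] + 1)
        for fed in downstream.get(target, []):
            links[fed].send(("failure", target, failure_tag))
        return False


links = {"A": DeadLink(), "B": LiveLink()}
downstream = {"A": ["B"]}
delivered = rti_send(links, downstream, "A", ("tag_advance", (4, 0)), (4, 0))
notification = links["B"].inbox[-1]
```

B then sees the failure as an ordinary tagged event, which is what would let a `failure` handler like the one above be invoked deterministically with respect to B's other inputs.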

cmnrd · 2023-01-26T15:57:36Z

cmnrd
Jan 26, 2023
Maintainer

To reply to one of Edward's comments:

He cites several key roles that TrueTime plays, but one important one that we have not yet exploited is the ability to create snapshots from which one can recover from failures.

I think I have mentioned Ambrosia before: https://irenezhang.net/papers/ambrosia-vldb19.pdf. It's a project from Microsoft Research where they have "immortal" services. The Immortals record all incoming requests and take regular snapshots. In case of failure, the snapshot is restored and then all messages recorded after the last snapshot are re-sent. This, however, only works if the services are deterministic (which, by the way, is a problem they have; when we talked to Jonathan Goldstein, he said that they don't know how to ensure that the services are deterministic). I think the concepts of Ambrosia could pair well with federated LF, but so far I haven't had the time to look into it.
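The record, snapshot, and replay idea can be sketched in a few lines. This is only a toy model of the concept as described above, not Ambrosia's API; `ImmortalCounter` and its methods are hypothetical names.

```python
import copy


class ImmortalCounter:
    """Toy deterministic service with request logging and snapshots."""

    def __init__(self):
        self.state = {"total": 0}
        self.log = []                        # every incoming request, in order
        self.snapshot = copy.deepcopy(self.state)
        self.snapshot_index = 0              # log length at the last snapshot

    def handle(self, request, record=True):
        if record:
            self.log.append(request)
        self.state["total"] += request       # deterministic state update

    def take_snapshot(self):
        self.snapshot = copy.deepcopy(self.state)
        self.snapshot_index = len(self.log)

    def recover(self):
        # Restore the snapshot, then replay requests recorded after it.
        self.state = copy.deepcopy(self.snapshot)
        for request in self.log[self.snapshot_index:]:
            self.handle(request, record=False)


svc = ImmortalCounter()
for req in (1, 2):
    svc.handle(req)
svc.take_snapshot()
for req in (3, 4):
    svc.handle(req)
expected = svc.state["total"]
svc.state = {"total": -1}    # simulate losing the in-memory state
svc.recover()
recovered = svc.state["total"]
```

Recovery reproduces the lost state only because `handle` is deterministic. Since reactions in LF are deterministic given tagged inputs, a federate could in principle be replayed the same way from a log of its tagged input messages.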

