Concurrency, distributed computing, and time #1504
-
Based on previous discussions with @edwardalee, this proposal addresses the dynamic federated execution problem. The basic idea is to give federates the ability to join and leave a federation at runtime. To this end, federate reactors are classified into two categories: persistent and transient. The table below elaborates on the differences:
In this combination, the RTI has the properties of a persistent federate.
Illustrative example
A Software Defined Vehicle (SDV) is a target example, where vehicles (transient federates) can reach and leave an intersection. The intersection (part of the infrastructure) is a permanent federate reactor. A very naive LF implementation can be:
Implementation wise
The current RTI execution pattern is sketched in the UML-like activity diagram below on the left. The diagram on the right is a first draft of the desired behavior. It is to be updated whenever design decisions are adopted.
Challenges
There are at least two main challenges:
To address challenge 1, we can start with the proposal in the UML-like sequence diagram below (derived from discussions with Edward). Please note that the diagram does not follow the exact semantics of UML!
Relationship to Mutation
This capability can be seen as an alternative to using mutations, as they were defined and used in the Accessors project.
In this sense, mutators will enable updates to the inner topology of a reactor.
Possible mid-term future directions
The described mechanism can be further augmented to support fault tolerance using hot spares (learned from @hokeun). In such a scenario, a permanent federate will have a transient twin as a backup. The permanent reactor will periodically (or when needed) send its logical time and important state values to its transient twin. Maybe alive messages as well?
There is also a nice opportunity to make Lingua Franca a framework that enables a deploy-forever strategy (as opposed to deploy once). When a transient federate is replaced with a new one, its new topology may cause zero-delay loops to appear. The Lingua Franca framework can make sure that no such undesirable situation happens.
Another add-on to the RTI infrastructure is creating messages that update the STP or deadline values based on the newly adopted topology of the transient federate (if it is a Mutator). The framework derives the new values, and when a transient is to be updated, it tells the RTI what new values to broadcast among running reactors (before everybody starts)... To be discussed.
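One way to picture the zero-delay-loop check mentioned above is as a cycle search restricted to the zero-delay connections of the topology that would result from the update. The following is only a minimal Python sketch under that assumption, not the actual check performed by the LF compiler or the RTI; the `Connection` type, the reactor names, and `find_zero_delay_cycle` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    src: str
    dst: str
    delay_ns: int  # 0 means the connection is logically instantaneous


def find_zero_delay_cycle(connections):
    """Return the reactor names along one zero-delay cycle, or None."""
    # Build a graph that contains only the zero-delay connections.
    graph = {}
    for c in connections:
        if c.delay_ns == 0:
            graph.setdefault(c.src, []).append(c.dst)
            graph.setdefault(c.dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, done
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: a cycle closes here
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical topologies: the current one has a delay on the return path,
# the proposed replacement for the transient federate removes it.
current = [
    Connection("intersection", "vehicle", 0),
    Connection("vehicle", "intersection", 10_000_000),
]
proposed = [
    Connection("intersection", "vehicle", 0),
    Connection("vehicle", "intersection", 0),
]
```

With these inputs, `find_zero_delay_cycle(current)` finds nothing, while `find_zero_delay_cycle(proposed)` reports the loop through `intersection` and `vehicle`, so the update could be rejected before the new transient joins.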
-
The proposal seems to discuss mostly technical aspects of how the RTI can interact with transient federates. I wonder how this translates to the semantics of LF and what implications it has for programs. For instance: What does it mean for a federate to join and leave? When are shutdown and startup of transient reactors invoked? How can a federate that joined leave again? What happens in case of network failure (is this considered leaving)? What happens to messages that are sent to a transient federate while it is not connected? Are the transient reactors expected to have persistent state, or are they restarted every time? If they are restarted, how can we deal with messages that arrive (logically) after a reconnect, but that were actually meant for a previous connection? These questions are probably hard to answer at the moment, but I think we should try to get a clearer picture of the intended semantics and application requirements before thinking about concrete protocols.
-
To reply to one of Edward's comments:
I think I have mentioned Ambrosia before: https://irenezhang.net/papers/ambrosia-vldb19.pdf. It's a project from Microsoft Research where they have "immortal" services. The Immortals record all incoming requests and take regular snapshots. In case of failure, the snapshot is restored and then all messages recorded after the last snapshot are resent. This, however, only works if the services are deterministic (which, by the way, is a problem they have; when we talked to Jonathan Goldstein, he said that they don't know how to ensure that the services are deterministic). I think the concepts of Ambrosia could pair well with federated LF, but so far I haven't had the time to look into it.
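The record, snapshot, and replay idea described above can be illustrated with a small Python sketch. This is not Ambrosia's API; the class `RecoverableService` and its methods are hypothetical, the "requests" are plain integers, and the log and snapshot live in memory here, whereas a real system would keep them on durable storage. The sketch also shows why determinism matters: recovery only reproduces the lost state if `_apply` always gives the same result for the same state and request.

```python
import copy


class RecoverableService:
    """Deterministic service that logs its inputs and snapshots its state."""

    def __init__(self):
        self.state = {"total": 0}
        self.log = []  # every request received so far, in arrival order
        self.snapshot = copy.deepcopy(self.state)
        self.snapshot_len = 0  # number of logged requests the snapshot covers

    def _apply(self, request):
        # Must be deterministic: same state + same request -> same new state.
        self.state["total"] += request

    def handle(self, request):
        self.log.append(request)  # record the request before processing it
        self._apply(request)

    def take_snapshot(self):
        self.snapshot = copy.deepcopy(self.state)
        self.snapshot_len = len(self.log)

    def recover(self):
        """Restore the last snapshot, then replay the requests logged after it."""
        self.state = copy.deepcopy(self.snapshot)
        for request in self.log[self.snapshot_len:]:
            self._apply(request)


svc = RecoverableService()
svc.handle(1)
svc.handle(2)
svc.take_snapshot()
svc.handle(3)
expected = copy.deepcopy(svc.state)

svc.state = {"total": -999}  # simulate a failure that loses the in-memory state
svc.recover()
```

After `recover()`, `svc.state` equals `expected` again. In a federated LF setting, tagged messages would presumably take the place of the integer requests, and the deterministic semantics of reactors is what would make the replay step well defined.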
-
I am starting this discussion to collect some foundational background in concurrency, distributed computing, and time, as it relates to LF. I think this discussion could be useful as we start to seriously work on fault tolerance and mutations. Feel free to edit this directly.
Clock Synchronization and Mutations
Brewer has an interesting discussion of the role of TrueTime, the clock synchronization mechanism in Google Spanner. He says that Spanner chooses consistency over availability despite its use of something like our decentralized coordination, which chooses the opposite. This seems to be because it overlays a Paxos and two-phase commit protocol. He cites several key roles that TrueTime plays, but one important one that we have not yet exploited is the ability to create snapshots from which one can recover from failures.
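To make the snapshot role concrete, here is a minimal Python sketch of timestamp-based snapshot reads, assuming every write carries a timestamp taken from sufficiently synchronized clocks (in LF, a tag could plausibly play this role). It is an illustration of the general multi-version idea only, not Spanner's or TrueTime's implementation; `VersionedStore` and its methods are hypothetical names.

```python
import bisect


class VersionedStore:
    """Keeps every timestamped version of each key."""

    def __init__(self):
        # key -> (sorted list of timestamps, list of values in the same order)
        self._versions = {}

    def write(self, key, timestamp, value):
        times, values = self._versions.setdefault(key, ([], []))
        i = bisect.bisect_right(times, timestamp)
        times.insert(i, timestamp)
        values.insert(i, value)

    def read_at(self, key, timestamp):
        """Return the latest value written at or before `timestamp`, or None."""
        times, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(times, timestamp)
        return values[i - 1] if i > 0 else None

    def snapshot_at(self, timestamp):
        """Return a consistent view of all keys as of `timestamp`."""
        view = {}
        for key in self._versions:
            value = self.read_at(key, timestamp)
            if value is not None:
                view[key] = value
        return view


store = VersionedStore()
store.write("x", 10, "a")
store.write("y", 15, "b")
store.write("x", 20, "c")
```

Here `store.snapshot_at(17)` sees the write to `x` at time 10 and the write to `y` at time 15, but not the later write to `x` at time 20. Because the cut is defined purely by a timestamp, different nodes can produce pieces of the same snapshot without coordinating at snapshot time, which is what makes such snapshots usable as recovery points.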
Mutations have a potentially interesting relationship to schema changes in databases.
These can be facilitated using clock synchronization.
Brewer states about the clock synchronization in Spanner:
Essential Reading
The following papers are particularly useful background reading: