Runtime Dynamism in Federated Execution: Transient Federates and Hot Swap Mechanism #2212
Replies: 8 comments 1 reply
-
Lingua Franca PR: #2213
-
Frankly, I am not convinced by this proposal. It is unclear to me which actual problem this solves, while it seems to come at quite a significant cost. The title says that transient federates provide runtime dynamism. But what exactly does this mean? Which specific use-cases are enabled by this proposal? And more importantly, which additional problems do we introduce when we add this to the language? At which cost does this come? Your motivation mentions two use cases:
These are two very different use-cases with very different requirements. As I understand the proposal, it completely relies on stopping and restarting processes as the mechanism for joining and leaving a federation. This is at odds with my understanding of the two use-cases. In the scenario where components come and leave, I would expect them to execute as continuous processes. A car will continue to operate when leaving an intersection. So the car is transient from the perspective of the crossing. But the crossing is also transient from the point of view of the car. How is this addressed in your proposal?

Typically, hot swapping explicitly requires that processes are not shut down and are replaced seamlessly without a gap in service. I don't see how this can be achieved with the current proposal. In particular, transfer of state (both parameters/state variables and the local event queue) doesn't seem to be a consideration at all.

I might be missing something. So, I would like to ask you to be concrete about your envisioned use-case. What are the requirements? Why are our existing solutions not sufficient? How does your proposal address these requirements? What is not addressed by your proposal? And what are the drawbacks? So far, the proposal sounds nice on a superficial level, but it is unclear which actual problem it solves and how the proposed solution actually addresses the requirements of the motivational examples.

I would also like to understand how this relates to modal models. In particular, the academic use-case with the

The proposal does not discuss any safeguards regarding the behavior of transient federates. If there are none, this effectively means that we compile an LF program assuming a certain interface of the transient reactor. But then we can replace this reactor with an arbitrary implementation at runtime. This is problematic for two reasons. First, it opens a loophole in our semantics. Second, it is a severe security vulnerability. It is an open invitation to inject arbitrary code. This proposal appears to assume an educated, disciplined and benevolent user. But the reality is that such a loophole will be used to circumvent our semantics (knowingly or unknowingly) and to attack a running application.

Security can rarely be "added" after the fact. It either is a design consideration from the ground up or there is no security.

In summary, I see a huge cost without a clear benefit.
-
Indeed, the use cases have different requirements. It is on purpose. But at the core, they both would need the support of joining and leaving during the execution. The proposal is meant to enable such use cases. It is not, indeed, the complete solution.
There are, indeed, different ways to solve a problem. And the use case is far from being a complete one.
Here, the transient to hot swap will stop and then the new instance will start. There is no absolute seamlessness.
I agree that a concrete and complete example will help significantly.
In modal models, behaviors are set at the design time. Transients will enable evolving with a different behavior during execution.
I indeed overlooked adding a fixme in
Currently, the implemented security mechanism (by @hokeun's team) is supported in transients. The use of
As far as I know, authentication was added to LF after the core semantics were set and implemented. I am not aware that security was considered from the ground up, right? I know that this is a work in progress though.
-
I see this work as a much-needed first step towards having LF programs that can run reliably and usefully for months or years. I also see it as a first step towards fault tolerance, where a federate can fail, recover, and rejoin, or fail and be replaced. The key concept that this PR demonstrates is the development of agreement among affected federates about the logical time at which the joining federate joins. This is a natural extension of our startup mechanism, and it inherits (or can inherit... I don't think this is implemented) the same security (or lack of security) from the initial startup (which already has a nice authentication mechanism).

I think it would be a mistake to bury this work because it's an incomplete solution. We won't know what a complete solution looks like until we start building applications with a partial solution.

Notice that when a federate leaves because of failure (as opposed to resigning or being forced out in a hot swap), there is a fundamentally unavoidable source of possible inconsistencies, particularly with decentralized coordination. There is no way to ensure that all observers agree on the tag at which the federate left. I suspect we could prove this as a theorem. However, centralized coordination mitigates the risk because, assuming the RTI doesn't fail, all federates will agree on the tag of the last tagged message sent by a failed federate (though not on the last physical message, but this is probably OK with our semantics).

As for the use cases, I would go so far as to say that nearly every distributed application is a potential use case. If you talk to distributed systems people, they put most of their effort into dealing with transient participants in their applications. I would like to see us develop, for example, built-in support for quorum-based agreement, a relaxed (but disciplined) form of consistency that enhances availability.

For a canonical use case, I suggest a chat application. This is easy to build, obviously needs transient participants, and the guarantee we can provide is that all observers see chat messages in the same order. This becomes particularly interesting if you have separate but overlapping chat rooms. Any two observers that are in overlapping chat rooms will see messages in the same order even across chat rooms. This can help establish and enforce causality chains. Our first CAL Theorem paper has such an example, though without the multiple chat rooms, and hence much simpler. A particularly interesting challenge would be to create a decentralized version where there is no single persistent federate. This is not addressed in this PR, but it provides a good starting point.

As a side note, I think that supporting transient federates with decentralized coordination will be easier than with centralized. This PR addresses the harder of the two problems. Once we extend this to support decentralized coordination, the next natural step would be a fault tolerant RTI. The RTI currently plays no role during execution in decentralized coordination, but with transient federates, it again has a role. This would be a perfect opportunity to realize a leader election scheme for restoring a failed RTI. In this case, one interesting twist is that if we find ourselves with a partitioned network, we probably do want two RTIs, unlike the leader-election test case currently in the playground. An interesting question then becomes how to handle repair of the partitioned network.

I also agree with @ChadliaJerad that restoring state during a hot swap should be handled by the application, not by the framework, at least in the near term. However, long term, providing mechanisms for creating snapshots of state would be extremely useful. In all our current targets, none of which use languages with built-in persistent state (like Java), this would have to be done by providing a way for application developers to provide a serialize and deserialize function for each reactor. This is clearly out-of-scope for this PR, but it would make a great project.
-
I think there is a big distinction between "bury this work" and "publish about this work but do not mainline the feature because it is experimental." Just because work exists and is interesting (which I agree it is) does not mean that it is ready to be merged; not merging also doesn't imply that we're burying it. If we can't agree about this, then we need to have another hard look at the RFC track we're outlining and re-evaluate our willingness to commit to it, because that effort is explicitly meant to offer a clear process to navigate discussions like the one we're having in this thread. I think that it would be a mistake to ignore that process and forge ahead like we used to, with the criterion that if something is interesting then it must be worth merging.

I also want to emphasize that in the open-source community, there is absolutely no shame in contributions not being mainlined. In fact, the default practice is to fork, and the number of forks of a project is actually seen as an important measure of its success. Usually, external contributors are driven by a specific need for a feature that they choose to develop of their own accord, without any sort of approval or the expectation that it will get merged upstream. If they want to go through the effort of proposing such a merge, and if it finally does get merged, then that's great. But if a merge does not happen, it just means that maintenance of the feature falls to the feature developer rather than the mainline maintainers. I don't think it's helpful to use negatively charged language to describe the latter situation. If anything, we need to be realistic and responsible when deciding what maintenance burden we're willing and able to take on, and this will be critical to the survival of the project.
-
I agree. My main criticism of the proposal, as it is, is that it does not clearly state the problem that it solves. Instead, it sketches problems that we are not even close to solving. I would even go as far as saying that the stated problems are unrelated to the proposed solution. If we had an actual hot-swapping mechanism (as Erlang implements it, for instance), then rejoining wouldn't be a problem, as even the RTI wouldn't need to notice that the implementation changed. And in the intersection example, it is a requirement (not a design decision) that both the vehicles and the intersection operate independently and continuously. In this scenario, both parties have already started when they meet, and they already have events in the queue. This setting is very different from the one considered in the proposal.

We need to be conscious about the expectations that we set, both internally and externally. If we tell people that LF supports hot-swapping, then we better have a solution that lives up to the user expectations. Otherwise, we will have frustrated users. And internally, I strongly believe that we should evaluate design proposals based on whether the stated problem is relevant, whether the design effectively addresses the problem, and at which costs it does so. The focus should be on the actual problem solved, not on shiny problems it might solve sometime in the future. Otherwise, we risk being deluded by the promises, and less perceptive to understanding the costs. So if I ask myself if the proposal meets the expectation that it sets, then the answer is clearly no.

This is not to say that there is no value in the proposal. I have my concerns about the concrete integration into the language, but the conceptual considerations on what it means for a federate to (re)join are certainly relevant.

What I would like to see is a proposal that clearly states the problem (if possible and applicable based on a use-case), that identifies requirements, that describes the design and how it addresses the problem, and that openly discusses drawbacks and costs. I also think that an RFC would be the appropriate format for this. Having such an RFC, we could have a more focused discussion on the relevance of the problem to LF, the effectiveness of the solution, and the involved costs. This would then hopefully prepare us to decide if we want to integrate the proposed solution in LF mainline or not.
-
I mentioned this several times before, but I'll mention it again: Lee and Seshia's definition of refinement is not sufficient to guarantee that a component is compatible with a federation. If a substitution happens with a component that has the same ports but a different internal structure, then this can easily lead to deadlock. I suppose this can be sidestepped by limiting the scope of the mechanism such that it is explicitly disallowed to insert new implementations of components at runtime, but that is not what the title and description of this discussion suggest.

Another (more practical) concern about the feasibility of providing "bug fixes and upgrades" while a federation is running is that I don't think we actually have a way to independently compile federates currently. This is mostly because the code that each federate compiles down to is very federation-specific. Therefore, I have to agree with @cmnrd that we're quite a ways away from offering robust "hot swap" capability, which is a deep topic in its own right, with a number of unaddressed technical challenges, some of which are highly specific to our current implementation.

For that reason, I encourage us to narrow down the scope of the discussion, focus on the semantics of "transience," and resist the temptation to consider the contributions in #2213 and lf-lang/reactor-c#358 as potential enablers of lofty goals and complex functionality that up to this point is merely speculative. As far as my understanding goes, a transient federate is a federate that:
If we can focus on that functionality and forget about everything else, I think we'll have a much better shot at reaching a common understanding.
-
I have been thinking about executing mixed criticality systems using federated LF. My conclusion thus far is that different criticality levels should be coordinated decentrally. We must accept that federates crash and that they are restarted and re-join the federation without any issue. I think we could start by answering Marten's question for decentralized coordination:
If we can agree on (1) and (2), we could quite quickly prototype fault tolerance/dynamism based on decentralized coordination.
-
This discussion documents the support of transient federates in the LF runtime. It is a continuation of the discussion in #1504.
Motivation
A federation, that is, a distributed Lingua Franca program, starts executing only when all federates have joined and agreed on the start time. Once a federate leaves the federation, re-joining is prohibited. In this process, the RTI (Run Time Infrastructure) plays a key role in coordinating interactions among federates.
This addition aims to give federates the ability to join and leave at runtime; such federates are termed transient federates. Their inclusion broadens the spectrum of LF-describable programs, particularly those that require dynamic behavior. For instance, envision a road intersection light management system where vehicles and pedestrians come and leave at arbitrary times.
Moreover, transients facilitate the hot swap mechanism, enabling the replacement of a federate without necessitating its manual shutdown, or the shutdown of the entire federation (interrupting its operation). Hot swapping allows maintenance, upgrades, and repairs to be performed on the fly, minimizing downtime and maximizing system availability.
A Sketch of Transient Federates
Federates fall into two types: persistent and transient. Persistent federates must be present for the federation to start and last until its end. In other terms, their execution lifetime equals the federation's execution lifetime. They can only be present once in a federation's lifetime.
Transient federates, however, can join and leave anytime during the federation's execution lifetime. They are not required to be present for the federation to start.
The federation has three phases: startup, execution, and shutdown. Persistent federates join at startup and leave at shutdown. Transients can join and leave multiple times during the startup or the execution phases.
When a transient is absent, messages sent to it are dropped by the RTI. Downstream federates of a transient only receive messages when it is present.
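The forwarding rule described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRTI`, `join`, `leave`, and `forward` are hypothetical names invented for illustration and do not correspond to the actual RTI code in reactor-c.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRTI:
    """Toy stand-in for the RTI's forwarding decision (hypothetical, not reactor-c)."""
    present: set = field(default_factory=set)      # ids of federates currently connected
    delivered: list = field(default_factory=list)  # (destination, payload) pairs forwarded
    dropped: list = field(default_factory=list)    # (destination, payload) pairs discarded

    def join(self, fed_id):
        self.present.add(fed_id)

    def leave(self, fed_id):
        self.present.discard(fed_id)

    def forward(self, dest, payload):
        """Forward a message if the destination is present, otherwise drop it."""
        if dest in self.present:
            self.delivered.append((dest, payload))
            return True
        self.dropped.append((dest, payload))
        return False


rti = ToyRTI()
rti.join("up")
rti.forward("mid", "m1")  # mid has not joined yet: dropped
rti.join("mid")
rti.forward("mid", "m2")  # mid is present: delivered
rti.leave("mid")
rti.forward("mid", "m3")  # mid has left again: dropped
```

In this model, only the message sent while the transient is present reaches it; nothing is buffered for a later join.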
Challenges to Support Transient Federates (Centralized Coordination)
Axioms to Observe
The following subset of the axioms that govern Lingua Franca semantics was identified as relevant to the implementation of transients:
Identified Challenges
Studying how to support transient federates led to the identification of four challenges:
The LF program below will serve as an example and summarize the aforementioned challenges.
Overview of the Solution Implementation
How to Issue a (P)TAG when a Transient is Absent?
In accordance with Axiom (A4), the issuance of (P)TAGs exclusively impacts the downstream federates of a transient, as they are required to progress their logical time even in the transient's absence. Considering that (P)TAGs can be set far into the future, it is preferable to issue them at their intended time (based on Axiom (A2)), to prevent transients from experiencing prolonged wait times before starting execution.
Consistent with Axiom (A3), the RTI will issue delayed (P)TAGs for the downstream federates of an absent transient. Put differently, notifications for TAGs and PTAGs are postponed if a federate has at least one upstream transient that is absent. A dedicated thread will manage this delay.
A corner case is when a federate has all its upstream federates as transients. If all transients are absent, the TAG will default to NET (Next Event Tag).
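One possible reading of the two rules above is sketched below. It is a simplified model, not the RTI implementation: the `Upstream` record and both predicate names are hypothetical, and the sketch makes no claim about how the dedicated thread computes the delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    """Hypothetical description of one upstream federate as seen by the RTI."""
    name: str
    transient: bool
    present: bool


def grant_is_delayed(upstreams):
    """(P)TAG notification is postponed if at least one upstream transient is absent."""
    return any(u.transient and not u.present for u in upstreams)


def tag_defaults_to_net(upstreams):
    """Corner case: every upstream federate is a transient and all of them are absent."""
    return bool(upstreams) and all(u.transient and not u.present for u in upstreams)


mixed = [Upstream("sensor", transient=False, present=True),
         Upstream("mid", transient=True, present=False)]
only_transients = [Upstream("a", transient=True, present=False),
                   Upstream("b", transient=True, present=False)]
all_present = [Upstream("mid", transient=True, present=True)]
```

Here `mixed` triggers a delayed grant without the NET default, `only_transients` falls into the corner case, and `all_present` is handled as in a federation without transients.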
Effective Start Tag of a Joining Transient
For a federation to start, all persistent federates need to agree on the start_time. A transient federate can join at any time instant join_time that is greater than or equal to start_time. Since the RTI may have already issued (P)TAGs based on the topology in which the transient was absent, those grants must not be compromised. We therefore need to derive the effective_start_tag of the joining transient.

Since logical time is chasing physical time (Axiom (A1)) and clocks are synchronized (Axiom (A2)), we only need to check the (P)TAG of the transient's downstream federates. Consequently, effective_start_tag is computed as follows:

effective_start_tag(transient) = max(join_time(transient), (P)TAG(downstream(transient)) + 1 microstep)
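The computation can be sketched as follows, under some assumptions that the text does not spell out: a tag is modeled as a (time, microstep) pair compared lexicographically, join_time is lifted to the tag (join_time, 0), and when there are several downstream federates the maximum over all their grants is taken. The names `Tag` and `plus_one_microstep` are hypothetical helpers, not reactor-c identifiers.

```python
from typing import Iterable, NamedTuple


class Tag(NamedTuple):
    """A logical tag; tuples compare lexicographically (time first, then microstep)."""
    time: int        # logical time, e.g. in nanoseconds
    microstep: int


def plus_one_microstep(tag: Tag) -> Tag:
    return Tag(tag.time, tag.microstep + 1)


def effective_start_tag(join_time: int, downstream_grants: Iterable[Tag]) -> Tag:
    """Latest of the join time and every downstream (P)TAG advanced by one microstep.

    Assumption: join_time is treated as the tag (join_time, 0).
    """
    candidates = [Tag(join_time, 0)]
    candidates.extend(plus_one_microstep(g) for g in downstream_grants)
    return max(candidates)


# The grant already issued downstream is older than the join time: start at join time.
early = effective_start_tag(100, [Tag(50, 0)])
# A downstream federate was already granted (200, 3): start one microstep after it.
late = effective_start_tag(100, [Tag(200, 3), Tag(150, 0)])
```

Taking the maximum ensures that the joining transient never starts at a tag that a downstream federate has already been allowed to advance to.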
It naturally follows that the start_time of persistent federates is equal to their effective_start_time, while transient federates have their effective_start_time greater than or equal to their start_time.

How will a Timer Execute in a Transient Federate?
Let's consider the federated LF program below, where mid is a transient federate.

An idealized representation of the executions of the mid and up federates, with mid operating as a persistent one, resembles the upper diagram below. If, however, we consider the scenario where mid leaves and then joins the federation, when should the timer start? (lower diagram):

Concretely, which of the following proposals should be adopted regarding mid timer alignment?

The discussion at the LF meeting of 08-23-2023 led to the following consensus:
How to Identify Cycles when a Federation Includes Transient Federates?
The recent RTI refactoring introduced procedures to detect cycles within federations, aimed at minimizing the issuance of unnecessary PTAG and Absent messages.
Two scenarios arise:
TODO: Discuss both proposals.
Note: Currently, the tests do not include multi-level transients.
Hot Swap Mechanism
Hot swap of transient federates is supported. When the RTI receives a connection request from a transient federate (termed new_fed) with the same ID as a currently executing one (termed old_fed), the RTI sends a stop request to old_fed. old_fed will force every contained enclave to stop at its current_tag plus 1 microstep. Upon termination, old_fed sends a RESIGN message to the RTI. Subsequently, the RTI will create the communication thread with new_fed and proceed with the effective start tag computation.

The following constraints are observed:
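As an illustration of the hot swap handshake described in this section (stop request, stop at the current tag plus one microstep, RESIGN, then connection of the replacement), here is a toy model. All class and method names are hypothetical, message passing is replaced by direct method calls, and the effective start tag computation is left out.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ToyFederate:
    """Hypothetical federate model; tags are (time, microstep) pairs."""
    fed_id: int
    current_tag: Tuple[int, int]
    running: bool = True
    stop_tag: Optional[Tuple[int, int]] = None

    def handle_stop_request(self) -> str:
        """Stop one microstep after the current tag, then resign."""
        time, microstep = self.current_tag
        self.stop_tag = (time, microstep + 1)
        self.running = False
        return "RESIGN"


@dataclass
class ToySwapRTI:
    """Hypothetical RTI model that tracks one active federate per ID."""
    active: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def connect(self, new_fed: ToyFederate) -> None:
        old_fed = self.active.get(new_fed.fed_id)
        if old_fed is not None and old_fed.running:
            # Same ID as a running federate: ask the old instance to stop first.
            self.log.append("stop_request")
            self.log.append(old_fed.handle_stop_request())
        # Only now is the new instance connected (start tag computation omitted).
        self.active[new_fed.fed_id] = new_fed
        self.log.append("connected")


swap_rti = ToySwapRTI()
old = ToyFederate(fed_id=7, current_tag=(1000, 2))
swap_rti.connect(old)
new = ToyFederate(fed_id=7, current_tag=(0, 0))
swap_rti.connect(new)  # triggers the swap: old stops at (1000, 3) and resigns
```

The ordering in `connect` reflects the text: the replacement is connected only after the old instance has resigned.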
Testing Transient Federates
Writing test programs for transients is challenging because federates are meant to join and leave at runtime. To this end, the following convenience functions were added:
lf_stop(): causes the federate to stop its execution. Every enclave within the federate will stop at one microstep later than its current tag. Unlike lf_request_stop(), this process does not require any involvement from the RTI, nor does it necessitate any consensus.
lf_get_federates_bin_directory(): gets the directory containing the executables of the individual federates.
lf_get_federation_id(): returns the federation id. This function is useful for creating federates at runtime when testing.
lf_get_effective_start_time(): returns the effective start time of the federate.
lf_get_start_time(): returns the start time of the federate.

Except for lf_stop(), these functions are in general not safe to expose and use inside reactions. They also cause an "implicit declaration" warning, which can be avoided by adding the function prototypes to the preamble. This is rather a hack.

TODO: Discuss a better way of testing transients, or a better way of exposing these functions.
Future Directions