Runtime Dynamism in Federated Execution: Transient Federates and Hot Swap Mechanism #2212

ChadliaJerad · 2024-02-20T21:25:30Z

ChadliaJerad
Feb 20, 2024
Collaborator

This discussion documents the support of transient federates in the LF runtime. It is a continuation of the discussion in #1504.

Motivation

A federation, that is a distributed Lingua Franca program, starts executing only when all federates have joined and agreed on the start time. Once a federate leaves the federation, re-joining is prohibited. In this process, the RTI (Run Time Infrastructure) plays a key role in coordinating interactions among federates.

This addition aims to give federates the ability to join and leave at runtime; such federates are termed transient federates. Their inclusion broadens the spectrum of LF-describable programs, particularly those necessitating dynamic behavior. For instance, envision a road intersection light management system where vehicles and pedestrians come and leave at arbitrary times.

Moreover, transients facilitate the hot swap mechanism, enabling the replacement of a federate without necessitating its manual shutdown, or the shutdown of the entire federation (interrupting its operation). Hot swapping allows maintenance, upgrades, and repairs to be performed on the fly, minimizing downtime and maximizing system availability.

A Sketch of Transient Federates

Federates fall into two types: persistent and transient. Persistent federates must be present for the federation to start and last until its end. In other terms, their execution lifetime equals the federation's execution lifetime. They can only be present once in a federation's lifetime.

Transient federates, however, can join and leave anytime during the federation's execution lifetime. They are not required to be present for the federation to start.

The federation has three phases: startup, execution, and shutdown. Persistent federates join at startup and leave at shutdown. Transients can join and leave multiple times during the startup or the execution phases.

When a transient is absent, messages sent to it are dropped by the RTI. Downstream federates of a transient only receive messages when it is present.
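The drop rule above can be sketched in a few lines of C. This is only an illustration, not the RTI's code: `fed_info_t`, `msg_action_t`, and `route_message` are hypothetical names, and persistent federates are assumed to always be present during execution.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of what the RTI tracks per federate. */
typedef struct {
    bool is_transient;  /* declared with @transient in the LF program */
    bool is_connected;  /* currently present in the federation */
} fed_info_t;

typedef enum { MSG_FORWARD, MSG_DROP } msg_action_t;

/* Decide what to do with a message addressed to `dest`.
 * Messages to an absent transient are dropped; everything else is forwarded. */
msg_action_t route_message(const fed_info_t *dest) {
    assert(dest != NULL);
    if (dest->is_transient && !dest->is_connected) {
        return MSG_DROP;
    }
    return MSG_FORWARD;
}
```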

Challenges to Support Transient Federates (Centralized Coordination)

Axioms to Observe

The following axioms are the subset of those governing Lingua Franca semantics that is relevant to the implementation of transients:

(A1) Logical time is chasing physical time
(A2) Federates synchronize their clocks with the RTI
Per observer:
- The RTI is the observer, at any physical time instant t:
  - (A3) The RTI knows the most recent TAG and PTAG of federates before the federates themselves
  - (A4) ∀ fed’ ∈ Set_of_downstream(fed) ⇒ TAG(fed’) <= TAG(fed)
- The federate is the observer, at any physical time instant t:
  - (A5) The federate knows its most recent CurrentTag, LTC, NET, before the RTI

Identified Challenges

The study of the support of transient federates led to the identification of these 4 challenges:

1. How to compute a (Provisional) Tag Advance Grant when a transient federate is absent?
2. When a transient federate joins, how to decide about its effective start tag, without compromising Lingua Franca semantics?
3. How to interpret the offset of the transient timer when it joins at the execution phase?
4. How to identify cycles when a federation includes transient federates?

The LF program below will serve as an example and summarize the aforementioned challenges.

Overview of the Solution Implementation

How to Issue a (P)TAG when a Transient is Absent?

In accordance with Axiom (A4), the issuance of (P)TAGs exclusively impacts the downstream federates of a transient, as they are required to progress their logical time even in the transient's absence. Considering that (P)TAGs can be set far into the future, it is preferable to issue them at their intended time (based on Axiom (A2)), to prevent transients from experiencing prolonged wait times before starting execution.

Consistent with Axiom (A3), the RTI will issue delayed (P)TAGs for the downstream federates of an absent transient. Put differently, notifications for TAGs and PTAGs are postponed if a federate has at least one upstream transient that is absent. A dedicated thread will manage this delay.

A corner case is when a federate has all its upstream federates as transients. If all transients are absent, the TAG will default to NET (Next Event Tag).
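A rough C sketch of the delay rule and of the corner case, under my reading that "intended time" means the physical time at which the grant's time value is reached. All names (`upstream_t`, `grant_time_ns`, `grant_delay_ns`) are hypothetical, and only time values in nanoseconds are modeled, not microsteps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-upstream bookkeeping for one downstream federate. */
typedef struct {
    bool is_transient;
    bool is_connected;
} upstream_t;

static bool is_absent_transient(const upstream_t *u) {
    return u->is_transient && !u->is_connected;
}

/* True if at least one upstream federate is a transient that is absent. */
bool any_upstream_transient_absent(const upstream_t *ups, size_t n) {
    assert(ups != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_absent_transient(&ups[i])) return true;
    }
    return false;
}

/* True if there is at least one upstream and all of them are absent transients. */
bool only_absent_transients_upstream(const upstream_t *ups, size_t n) {
    assert(ups != NULL || n == 0);
    if (n == 0) return false;
    for (size_t i = 0; i < n; i++) {
        if (!is_absent_transient(&ups[i])) return false;
    }
    return true;
}

/* Corner case: with only absent transients upstream, the grant defaults to NET. */
int64_t grant_time_ns(const upstream_t *ups, size_t n,
                      int64_t computed_grant_ns, int64_t net_ns) {
    return only_absent_transients_upstream(ups, n) ? net_ns : computed_grant_ns;
}

/* How long to hold the grant back: zero unless an upstream transient is absent,
 * otherwise until physical time reaches the grant's time value. */
int64_t grant_delay_ns(const upstream_t *ups, size_t n,
                       int64_t grant_ns, int64_t physical_now_ns) {
    if (!any_upstream_transient_absent(ups, n)) return 0;
    return grant_ns > physical_now_ns ? grant_ns - physical_now_ns : 0;
}
```

In this sketch the dedicated thread mentioned above would simply sleep for `grant_delay_ns(...)` before sending the grant, unless the grant is canceled in the meantime.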

Effective Start Tag of a Joining Transient

For a federation to start, all persistent federates need to agree on the start_time. A transient federate can join at any time instant join_time that is greater than or equal to start_time. Since the RTI may have already issued (P)TAGs based on the topology in which the transient was absent, those grants must not be compromised. We therefore need to derive the effective_start_tag of the joining transient.

Since logical time is chasing physical time (Axiom (A1)) and clocks are synchronized (Axiom (A2)), we only need to check the (P)TAG of the transient's downstream federates. Consequently:

The effective_start_tag is computed as follows: effective_start_tag(transient) = max(join_time(transient), (P)TAG(downstream(transient)) + 1 microstep)
Additionally, all pending (P)TAGs of the downstream federates will be canceled.
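The computation can be sketched in C as follows. The tag type and function names are hypothetical and defined only for this example. Taking the maximum over all downstream federates when there are several is my assumption, since the formula does not spell that case out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag type for this sketch: a time value plus a microstep. */
typedef struct {
    int64_t time;       /* logical time in nanoseconds */
    uint32_t microstep; /* index within the same time value */
} sketch_tag_t;

/* Returns -1, 0, or 1 when a is earlier than, equal to, or later than b. */
static int compare_tags(sketch_tag_t a, sketch_tag_t b) {
    if (a.time != b.time) return a.time < b.time ? -1 : 1;
    if (a.microstep != b.microstep) return a.microstep < b.microstep ? -1 : 1;
    return 0;
}

/* max(join_time, grant + 1 microstep), over the most recent (P)TAG of each
 * downstream federate. Microstep overflow is ignored in this sketch. */
sketch_tag_t effective_start_tag(int64_t join_time,
                                 const sketch_tag_t *downstream_grants,
                                 size_t n) {
    assert(downstream_grants != NULL || n == 0);
    sketch_tag_t result = { join_time, 0 };
    for (size_t i = 0; i < n; i++) {
        sketch_tag_t candidate = downstream_grants[i];
        candidate.microstep += 1; /* one microstep after the issued grant */
        if (compare_tags(candidate, result) > 0) {
            result = candidate;
        }
    }
    return result;
}
```

For example, joining at time 100 when a downstream federate already holds a grant at (100, 0) yields an effective start tag of (100, 1), so the grant that was already issued stays valid.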

It naturally follows that the effective_start_time of a persistent federate is equal to its start_time, while the effective_start_time of a transient federate is greater than or equal to its start_time.

How will a Timer Execute in a Transient Federate?

Let's consider the federated LF program below, where mid is a transient federate.

federated reactor {
  up = new Up()
  @transient
  mid = new Middle()
  up.out -> mid.in
}

An idealized representation of the executions of the mid and up federates, with mid operating as a persistent one, resembles the upper diagram below. If, however, we consider the scenario where mid leaves and then rejoins the federation, when should the timer start? (lower diagram):

Concretely, which of the following proposals should be adopted, regarding mid timer alignment?

The discussion at the LF meeting of 08-23-2023 led to the following consensus:

Default: proposal 2, that is without alignment. This is in accordance with the way timers execute in modal models.
Alignment can be achieved using logical actions and it is possible to provide a convenience API to this end.
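To make the difference concrete, here is a small C sketch of the first firing time of a periodic timer in a joining transient. It rests on an assumption about what "alignment" means, since the proposals were only given as diagrams: aligned firings stay on the grid start_time + offset + k * period, whereas unaligned firings restart from the effective start time. The function names are hypothetical, and a convenience API, if one is added, may look quite different.

```c
#include <assert.h>
#include <stdint.h>

/* Without alignment (the default): the timer restarts relative to the
 * transient's effective start time. */
int64_t first_firing_unaligned(int64_t effective_start_ns, int64_t offset_ns) {
    return effective_start_ns + offset_ns;
}

/* With alignment (assumed meaning): the first firing is the earliest point of
 * the grid start + offset + k * period that is not before the effective start. */
int64_t first_firing_aligned(int64_t start_ns, int64_t effective_start_ns,
                             int64_t offset_ns, int64_t period_ns) {
    assert(period_ns > 0 && offset_ns >= 0 && effective_start_ns >= start_ns);
    int64_t first = start_ns + offset_ns;
    if (effective_start_ns <= first) {
        return first;
    }
    int64_t missed = effective_start_ns - first;
    int64_t k = (missed + period_ns - 1) / period_ns; /* ceiling division */
    return first + k * period_ns;
}
```

With start 0, offset 10, period 100, and an effective start of 250, the unaligned timer first fires at 260 and the aligned one at 310.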

How to Identify Cycles when a Federation Includes Transient Federates?

The recent RTI refactoring introduced procedures to detect cycles within federations, aimed at minimizing the issuance of unnecessary PTAG and Absent messages.

Two scenarios arise:

If the transient is not part of a cycle, we are fine. Additionally, the system may permit multi-level transients, wherein a transient can have another transient as a downstream federate.
If, however, the transient is involved in a cycle, two sub-cases are identified:
- If the transient has only persistent federates as downstream or upstream federates, then it is possible for the RTI to derive the existence of the cycle during the startup phase, that is even before the transient joins.
- Conversely, if multi-level transients are in a cycle, then the RTI will not be able to derive the existence of the cycle during the startup phase. Two potential solutions are:
  1. Implement a validation rule that identifies federates with multi-level transients in a cycle as erroneous.
  2. Opt not to support multi-level transients at all, although this approach may be overly restrictive.

TODO: Discuss both proposals.
Note: Currently, the tests do not include multi-level transients.
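As input for discussing the first proposal, here is a minimal C sketch of such a validation rule. It flags a topology when a connection between two transients lies on a cycle. The `topology_t` structure, the fixed `MAX_FEDS` bound, and the function names are hypothetical. A real check would more likely live in the LF validator than in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FEDS 16 /* arbitrary bound for this sketch */

typedef struct {
    size_t n;                      /* number of federates */
    bool is_transient[MAX_FEDS];
    bool edge[MAX_FEDS][MAX_FEDS]; /* edge[u][v]: u has v as downstream */
} topology_t;

/* Depth-first search: can `to` be reached from `from`? */
static bool reaches(const topology_t *t, size_t from, size_t to, bool *visited) {
    if (from == to) return true;
    visited[from] = true;
    for (size_t v = 0; v < t->n; v++) {
        if (t->edge[from][v] && !visited[v] && reaches(t, v, to, visited)) {
            return true;
        }
    }
    return false;
}

/* True if some transient-to-transient connection is part of a cycle. */
bool has_multilevel_transient_cycle(const topology_t *t) {
    assert(t != NULL && t->n <= MAX_FEDS);
    for (size_t u = 0; u < t->n; u++) {
        for (size_t v = 0; v < t->n; v++) {
            if (t->edge[u][v] && t->is_transient[u] && t->is_transient[v]) {
                bool visited[MAX_FEDS] = { false };
                /* The edge u -> v is on a cycle iff v can reach u. */
                if (reaches(t, v, u, visited)) return true;
            }
        }
    }
    return false;
}
```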

Hot Swap Mechanism

Hot swapping of transient federates is supported. When the RTI receives a connection request from a transient federate (termed new_fed) with the same ID as a currently executing one (termed old_fed), the RTI sends a stop request to old_fed. old_fed will force every contained enclave to stop at its current_tag plus 1 microstep. Upon termination, old_fed sends a RESIGN message to the RTI. Subsequently, the RTI will create the communication thread with new_fed and proceed with the effective start tag computation.

The following constraints are observed:

Hot swap operates only over transient federates.
Only one Hot Swap operation is permitted at a time.
Hot swapping is allowed only at the execution phase. In other terms, if a transient federate has connected during the startup phase, it cannot be hot-swapped.
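The constraints can be summarized as a decision function on an incoming connection request. This is a sketch with hypothetical names (`fed_slot_t`, `decide_connection`), not the code in rti_remote.c. It reads the third constraint as a pure phase check, and rejecting all requests during shutdown is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { PHASE_STARTUP, PHASE_EXECUTION, PHASE_SHUTDOWN } phase_t;
typedef enum { CONN_ACCEPT, CONN_HOT_SWAP, CONN_REJECT } conn_decision_t;

/* Hypothetical RTI-side record for the federate ID named in the request. */
typedef struct {
    bool is_transient;
    bool is_connected; /* a federate with this ID is currently executing */
} fed_slot_t;

conn_decision_t decide_connection(const fed_slot_t *slot, phase_t phase,
                                  bool hot_swap_in_progress) {
    assert(slot != NULL);
    if (phase == PHASE_SHUTDOWN) {
        return CONN_REJECT; /* assumption: nobody joins while shutting down */
    }
    if (!slot->is_connected) {
        /* Persistent federates join at startup only; transients at any time. */
        if (slot->is_transient || phase == PHASE_STARTUP) {
            return CONN_ACCEPT;
        }
        return CONN_REJECT;
    }
    /* Same ID already executing: only a transient, only during execution,
     * and only one hot swap at a time. */
    if (slot->is_transient && phase == PHASE_EXECUTION && !hot_swap_in_progress) {
        return CONN_HOT_SWAP;
    }
    return CONN_REJECT;
}
```

On `CONN_HOT_SWAP`, the RTI would send the stop request to old_fed, wait for its RESIGN message, and only then start the communication thread for new_fed.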

Testing Transient Federates

Writing test programs for transients is challenging because federates are meant to join and leave at runtime. To this end, the following convenience functions were added:

lf_stop(): Causes the federate to stop its execution. Every enclave within the federate will stop at one microstep later than its current tag.
Unlike lf_request_stop(), this process does not require any involvement from the RTI, nor does it necessitate any consensus.
lf_get_federates_bin_directory(): gets the directory containing the executables of the individual federates.
lf_get_federation_id(): returns the federation id. This function is useful for creating federates at runtime when testing.
lf_get_effective_start_time(): returns the effective start time of the federate.
lf_get_start_time(): returns the start time of the federate.

Except for lf_stop(), the remaining functions are not safe to be exposed and used inside the reactions in general.
They also cause a warning of implicit declaration, which can be avoided by adding the function prototypes to the preamble. This is rather a hack.
TODO: Discuss a better way of testing transients, or a better way of exposing these functions.

Future Directions

Consider the case of decentralized coordination
Formally prove that the solution satisfies the axioms
Add security

ChadliaJerad · 2024-02-20T21:59:11Z

ChadliaJerad
Feb 20, 2024
Collaborator Author

Lingua Franca PR: #2213
Reactor-C PR: lf-lang/reactor-c#358


cmnrd · 2024-02-22T09:07:09Z

cmnrd
Feb 22, 2024
Maintainer

Frankly, I am not convinced by this proposal. It is unclear to me which actual problem this solves, while it seems to come at quite a significant cost.

The title says that transient federates provide runtime dynamism. But what exactly does this mean? Which specific use-cases are enabled by this proposal? And more importantly, which additional problems do we introduce when we add this to the language? At which cost does this come?

Your motivation mentions two use cases:

For instance, envision a road intersection light management system where vehicles and pedestrians come and leave at arbitrary times.

Moreover, transients facilitate the hot swap mechanism, enabling the replacement of a federate without necessitating its manual shutdown, or the shutdown of the entire federation (interrupting its operation). Hot swapping allows maintenance, upgrades, and repairs to be performed on the fly, minimizing downtime and maximizing system availability.

These are two very different use-cases with very different requirements. As I understand the proposal, it completely relies on stopping and restarting processes as the mechanism for joining and leaving a federation. This is at odds with my understanding of the two use-cases. In the scenario where components come and leave, I would expect them to execute as continuous processes. A car will continue to operate when leaving an intersection. So the car is transient from the perspective of the crossing. But the crossing is also transient from the point of view of the car. How is this addressed in your proposal?

Typically, hot swapping explicitly requires that processes are not shut down and are replaced seamlessly without a gap in service. I don't see how this can be achieved with the current proposal. In particular, transfer of state (both parameters/state variables and the local event queue) doesn't seem to be a consideration at all.

I might be missing something. So, I would like to ask you to be concrete about your envisioned use-case. What are the requirements? Why are our existing solutions not sufficient? How does your proposal address these requirements? What is not addressed by your proposal? And what are the drawbacks? So far, the proposal sounds nice on a superficial level, but it is unclear which actual problem it solves and how the proposed solution actually addresses the requirements of the motivational examples.

I would also like to understand how this relates to modal models. In particular, the academic use-case with the Middle reactor seems like it could be implemented with modal models.

The proposal does not discuss any safeguards regarding the behavior of transient federates. If there are none, this effectively means that we compile an LF program assuming a certain interface of the transient reactor. But then we can replace this reactor with an arbitrary implementation at runtime. This is problematic for two reasons. First, it opens a loophole in our semantics. Second, it is a severe security vulnerability. It is an open invitation to inject arbitrary code. This proposal appears to assume an educated, disciplined and benevolent user. But the reality is, that such a loophole will be used to circumvent our semantics (knowingly or unknowingly) and to attack a running application.

Future Directions

Add security

Security can rarely be "added" after the fact. It either is a design consideration from the ground up or there is no security.

In summary, I see a huge cost without a clear benefit.


ChadliaJerad · 2024-02-22T12:40:06Z

ChadliaJerad
Feb 22, 2024
Collaborator Author

These are two very different use-cases with very different requirements. As I understand the proposal, it completely relies on stopping and restarting processes as the mechanism for joining and leaving a federation. This is at odds with my understanding of the two use-cases.

Indeed, the use cases have different requirements. It is on purpose. But at the core, they both would need the support of joining and leaving during the execution. The proposal is meant to enable such use cases. It is not, indeed, the complete solution.

In the scenario where components come and leave, I would expect them to execute as continuous processes. A car will continue to operate when leaving an intersection. So the car is transient from the perspective of the crossing. But the crossing is also transient from the point of view of the car. How is this addressed in your proposal?

There are, indeed, different ways to solve a problem. And the use case is far from being a complete one.
Initially, the target scenario is the first interpretation you described (the crossing is persistent and the car is transient), because the crossing light management is meant to coordinate different cars. The second interpretation does not fall into this.
Furthermore, I rather see this as a design choice...
A pending question in the back of my head though is whether to support having all federates be transients or not... This will lead to even more costs. Currently, at least one federate needs to be persistent. An obvious question will be about how to define the start time of the federation. But so far, I do not see a concrete use case for this.

Typically, hot swapping explicitly requires that processes are not shut down and are replaced seamlessly without a gap in service. I don't see how this can be achieved with the current proposal. In particular, transfer of state (both parameters/state variables and the local event queue) doesn't seem to be a consideration at all.

Here, the transient to hot swap will stop and then the new instance will start. There is no absolute seamlessness.
The state transfer is rather a particular case (which can also dominate the use cases...). And this can be performed manually, in the sense that the developer can add a timed reaction where the considered state (defined by the developer) will be sent to a persistent federate where it will be stored. Once the second instance joins, the startup reaction will include retrieving the state.
I am currently working on an example that showcases this behavior. It will be great to discuss this further based on it. Meanwhile, feedback and suggestions are more than welcome.

I might be missing something. So, I would like to ask you to be concrete about your envisioned use-case. What are the requirements? Why are our existing solutions not sufficient? How does your proposal address these requirements? What is not addressed by your proposal? And what are the drawbacks? So far, the proposal sounds nice on a superficial level, but it is unclear which actual problem it solves and how the proposed solution actually addresses the requirements of the motivational examples.

I agree that a concrete and complete example will help significantly.
@edwardalee mentioned before, if I recall accurately, the case of fault-tolerant deployment in the cloud.
And I strongly think that bug fixes and upgrades at runtime are straightforward use cases.
But I argue that the contribution here is about how to preserve the timing semantics at runtime when dynamism is supported.
There is still work to be done, as you mention next, in terms of making the process safe...

I would also like to understand how this relates to modal models. In particular, the academic use-case with the Middle reactor seems like it could be implemented with modal models.

In modal models, behaviors are set at the design time. Transients will enable evolving with a different behavior during execution.
In the tests, however, the behavior across executions is the same. But in a real running case, we can hot plug a federate with a different behavior. I ran such an example during one of the LF meetings.

The proposal does not discuss any safeguards regarding the behavior of transient federates. If there are none, this effectively means that we compile an LF program assuming a certain interface of the transient reactor. But then we can replace this reactor with an arbitrary implementation at runtime.

I indeed overlooked adding a fixme in receive_connection_information() in rti_remote.c (with the latest PRs) to check if the interface of the new joining federate changed from the previous run or not, or if we decide to support type refinement (See 14.2 Type Equivalence and Refinement in Lee and Seshia...

This is problematic for two reasons. First, it opens a loophole in our semantics. Second, it is a severe security vulnerability. It is an open invitation to inject arbitrary code. This proposal appears to assume an educated, disciplined and benevolent user. But the reality is, that such a loophole will be used to circumvent our semantics (knowingly or unknowingly) and to attack a running application.

Currently, the implemented security mechanism (by @hokeun's team) is supported in transients. The use of federation ids helps as well, but it is not enough. Totally agree that the mechanism is as powerful as dangerous.
A step towards a workaround would be to constrain the acceptance of a hot swap mechanism... I have not started concretizing proposals so far... Any suggestion is more than welcome!

Future Directions

Add security

Security can rarely be "added" after the fact. It either is a design consideration from the ground up or there is no security.

As far as I know, authentication was added to LF after the core semantics were set and implemented. I am not aware that security was considered from the ground up, right? I know that this is a work in progress though.
I am very far away from being a security expert, but can you enlighten me on how it was considered so far, and which process is advised to follow?
In my limited understanding, the place in the flow where the decision will be made is clear. What remains is what constraints to opt for, and then they should be added.


edwardalee · 2024-02-22T15:57:51Z

edwardalee
Feb 22, 2024
Maintainer

I see this work as a much-needed first step towards having LF programs that can run reliably and usefully for months or years. I also see it as a first step towards fault tolerance, where a federate can fail, recover, and rejoin, or fail and be replaced. The key concept that this PR demonstrates is the development of agreement among affected federates about the logical time at which the joining federate joins. This is a natural extension of our startup mechanism, and it inherits (or can inherit... I don't think this is implemented) the same security (or lack of security) from the initial startup (which already has a nice authentication mechanism).

I think it would be a mistake to bury this work because it's an incomplete solution. We won't know what a complete solution looks like until we start building applications with a partial solution.

Notice that when a federate leaves because of failure (as opposed to resigning or being forced out in a hot swap), there is a fundamentally unavoidable source of possible inconsistencies, particularly with decentralized coordination. There is no way to ensure that all observers agree on the tag at which the federate left. I suspect we could prove this as a theorem. However, centralized coordination mitigates the risk because, assuming the RTI doesn't fail, all federates will agree on the tag of the last tagged message sent by a failed federate (but not on the last physical message, but this is probably OK with our semantics).

As for the use cases, I would go so far as to say that nearly every distributed application is a potential use case. If you talk to distributed systems people, they put most of their effort into dealing with transient participants in their applications. I would like to see us develop, for example, built-in support for quorum-based agreement, a relaxed (but disciplined) form of consistency that enhances availability.

For a canonical use case, I suggest a chat application. This is easy to build, obviously needs transient participants, and the guarantee we can provide is that all observers see chat messages in the same order. This becomes particularly interesting if you have separate but overlapping chat rooms. Any two observers that are in overlapping chat rooms will see messages in the same order even across chat rooms. This can help establish and enforce causality chains. Our first CAL Theorem paper has such an example, though without the multiple chat rooms, and hence much simpler. A particularly interesting challenge would be to create a decentralized version where there is no single persistent federate. This is not addressed in this PR, but it provides a good starting point.

As a side note, I think that supporting transient federates with decentralized coordination will be easier than with centralized. This PR addresses the harder of the two problems.

Once we extend this to support decentralized coordination, the next natural step would be a fault tolerant RTI. The RTI currently plays no role during execution in decentralized coordination, but with transient federates, it again has a role. This would be a perfect opportunity to realize a leader election schema for restoring a failed RTI. In this case, one interesting twist is that if we find ourselves with a partitioned network, we probably do want two RTIs, unlike the leader-election test case currently in the playground. An interesting question then becomes how to handle repair of the partitioned network.

I also agree with @ChadliaJerad that restoring state during a hot swap should be handled by the application, not by the framework, at least in the near term. However, long term, providing mechanisms for creating snapshots of state would be extremely useful. In all our current targets, none of which use languages with built-in persistent state (like Java), this would have to be done by providing a way for application developers to provide a serialize and deserialize function for each reactor. This is clearly out-of-scope for this PR, but it would make a great project.
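To give a feel for what a per-reactor serialize/deserialize pair could look like, here is a small C sketch for a reactor with two state variables. Everything in it is hypothetical (the `counter_state_t` struct, the function names, the fixed little-endian layout). It is not an existing LF or reactor-c API, and it does not address the local event queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state of one reactor, as chosen by the application developer. */
typedef struct {
    int32_t count;
    int64_t last_tag_time;
} counter_state_t;

#define COUNTER_STATE_BYTES 12 /* 4 bytes for count + 8 bytes for last_tag_time */

/* Write the low `nbytes` bytes of v in little-endian order. */
static void put_le(uint8_t *out, uint64_t v, size_t nbytes) {
    for (size_t i = 0; i < nbytes; i++) {
        out[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Read `nbytes` little-endian bytes into an unsigned value. */
static uint64_t get_le(const uint8_t *in, size_t nbytes) {
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++) {
        v |= (uint64_t)in[i] << (8 * i);
    }
    return v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
size_t serialize_counter_state(const counter_state_t *s, uint8_t *buf, size_t cap) {
    assert(s != NULL && buf != NULL);
    if (cap < COUNTER_STATE_BYTES) return 0;
    put_le(buf, (uint32_t)s->count, 4);
    put_le(buf + 4, (uint64_t)s->last_tag_time, 8);
    return COUNTER_STATE_BYTES;
}

/* Returns false if the buffer is too short to hold a snapshot. */
bool deserialize_counter_state(const uint8_t *buf, size_t len, counter_state_t *out) {
    assert(buf != NULL && out != NULL);
    if (len < COUNTER_STATE_BYTES) return false;
    out->count = (int32_t)(uint32_t)get_le(buf, 4);
    out->last_tag_time = (int64_t)get_le(buf + 4, 8);
    return true;
}
```

In the manual scheme described earlier in the thread, the departing instance would send such a buffer to a persistent federate, and the startup reaction of the replacement would fetch and deserialize it.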


lhstrh · 2024-02-22T17:56:36Z

lhstrh
Feb 22, 2024
Maintainer

I think there is a big distinction between "bury this work" and "publish about this work but do not mainline the feature because it is experimental." Just because work exists and is interesting (which I agree it is) does not mean that it is ready to be merged; not merging also doesn't imply that we're burying it. If we can't agree about this, then we need to have another hard look at the RFC track we're outlining and re-evaluate our willingness to commit to it, because that effort is explicitly meant to offer a clear process to navigate discussions like the one we're having in this thread. I think that it would be a mistake to ignore that process and forge ahead like we used to, with the criterion that if something is interesting then it must be worth merging.

I also want to emphasize that in the open-source community, there is absolutely no shame in contributions not being mainlined. In fact, the default practice is to fork, and the number of forks of a project is actually seen as an important measure of its success. Usually, external contributors are driven by a specific need for a feature that they choose to develop of their own accord without any sort of approval or the expectation that it will get merged upstream. If they want to go through the effort of proposing such a merge, and if it finally does get merged, then that's great. But if a merge does not happen, it just means that maintenance of the feature falls on the feature developer rather than the mainline maintainers. I don't think it's helpful to use negatively charged language to describe the latter situation. If anything, we need to be realistic and responsible when deciding what maintenance burden we're willing and able to take on, and this will be critical to the survival of the project.

1 reply

edwardalee Feb 23, 2024
Maintainer

These are all good points. Although the process didn't exist when this effort started more than a year ago, I think this would be a good candidate for our new RFC process (thanks @cmnrd for establishing that!). Does someone want to take charge of opening an RFC?

To maybe help get that started, as I see it, this effort has (for now) very limited scope. Right now, a federation runs as a single monolithic application, albeit split across machines. When the federation starts, all federates that will ever join have to join right at the start. And the assumption is that if any federate leaves the federation, the whole federation should shut down. There are applications for which that design makes sense, but I think there are many more applications that need more flexibility.

The scope of this work, as I see it, is to solve one key problem: How does a federate join a federation that is already running? Solving this requires solving the following subproblems:

1. Enabling a federation to start without all federates having connected. Which ones are required for starting and which ones are not? And what does it mean for a federate to be "absent"?
2. How do we ensure that when a federate joins, all federates affected by it agree on the tag at which that happened?
3. How can a federate leave a federation in an orderly fashion, and what is the tag at which that happens? (Disorderly departures will also occur, but I believe that is a separate issue, out-of-scope for this RFC.)

I see the "hot swap" mechanism as a small embellishment that follows almost immediately from having solved 1-3 above.

In particular, what is not in scope for this RFC:

How to ensure that a federate that joins after startup is somehow "valid"? I believe this problem is no different from how we ensure that a federate that joins at startup is valid. We have an authentication mechanism, thanks to @hokeun 's team, which solves a key part of the problem, but it is only part of the problem. A more complete solution is needed both at startup and when any transient federate joins.
How does a "hot swapped" component inherit state from the component it is replacing? This is an important question, but I believe any solution to this problem will be pretty orthogonal to this RFC and could be part of much bigger effort to improve fault tolerance. In the near term, this can be left up to the application designer. They just have to store and restore the state of their reactors.
How to build an LF application that may, for part of its lifetime, participate in a federation, and for other parts of its lifetime, operate autonomously. I think this would be a very interesting thing to do, but it is out of scope for this RFC.

cmnrd · 2024-02-23T13:16:54Z

cmnrd
Feb 23, 2024
Maintainer

I think it would be a mistake to bury this work because it's an incomplete solution. We won't know what a complete solution looks like until we start building applications with a partial solution.

I agree. My main criticism with the proposal, as it is, is that it does not clearly state the problem that it solves. Instead, it sketches problems that we are not even close to solving. I would even go as far as saying that the stated problems are unrelated to the proposed solution. If we had an actual hot-swapping mechanism (like for instance Erlang implements it), then rejoining wouldn't be a problem as even the RTI wouldn't need to notice that the implementation changed. And in the intersection example, it is a requirement (not a design decision) that both the vehicles and the intersection operate independently and continuously. In this scenario, both parties have already started when they meet, and they already have events in the queue. This setting is very different from the one considered in the proposal.

We need to be conscious about the expectations that we set, both internally and externally. If we tell people that LF supports hot-swapping, then we better have a solution that lives up to the user expectations. Otherwise, we will have frustrated users. And internally, I strongly believe that we should evaluate design proposals based on whether the stated problem is relevant, whether the design effectively addresses the problem, and at which costs it does so. The focus should be on the actual problem solved, not on shiny problems it might solve sometime in the future. Otherwise, we risk being deluded by the promises, and less perceptive to understanding the costs. So if I ask myself if the proposal meets the expectation that it sets, then the answer is clearly no.

This is not to say that there is no value in the proposal. I have my concerns about the concrete integration into the language, but the conceptual considerations on what it means for a federate to (re)join are certainly relevant. What I would like to see is a proposal that clearly states the problem (if possible and applicable based on a use-case), that identifies requirements, that describes the design and how it addresses the problem, and that openly discusses drawbacks and costs. I also think that an RFC would be the appropriate format for this.

Having such an RFC, we could have a more focused discussion on the relevance of the problem to LF, the effectiveness of the solution, and the involved costs. This would then hopefully prepare us to decide if we want to integrate the proposed solution in LF mainline or not.


lhstrh · 2024-02-26T03:01:02Z

lhstrh
Feb 26, 2024
Maintainer

The proposal does not discuss any safeguards regarding the behavior of transient federates. If there are none, this effectively means that we compile an LF program assuming a certain interface of the transient reactor. But then we can replace this reactor with an arbitrary implementation at runtime.

I indeed overlooked adding a fixme in receive_connection_information() in rti_remote.c (with the latest PRs) to check if the interface of the new joining federate changed from the previous run or not, or if we decide to support type refinement (See 14.2 Type Equivalence and Refinement in Lee and Seshia...

I mentioned this several times before, but I'll mention it again: Lee and Seshia's definition of refinement is not sufficient to guarantee that a component is compatible with a federation. If a substitution happens with a component that has the same ports but a different internal structure, then this can easily lead to deadlock. I suppose this can be sidestepped by limiting the scope of the mechanism such that it is explicitly disallowed to insert new implementations of components at runtime, but that is not what the title and description of this discussion suggest.
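For reference, the kind of interface check alluded to in the fixme could look roughly like the sketch below. All type and function names here (`port_desc_t`, `federate_interface_t`, `interface_unchanged`) are hypothetical and do not come from rti_remote.c. Note that it only compares port names, types, and directions; as argued above, passing such a check says nothing about internal structure and therefore does not rule out deadlock.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a single port; not a reactor-c type. */
typedef struct {
    const char *name;   /* port name as declared in the LF program */
    const char *type;   /* textual type of the port, e.g. "int" */
    bool is_input;      /* true for an input port, false for an output port */
} port_desc_t;

/* Hypothetical description of the interface a federate announces on joining. */
typedef struct {
    const port_desc_t *ports;
    size_t num_ports;
} federate_interface_t;

/* Two ports match if direction, name, and type are all identical. */
static bool port_equal(const port_desc_t *a, const port_desc_t *b) {
    return a->is_input == b->is_input
        && strcmp(a->name, b->name) == 0
        && strcmp(a->type, b->type) == 0;
}

/*
 * Return true if the joining federate declares exactly the same ports,
 * in the same order, as the previous run of that federate.
 * This is a purely structural check (an assumption of this sketch).
 */
bool interface_unchanged(const federate_interface_t *previous,
                         const federate_interface_t *joining) {
    if (previous->num_ports != joining->num_ports) {
        return false;
    }
    for (size_t i = 0; i < previous->num_ports; i++) {
        if (!port_equal(&previous->ports[i], &joining->ports[i])) {
            return false;
        }
    }
    return true;
}
```

A check like this would reject a federate whose port list changed between runs, but it would accept any implementation that keeps the same ports, which is exactly the case that can still deadlock.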

Another (more practical) concern about the feasibility of providing "bug fixes and upgrades" while a federation is running is that I don't think we currently have a way to compile federates independently. This is mostly because the code that each federate compiles down to is very federation-specific.

Therefore, I have to agree with @cmnrd that we're quite a ways away from offering robust "hot swap" capability, which is a deep topic in its own right, with a number of unaddressed technical challenges, some of which are highly specific to our current implementation. For that reason, I encourage us to narrow down the scope of the discussion, focus on the semantics of "transience," and resist the temptation to consider the contributions in #2213 and lf-lang/reactor-c#358 as potential enablers of lofty goals and complex functionality that up to this point are merely speculative.

As far as my understanding goes, a transient federate is a federate that:

is not required at startup;
can (re)join at an arbitrary later time; and
can leave at an arbitrary later time (without the federation crashing).
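The three properties above can be captured in a very small connection-state model. This is only a sketch under stated assumptions: the names (`federate_t`, `federation_can_start`, `federate_join`, `federate_leave`) are made up for illustration and are not part of the RTI, and treating the departure of a persistent federate as the point where the federation can no longer continue is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection state tracked for each federate. */
typedef enum { NOT_CONNECTED, CONNECTED } fed_state_t;

/* Hypothetical per-federate record; not an RTI data structure. */
typedef struct {
    bool is_transient;
    fed_state_t state;
} federate_t;

/* Property 1: only persistent federates are required at startup. */
bool federation_can_start(const federate_t *feds, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!feds[i].is_transient && feds[i].state != CONNECTED) {
            return false;
        }
    }
    return true;
}

/*
 * Property 2: once the federation has started, only a transient
 * federate may (re)join. Returns true if the join is accepted.
 */
bool federate_join(federate_t *fed, bool federation_started) {
    if (fed->state == CONNECTED) {
        return false; /* already present */
    }
    if (federation_started && !fed->is_transient) {
        return false; /* persistent federates cannot re-join */
    }
    fed->state = CONNECTED;
    return true;
}

/*
 * Property 3: a transient federate may leave without ending the
 * federation. Returns true if the federation may keep running
 * (assumption: a persistent federate leaving ends it).
 */
bool federate_leave(federate_t *fed) {
    fed->state = NOT_CONNECTED;
    return fed->is_transient;
}
```

Nothing in this model touches tags, timers, or interfaces, which is the point: it isolates "transience" from the hot-swap questions.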

If we can focus on that functionality and forget about everything else, I think we'll have a much better shot at reaching a common understanding.

0 replies

erlingrj · 2024-04-17T14:11:25Z

erlingrj
Apr 17, 2024
Maintainer

I have been thinking about executing mixed-criticality systems using federated LF. My conclusion thus far is that different criticality levels should be coordinated in a decentralized manner. We must accept that federates crash and that they are restarted and re-join the federation without any issue. I think we could start by answering Marten's questions for decentralized coordination:

1. is not required at startup;
   Decentralized coordination could be implemented without any blocking at startup. We could allow each federate to start executing directly after it establishes contact with the RTI. This would lead to the federates not sharing a common start tag, but we seem to agree that this is not that important and that logical simultaneity has to be achieved with actions etc.
2. can (re)join at an arbitrary later time; and
   Straightforward if we allow non-deterministic start tags.
3. can leave at an arbitrary later time (without the federation crashing).
   This is not a problem with current decentralized coordination.

If we can agree on (1) and (2), we could quite quickly prototype fault tolerance/dynamism based on decentralized coordination.

0 replies
