Mobile agent not receiving credentials after reconnecting to the mediator #1570
Comments
If your mediator is not connected to a persistent database, you'd lose all connections and messages when it restarts. But that doesn't seem to be the problem here. Have you looked at the logs to see whether the messages are being queued or sent directly to the device? In theory, when using pickup v2, the messages sent by the mediator should be acked by the client, and only then are they removed: https://github.com/hyperledger/aries-rfcs/tree/main/features/0685-pickup-v2#messages-received @genaris is it correct that this is how it is implemented? If you can provide logs of both sides and analyze them, we can probably figure out a bit more about the underlying issue here.
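For reference, the acknowledgement mentioned above is the pickup v2 `messages-received` message. Roughly (per RFC 0685; field names may differ slightly and this is just the wire-level message, not an AFJ API):

```typescript
// Approximate shape of the pickup v2 acknowledgement: the recipient lists the
// ids of the messages it received, and the mediator can then drop them from
// its queue.
const messagesReceived = {
  '@id': 'generated-message-id',
  '@type': 'https://didcomm.org/messagepickup/2.0/messages-received',
  message_id_list: ['queued-message-id-1', 'queued-message-id-2'],
}
```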
Hi, thanks for your kind offer to help. In the mobile log, you'll notice that the invitation (around 12:26:02) goes well. In the mediator log, you'll notice that in the meantime, the mediator receives 3 'forward' messages, at 2023-09-06T10:26:07.541Z, 2023-09-06T10:26:22.418Z and 2023-09-06T10:26:26.049Z. The logs don't say whether these were forwarded, stored or queued, but since I'm not seeing any "Send outbound message" for these in the mediator log, and not seeing any corresponding "Received message with type" in the mobile log, I'm assuming they are 'stored' somewhere... I hope this helps. Eric
@TimoGlastra yes, it works like that, but in this case it seems to me that the problem is related to the way WebSockets are being managed. Looking at the logs, it seems that at the moment of processing the first forward message (2023-09-06T10:26:07.541Z) the mediator still sees the WebSocket as open, so it sends the message through the existing session. So when the mediator sends the Forward message to that 'phantom' WebSocket, it thinks everything is fine and considers the message delivered (no need to queue it anywhere). However, the app won't receive the message and will not have any way of retrieving it, making the message effectively lost (which is certainly terrible). Why does it think a WebSocket is open when it is not? It can be caused by a number of reasons that I've identified based on my experience with mobile apps and AFJ on both sides:
In practice, most of these issues can be solved or reduced by implementing a heartbeat mechanism at the mediator level, where it does a ws.ping every n seconds (we found 3 to be a good number). This implies some updates to the transport layer. But I think that at some point, in Live mode, it will be necessary to stop using regular "Forward" messages (which don't require any acknowledgement) and use Delivery instead, which, as @TimoGlastra mentioned, does require confirmation. See the discussion here: hyperledger/aries-rfcs#760 I'm testing a mechanism like that, which for the moment is a bit tricky to 'standardize', but I think it can be useful to add to AFJ in the next major version (as it will need to break the very dated MessageRepository API).
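For what it's worth, a minimal sketch of such a heartbeat, assuming the mediator uses the `ws` package and has access to the WebSocket server instance (the function name `installHeartbeat` is illustrative, not an AFJ API):

```typescript
import WebSocket, { WebSocketServer } from 'ws'

// Mark each socket alive on 'pong', ping every few seconds, and terminate
// sockets that never answered the previous ping so their sessions are removed
// and further messages get queued instead of sent into the void.
const HEARTBEAT_INTERVAL_MS = 3000

interface TrackedSocket extends WebSocket {
  isAlive?: boolean
}

export function installHeartbeat(server: WebSocketServer) {
  server.on('connection', (socket: TrackedSocket) => {
    socket.isAlive = true
    socket.on('pong', () => {
      socket.isAlive = true
    })
  })

  const timer = setInterval(() => {
    for (const socket of server.clients as Set<TrackedSocket>) {
      if (socket.isAlive === false) {
        // No pong since the last ping: consider the peer gone.
        socket.terminate()
        continue
      }
      socket.isAlive = false
      socket.ping()
    }
  }, HEARTBEAT_INTERVAL_MS)

  server.on('close', () => clearInterval(timer))
}
```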
Hi, thanks for that. I did notice that the first message occurred during a gray area. But what about the two other ones? Regarding a switch from "Forward" messages (which don't require any acknowledgement) to Delivery: is that a change within the mediator itself, or does it require a change in the way other components (such as mobile or aca-py) talk to the mediator? I noticed that "forward" messages have no identified sender, so I'm assuming the sender is the mediator itself; is that a valid assumption?
Hi again, FWIW, I'm posting another set of logs, with no gray area. In this session, you'll notice the following:
- at 11:52:23.571Z, the mediator receives a forward message
- at 11:53:02, mobile transports are shut down
- in the meantime, the mediator receives 6 forward messages

But none of these make it to mobile.
Needless to say, this is very concerning.
Alternatively, it seems from https://github.com/hyperledger/aries-rfcs/tree/main/features/0685-pickup-v2#messages-received that forwarding messages will only be attempted if live mode is turned on. Would it make sense to turn live mode off from mobile before switching to the background? How can we do that?
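If I read RFC 0685 correctly, live mode is toggled with a `live-delivery-change` message; a rough sketch of its shape (the wire-level message only, not an AFJ API call, and how the mobile framework exposes sending it is a separate question):

```typescript
// Ask the mediator to stop pushing messages live and queue them instead.
const liveDeliveryChange = {
  '@id': 'generated-message-id',
  '@type': 'https://didcomm.org/messagepickup/2.0/live-delivery-change',
  live_delivery: false,
}
```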
I think it is the first one, because when the mediator queues a message it does print a log entry for it. So IMHO the mediator agent considers there is an open WebSocket for the mobile agent, and that's why it forwards the messages. It does not receive any error from lower-level layers, so it does not throw any error. Based on the mediator log from the time you mentioned the mobile agent transports were stopped and the app went to the background, I see (leaving only the relevant parts):
And also the mobile log at that particular time (ordered backwards): "2023-09-07T11:53:02.875Z", "com.getpocketdev", "Successfully connected to WebSocket wss://dev.getpocket.io/aries_js". It seems to me that the mobile agent tried to re-connect to the mediator as soon as you stopped the transports for the first time (11:53:01.786Z), and it didn't actually close the socket when you stopped the transports the second time and the app went to the background (you can see a corresponding entry in the log). This makes me think that the resulting WebSocket was not closed gracefully, so the mediator considered it to be open until 11:53:34 (around 30 seconds later), when it received two WebSocket closed events. We don't have further information, but maybe those correspond to both the mobile agent and the agent you are using to send messages to it. It might also be possible for the WebSocket to actually be open and the connection between the mobile device and iOS to be responsive. To make sure about that, as I suggested previously, I think we can use a heartbeat mechanism in the inbound transport.
Thanks for the analysis.
I don't see
Not sure if you are referring to the same thing as I do: the
Thanks for this.
I also notice, on the mediator side, that in WsInboundTransport there is no error handler installed when the socket is connected (line 34), and that send() doesn't check for errors either (line 96). Is this intended? Would it make sense to check for errors on send() and raise the corresponding error if it failed? (Happy to try that out myself.)
The heartbeat mechanism will not prevent all message losses, but it certainly can help to minimize this kind of issue. That said, in my opinion we should require application-level ACKs so we won't depend on anything that may go wrong at the transport layer. We can either leverage the DIDComm 'please-ack' decorator (and ack message) or use the pickup protocol's Delivery message, which embeds its own way of acknowledging. This will need some effort, because we'll need to handle a sort of message queue for 'Live Mode' (which, BTW, is not currently fully compliant with what's stated in pickup protocol 2.0) that the mediator can use to track and retry deliveries.
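As an illustration of that idea only (a sketch, not an AFJ implementation; `QueuedMessage` and `sendDelivery` are hypothetical names), such an ack-gated queue could look roughly like this:

```typescript
// A message is only removed once the recipient acknowledges it (e.g. via a
// pickup v2 messages-received); otherwise it stays queued and is retried.
interface QueuedMessage {
  id: string
  payload: Record<string, unknown>
  attempts: number
}

class AckGatedQueue {
  private pending = new Map<string, QueuedMessage>()

  enqueue(message: QueuedMessage) {
    this.pending.set(message.id, message)
  }

  // Called when the recipient confirms receipt of these message ids.
  acknowledge(messageIdList: string[]) {
    for (const id of messageIdList) this.pending.delete(id)
  }

  // Retry everything still unacknowledged, e.g. on reconnect or on a timer.
  async retry(sendDelivery: (message: QueuedMessage) => Promise<void>) {
    for (const message of this.pending.values()) {
      message.attempts += 1
      await sendDelivery(message)
    }
  }
}
```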
I don't think this is intended. Actually, it is possible that the callback was not available or did not work like that at the time it was written (recently we've upgraded the underlying WebSocket library). I didn't try it myself yet, but I think it would be a good idea if you could test it in a custom inbound transporter. Hopefully it will catch an error and help us progress on the fix needed to make the mediator side more robust.
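If it helps, here is a rough sketch of what such a custom inbound transporter could do with the `ws` API: install an 'error' listener when the socket connects, and use the send() callback to surface failures instead of ignoring them (function names here are illustrative):

```typescript
import type WebSocket from 'ws'

// Log low-level socket errors instead of letting them pass silently.
export function attachErrorHandling(socket: WebSocket, logger: { error: (msg: string) => void }) {
  socket.on('error', (error) => {
    logger.error(`WebSocket error: ${error.message}`)
  })
}

// Wrap send() so a flush failure rejects and the message can be queued.
export function sendWithErrorCheck(socket: WebSocket, data: string): Promise<void> {
  return new Promise((resolve, reject) => {
    socket.send(data, (error) => {
      if (error) reject(error)
      else resolve()
    })
  })
}
```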
Noted, I'll try that out and keep you posted.
Hi there. After experimenting (outside Aries) with iOS and react-native, and improving the logs on both sides, we've narrowed down the disaster scenario as follows:
This is disastrous for fire-and-forget protocols, such as Forward messages.

The 'solution' proposed above in WsInboundTransport.txt consists in keeping the WebSocket alive by sending pings at regular intervals. This may be viable for server-to-server communications (where the number of peers would be small), but it is definitely not for mobile agents, because the server would end up keeping alive thousands of WebSockets for inactive mobile devices (until the user actually exits the mobile app).

For now, we have implemented a workaround: rather than send a message blindly, we send a ping first and wait for the corresponding pong before sending the message itself. We'll see how it works, but we already know it is not bullet-proof, because the mobile app could very well become inactive between sending the pong and receiving the message.

By nature WebSockets are unreliable, so I believe that using fire-and-forget protocols over WebSocket is likely to fail frequently. I am impatient to see how #76 evolves, and when it will become available in aries-framework-javascript.
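For clarity, here is a minimal sketch of that ping-before-send idea, assuming the `ws` package on the mediator side (`queueMessage` is a stand-in for whatever queues the message for later pickup, not an actual AFJ function):

```typescript
import type WebSocket from 'ws'

// Resolve true if the peer answers the ping within the timeout, false otherwise.
function waitForPong(socket: WebSocket, timeoutMs = 2000): Promise<boolean> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs)
    socket.once('pong', () => {
      clearTimeout(timer)
      resolve(true)
    })
    socket.ping()
  })
}

async function sendIfResponsive(
  socket: WebSocket,
  data: string,
  queueMessage: (data: string) => Promise<void>
) {
  if (await waitForPong(socket)) {
    socket.send(data)
  } else {
    // No pong: treat the socket as dead and queue the message for pickup.
    await queueMessage(data)
  }
}
```

As noted above, this is not bullet-proof: the app can still go inactive between answering the pong and receiving the message.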
Just reading this last note — not the whole stream. Indicio has implemented SocketDock, which I think addresses the questions raised here. Perhaps it can help. Quick overview — SocketDock is a component that handles web sockets in front of a scalable component (say, Aries) that can then just use HTTPS to send messages reliably.
Thank you @ericvergnaud, your findings are certainly very, very valuable. It is true that the ping-pong solution works well in certain scenarios (like agents going offline), but not in the one you mention about apps going to the background and then to the foreground on iOS. Regarding this, I have a question: even if you stop the WsOutboundTransport before the app goes to the background, does iOS keep the WebSocket open? So there is no way to actually force it to close on this platform?
100% agree (especially the one about being impatient 😝). I'm working on it, and hopefully it will be integrated in the next major release (0.5.0).
@genaris we've tried "stop the WsOutboundTransport before the app goes to the background". It didn't work well, because the underlying socket wasn't getting closed deterministically (remember me mentioning that earlier?). It could be because transport.close is async, so its execution is queued in the event loop, or it could be that a new socket was opened by another thread (this isn't single-threaded V8 JavaScript!).
@swcurran thanks for that. Looking into it.
I don't think the async is a problem, as long as you are awaiting the call. I know this does not make sense, as when we stop a transporter one would understand that no new sessions may be opened, but for the moment it is working like that. A way to get around this issue can be to call
@swcurran at first glance it seems the product is meant to fill a gap for backends without WebSocket support, but they don't say anything about WebSocket transport reliability...
@genaris thanks for this, we'll try it out.
Yes, I think this is how we should solve this. It also makes the solution agnostic of the transport being used, and means it won't be impacted by any transport-level issues. We should probably:
In this case, both sides make sure that messages are not lost, but also not processed multiple times.
@TimoGlastra indeed, the receiver needs to guarantee idempotence, i.e. the same message received twice should only be processed once. IMHO this should be managed by the framework.
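A minimal sketch of what such framework-level idempotence could look like (in-memory only for illustration; a real implementation would persist the processed ids and bound the set):

```typescript
// Remember the ids of messages already processed and skip duplicates.
const processedMessageIds = new Set<string>()

async function processOnce(messageId: string, handle: () => Promise<void>): Promise<void> {
  if (processedMessageIds.has(messageId)) {
    // Duplicate delivery (e.g. a retry after a lost ack): acknowledge but skip.
    return
  }
  processedMessageIds.add(messageId)
  await handle()
}
```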
@genaris whilst trying out your suggestion, we've also tried ping-pong-ing the receiver before sending the message. For those interested, here is our code change in WsInboundTransport.ts:
To avoid confusion, I HATE this code :-) |
Is there any reliable solution for this yet? As a workaround, we stored the connection request message in the connection record itself, and did the same for the credential offer message, as a retry mechanism. Whenever a new invitation is scanned, we check whether there is already a connection id for that invitation; if the connection is not in the response-received state, we retrieve the stored connection request message and send it to the issuer again. The same thing is handled on the issuer side: if the connection is not in the acked state, the response is resent. Some changes were made to the connection request, connection trust ping, and credential offer accept flows for this. Can I raise a PR for these changes?
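To illustrate the described workaround (purely a sketch; the record field and helper functions below, such as `findConnectionByInvitation` and `sendToIssuer`, are hypothetical names, not the actual changes or AFJ APIs):

```typescript
// Retry a stored connection request when the same invitation is scanned again
// and the connection never reached the response-received state.
interface ConnectionRecord {
  id: string
  state: string
  storedConnectionRequest?: Record<string, unknown>
}

async function onInvitationScanned(
  invitationId: string,
  findConnectionByInvitation: (id: string) => Promise<ConnectionRecord | null>,
  sendToIssuer: (message: Record<string, unknown>) => Promise<void>
) {
  const connection = await findConnectionByInvitation(invitationId)
  if (connection && connection.state !== 'response-received' && connection.storedConnectionRequest) {
    // The issuer never answered (or the answer was lost): retry the stored request.
    await sendToIssuer(connection.storedConnectionRequest)
  }
}
```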
Hi there,
we are using aries-framework-javascript in 2 instances:
- a mobile agent (the holder app)
- a mediator
Credentials themselves are issued using aca-py.
Everything works fine when the mobile app is active.
As you're probably aware, iOS inhibits network connections when an app goes to the background.
Our understanding is that the mediator would store the messages received when the mobile app is unreachable, and would send them once the mobile app reconnects.
We're seeing a consistent scenario where credentials transmitted to the mediator through messages of type "did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward" are never delivered to the mobile agent.
When the mobile agent reconnects, it sends a "https://didcomm.org/messagepickup/2.0/status-request" message to the mediator, which responds with a "https://didcomm.org/messagepickup/2.0/status" message where message_count = 0...
Since ensuring consistent async delivery of messages is the primary purpose of the mediator, it seems we are missing something, but what?
We're using 0.4.0 on mobile side and 0.4.1 on mediator side (running the provided Docker).
Worth noting, our mediator is not connected to a persistent database.
Help would be appreciated.
Thanks