Mismatch between output.json from octorpki and gortr #82
Did you save the JSON? What was the affected prefix and both ASNs? How long was it running? Was it present on GoRTR's JSON endpoint? |
I was hoping the updated version of gortr would fix this issue, but it did not. We are now running v0.14.6 and it happened again. I did save output.json from both octorpki and gortr. There were roughly 5,000 differences. I am not sure how long gortr had not been serving an updated output.json (next time I will know; I added that check). Below is an example of a diff for one prefix.
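For reference, a comparison of this kind can be produced by sorting the ROA lists of the two saved files and diffing them. A minimal sketch with jq and bash, assuming both copies follow the usual output.json layout with a roas array of prefix/asn/maxLength entries (the filenames below are placeholders):

# octorpki-output.json and gortr-output.json are placeholder names for the two saved copies
$ diff <(jq -S '.roas | sort_by([.prefix, .asn, .maxLength])' octorpki-output.json) \
       <(jq -S '.roas | sort_by([.prefix, .asn, .maxLength])' gortr-output.json)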
|
An update. It happened again on one of the 8 RPKI servers. From what I see in the syslog, gortr stopped requesting updates from octorpki. Please see below, searching for "Updated added, new serial"; the messages stopped after 08:09:49 and did not resume until after a restart at 13:51:55.
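The check itself boils down to finding the last occurrence of that log line. A sketch, assuming gortr logs via syslog to /var/log/messages as mentioned later in the thread:

$ grep 'Updated added, new serial' /var/log/messages | tail -5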
|
Happened once again. Interestingly, it happened on the same server as previously, which defies probability. This server has more router connections than the others. Is that a clue? |
Are you setting the -useserial parameter? |
Q: Are you setting the -useserial parameter?
C: It almost looks like it restarted according to the logs.
Q: How many sessions are we talking about?
Q: What's the version of OctoRPKI? |
Thank you for the feedback. |
We are not tracking the metrics in Prometheus. I was planning to start tracking the metrics yesterday, but did not get to it. I wish I had, because it happened again. Curiously, again on the same server. Tracking the metrics is the next step and I will get on that. |
I'm currently testing a nagios-compatible monitoring script (checking for RTR serial updates); I will release it in a few days once testing is complete.
I have been building a small tool that can monitor two servers (JSON, RTR, or both) and output a diff and Prometheus metrics. You can run it like this:

$ docker run -ti -p 8080:8080 cloudflare/rtrmon -primary.host tcp://rtr.rpki.cloudflare.com:8282 -secondary.host https://rpki.cloudflare.com/rpki.json

$ curl --silent http://localhost:8080/diff.json | jq '.'
{
"metadata-primary": {
"last-fetch": 1603930794,
"url": "tcp://rtr.rpki.cloudflare.com:8282",
"serial": 635,
"session-id": 40872,
"count": 191018,
"rtr-refresh": 3600,
"rtr-retry": 600,
"rtr-expire": 7200
},
"metadata-secondary": {
"last-fetch": 1603930793,
"url": "https://rpki.cloudflare.com/rpki.json",
"serial": 0,
"session-id": 0,
"count": 191018,
"rtr-refresh": 0,
"rtr-retry": 0,
"rtr-expire": 0
},
"only-primary": [],
"only-secondary": []
}

$ curl http://localhost:8080/metrics
[...]
# TYPE rpki_roas gauge
rpki_roas{server="primary",type="diff",url="tcp://rtr.rpki.cloudflare.com:8282"} 0
rpki_roas{server="primary",type="total",url="tcp://rtr.rpki.cloudflare.com:8282"} 191018
rpki_roas{server="secondary",type="diff",url="tcp://rtr.rpki.cloudflare.com:8282"} 0
rpki_roas{server="secondary",type="total",url="tcp://rtr.rpki.cloudflare.com:8282"} 191018
# HELP rtr_serial Serial of the RTR session.
# TYPE rtr_serial gauge
rtr_serial{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 635
rtr_serial{server="secondary",url="https://rpki.cloudflare.com/rpki.json"} 0
# HELP rtr_session ID of the RTR session.
# TYPE rtr_session gauge
rtr_session{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 40872
rtr_session{server="secondary",url="https://rpki.cloudflare.com/rpki.json"} 0
# HELP rtr_state State of the RTR session (up/down).
# TYPE rtr_state gauge
rtr_state{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 1
# HELP update Timestamp of last update.
# TYPE update gauge
update{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 1.603930794e+09
update{server="secondary",url="https://rpki.cloudflare.com/rpki.json"} 1.603930793e+09

@skakee would you be able to give it a try? |
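A simple automated check can be built directly on the metric names shown above. A minimal sketch that flags any non-zero difference between the two sources, assuming rtrmon is listening on localhost:8080 as in the example:

$ curl --silent http://localhost:8080/metrics \
    | awk '/^rpki_roas.*type="diff"/ && $NF != 0 { print "WARNING: ROA count mismatch:", $0 }'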
So I am all for trying out rtrmon to try to troubleshoot the issue, but I would like to use it against production running the recommended versions of the binaries. Here's why... So yes, I can definitely give it a try. But before that, I'd like to get on the same page as to which versions of octorpki and gortr we should be running. We don't use Docker, we use the Go binaries.

Now... by default, octorpki used to bind to port 8080 and gortr also to port 8080 for HTTP, but we changed it at initial installation with the "-metrics.addr :8281" option. Then a change was made where octorpki would bind to 8081 by default, so as not to interfere with gortr's default port. This was no issue for us since we were not using the default port 8080 for gortr. So, in our installation octorpki uses port 8080 and gortr 8281 for HTTP.

I see that the current binary release of octorpki, v1.1.4, still binds by default to port 8080. Only the pre-release version, v1.2.0-pre, binds to port 8081. If v1.1.4 is "legacy", and only that version is causing the issue (maybe?), there is little value in troubleshooting "legacy" code. Is 1.2.0 stable, and should it be made available as a general release? Similarly, there is now a pre-release version of gortr available, v0.14.7-pre. Again, same argument: if v0.14.6 is soon to be "legacy", is there value in troubleshooting it for this issue? Which versions of the Linux binaries of gortr and octorpki are recommended to run in production?

Also, this issue is very intermittent. Since I first reported it, it has occurred 4 times on one of the 8 hosts and once on another. The two hosts are "sisters" in that they are at the same location and the same set of routers connects to them. That set of servers had 2,611 connections. Another three sites have sets of servers with 2,660, 774 and 825 connections. This issue has not occurred in the last two weeks at those locations. The reason I am mentioning it is that the test with rtrmon may not reveal anything if I run it in dev with a handful of connections. |
Understood. FWIW: packages and containers should use the same code, but they are compiled in different environments, which can make a difference in some cases.
Ah yes, I ended up messing up the default ports. Since I had feedback from a few people wanting to run this software on the same machine, I decided to fix it as part of v1.2.0.
There have been a few additions since 1.2.0-pre. It took a while to roll out due to testing.
gortr, rtrdump and rtrmon are part of the same repository. The v0.14.7 version only introduces rtrmon and a library structure change. There haven't been any new fixes or features for gortr since v0.14.6.
I recommend using gortr v0.14.6 and octorpki v1.2.0.
Thank you! Metrics collected regularly should definitely help the investigation. |
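One lightweight way to collect the data regularly is to snapshot rtrmon's diff endpoint on a schedule. A sketch using cron and jq, assuming rtrmon on localhost:8080 and a hypothetical log file path:

# crontab entry: every 10 minutes, record the size of the only-primary/only-secondary sets
*/10 * * * * curl --silent http://localhost:8080/diff.json | jq -c '{t: (now|floor), only_primary: (.["only-primary"] | length), only_secondary: (.["only-secondary"] | length)}' >> /var/log/rtrmon-diff.log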
This is a fantastic response. Thank you! I hope you understand I wanted to clear this up, as testing legacy stuff would serve neither of us. :) I'll deploy octorpki v1.2.0 as soon as the GA release is available and start trying out rtrmon. Thanks again. |
I see that octorpki is available as of 16 minutes ago. Starting on the updates! |
The plan was to update octorpki to v1.2.0 on all the RPKI servers, then gather stats with rtrmon. I never got to the second part. Updating octorpki to v1.2.0 caused gortr to keep disconnecting all the routers every couple of hours on all the servers. I don't know why. I have the logs from /var/log/messages. I had to roll back octorpki to v1.1.4 on all the servers. Will keep monitoring the situation. |
Did the routers show an error? It could be a wrong PDU. How is the JSON distributed? |
I'm having trouble with
|
Could the |
A good catch. But, alas, that did not fix it. The line now:
|
I just had a look at the code; I missed some parts that handle SSH.
Can you try rtrmon v0.14.7-pre2?
Thank you! rtrmon-v0.14.7-pre2-linux-x86_64 connected fine using ssh/password. I'll update octorpki on one of our 8 RPKI servers. I'll set up periodic (10 min) monitoring of the current and updated servers for comparison. I have a good feeling we will nail the issue. :) |
Yesterday, I upgraded octorpki to v1.2.0 on each server, one by one. There have been no issues at all, which is puzzling. Last Friday, if you recall, I was getting disconnects on all the servers a few hours after the upgrade (the same version), which caused me to roll back the upgrades. This leads me to believe the issue last Friday was something unrelated and coincidental to the upgrade. I will keep monitoring the situation. Also, octorpki v1.2.0 seems to have improved memory management over the previous version, looking at the early results. |
I have a preference for restarting the validator on a clean cache directory. Could be a cause.
I made a change to how it handles file paths; it definitely improved the usage. Our internal validator shows around 2 GB of RAM during validation and <300 MB the rest of the time. |
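For reference, a clean-cache restart amounts to something like the following; the service name and cache path are hypothetical and depend on how octorpki is deployed:

$ systemctl stop octorpki          # hypothetical unit name
$ rm -rf /var/cache/octorpki/*     # hypothetical cache location; use your installation's cache directory
$ systemctl start octorpki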
@skakee are you still experiencing the issue or should we close this ticket? |
FYI this is here: https://github.com/lukastribus/rtrcheck |
This morning we noticed that one of the ROAs was wrong. When I pulled output.json from octorpki (/output.json), it had the correct value. However, when I ran rtrdump, it showed a different ASN value for the prefix. Restarting the gortr process did fix it; sending SIGHUP did not. The issue was noticed at 8:00 AM and persisted until 10:45, when I restarted the gortr process. Please note that "-refresh" is set to 120.
Can you comment on what could have been the possible problem and how it could have been addressed short of restarting the process?
Thanks!
/opt/cloudflare/go/bin/gortr -metrics.addr :8281 -ssh.bind :8282 -ssh.key /opt/cloudflare/private.pem -ssh.method.password=true -ssh.auth.user REDACTED -ssh.auth.password REDACTED -verify=false -cache http://localhost:8080/output.json -refresh=120 -bind
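The octorpki side of the check described above can be reproduced against its HTTP endpoint. A sketch with jq, assuming the standard roas array and a placeholder prefix:

# 192.0.2.0/24 is a placeholder; substitute the affected prefix
$ curl --silent http://localhost:8080/output.json | jq '.roas[] | select(.prefix == "192.0.2.0/24")'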