
Support Playbook

The following documentation is intended to assist team members in supporting the solution, specifically during outages.

Supporting the solution requires Openshift expertise, and you will need access to the production environment.

Table of Contents

  1. Various UI Bugs
  2. Gateway Error
  3. Reports are not being sent out
  4. A subscriber is not receiving notifications
  5. A subscriber is not receiving reports
  6. Content is missing or shouldn't be included in a report
  7. PostgreSQL failure
  8. Kafka failure
  9. Elasticsearch failure

Various UI Bugs

Any user interface bugs will need to be reviewed through screenshots and/or shared screens to determine a way to replicate the issue. They can then be prioritized and addressed through a user story.

Gateway Error

When either the Editor or the Subscriber application is inaccessible due to gateway errors, it is most likely that the route in Openshift is failing, or Nginx is failing.

Go to the editor and/or subscriber DeploymentConfigs and view each pod and its logs to determine if they are running. If they are running without errors, do the same for the respective Nginx DeploymentConfigs, nginx-editor and nginx-subscriber. If those are also running without errors, review the Networking Routes in Openshift (editor-tls, subscriber-tls). If everything is running without errors, it is most likely an intermittent networking issue that will resolve itself in a few minutes; however, you should post an inquiry to the RocketChat channel devops-sos.
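For reference, the checks above might look something like the following with the oc CLI. The namespace and pod names are placeholders; the route names are taken from this page.

```bash
# Switch to the production namespace (name is a placeholder).
oc project <production-namespace>

# Check the editor and subscriber pods and review their logs.
oc get pods | grep -E 'editor|subscriber'
oc logs <editor-pod-name> --tail=100
oc logs <subscriber-pod-name> --tail=100

# If those are healthy, check the Nginx pods.
oc logs <nginx-editor-pod-name> --tail=100
oc logs <nginx-subscriber-pod-name> --tail=100

# Finally, review the routes.
oc get route editor-tls subscriber-tls
oc describe route editor-tls
```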

Gateway errors can also be thrown by the API. In this scenario the web applications will be working, the user will be logged in, but all requests to the API are throwing errors. View the Statefulset api pod(s) in Openshift to determine what is causing the error. It may be that the API is not able to connect to the database, Elasticsearch, or some other service.
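A quick way to review the API pods and watch for connection errors is sketched below; the label selector and pod names are assumptions based on the StatefulSet being named api.

```bash
# List the API pods managed by the StatefulSet (label selector is an assumption).
oc get pods -l app=api

# StatefulSet pods are numbered, e.g. api-0, api-1.
oc logs api-0 --tail=200

# Watch for connection errors to the database, Elasticsearch, or other services.
oc logs -f api-0 | grep -iE 'error|exception'
```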

Reports are not being sent out

There are two primary issues that can result in reports not being sent out. The first is if the Scheduler Service has gone offline. The second is if the report has errored out and the Reporting Service has given up attempts to resend.

Review the Scheduler Service pod logs to determine if it has errored out. Currently the Scheduler Service does not have an automatic process to start running again; a catastrophic failure where the pod dies will trigger an automatic restart, but a handled error will not. To fix these common issues, simply kill the pod and it will restart automatically, picking up where it left off. If it continues to fail, review the logs to determine what the repeating issue is.
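Killing the pod can be done from the oc CLI; the pod name below is a placeholder.

```bash
# Find the Scheduler Service pod and review its logs for repeating errors.
oc get pods | grep scheduler
oc logs <scheduler-pod-name> --tail=200

# Deleting the pod forces a restart; it will pick up where it left off.
oc delete pod <scheduler-pod-name>
```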

Review the Reporting Service pod logs to determine if there are repeating errors. A failed report should be added back to the Kafka queue to be retried. However, it will only retry a configured number of times before giving up.

A subscriber is not receiving notifications

Confirm that the subscriber is configured to receive the specified notifications. This is done through the Editor Notification Administration pages, and the Product Administration pages. If they are, then review the Notification Dashboard to see if any notifications were sent out and failed to be sent to the user.

Check if the specified notifications have been sent out to any subscribers. The Notification Dashboard can be used for this, or make a connection to the production database and query the notification_instance table.
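If you query the database directly, something like the following can be used. The table name comes from this page; the connection details and the ordering column are assumptions and should be confirmed against the actual schema.

```bash
# Port-forward to the primary PostgreSQL pod in a separate terminal
# (pod name is a placeholder).
oc port-forward <postgres-primary-pod> 5432:5432

# Query recent notification instances; the created_on column is an assumption.
psql -h localhost -U <user> -d <database> \
  -c "SELECT * FROM notification_instance ORDER BY created_on DESC LIMIT 20;"
```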

A common issue is when the notification filter has a value that results in no content being sent out.

Note that if a subscriber is configured to receive emails as BCC or CC it will mean that the email must be sent TO the owner of the report. This makes it difficult to determine if these emails were ever sent to the subscriber.

A subscriber is not receiving reports

Confirm that the subscriber is configured to receive the specified reports. This is done through the Editor Report Administration pages, and the Product Administration pages. If they are, then review the Report Dashboard to see if any reports were sent out and failed to be sent to the user. The Report Dashboard only shows the most recent reports sent out. You can also make a connection to the production database and query the report_instance and the user_report_instance tables.
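A direct database check might look like the sketch below. The table names come from this page; the connection details, columns, and ordering are assumptions and should be confirmed against the actual schema.

```bash
# Query the most recent report instances and their per-user send status.
psql -h localhost -U <user> -d <database> <<'SQL'
SELECT * FROM report_instance ORDER BY id DESC LIMIT 20;
SELECT * FROM user_report_instance ORDER BY report_instance_id DESC LIMIT 50;
SQL
```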

A common issue is that when a report fails multiple times, it may not be able to send an email to some or all of the subscribers. Review the Reporting Service pod logs to determine what the error is. If it was an intermittent issue, you can use the Subscriber app to impersonate the owner of the report and resend it. It will pick up where it left off and send to anyone who hasn't received it yet.

Note that if a subscriber is configured to receive emails as BCC or CC it will mean that the email must be sent TO the owner of the report. This makes it difficult to determine if these emails were ever sent to the subscriber.

Content is missing or shouldn't be included in a report

The most common reason for this is that the filters (saved searches) used to populate the report sections are not correct. The Subscriber app has the best UI for viewing and debugging reports; however, the Editor Report Admin pages provide a few extra tools that can be useful (specifically the Instances tab).

Reports are often linked to the prior instance. A report instance is a new copy of the report. A common configuration is to not include any content found in the prior report instance. If for some reason the report has been generated multiple times in a short period of time, it can result in unexpected content, most often including content that was in a prior historical report. There is no 'fix' for this because the system will only ever look back one instance.

PostgreSQL failure

Technically the database should continue working even if there is only one node left running; however, there are sometimes performance issues when this occurs. The following failure scenarios are covered below:

  1. CrunchyDB repo volume has run out of space
  2. CrunchyDB node volume has run out of space
  3. Openshift has failed and all volumes have become corrupted
  4. Database has been corrupted

CrunchyDB repo volume has run out of space

Regrettably there is no way to prevent this issue. At times the repo will randomly and very quickly fill up with logs. All that can be done, if possible, is to keep increasing the size of the volume. The challenge with this approach is that there is limited space allocated/available, and once it's assigned you cannot reduce it.
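Increasing the size of the volume can be done by patching the PVC, as sketched below. The PVC name and new size are placeholders, and expansion only works if the storage class allows it; remember the size can never be reduced afterwards.

```bash
# Find the pgBackRest repo PVC and check its current size and usage.
oc get pvc | grep repo

# Request a larger size for the volume (name and size are placeholders).
oc patch pvc <repo-pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```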

CrunchyDB node volume has run out of space

This should not occur if support staff are receiving emails about volume usage percentage, as the solution is simply to increase the size of the volume before it fills. If it does occur, increase the volume size and the issue should resolve itself eventually.

Openshift has failed and all volumes have become corrupted

This is the worst case scenario. Stop the whole solution (every service pod). The repo volume is a backed-up volume; request the assistance of the Exchange Lab and they can recover that volume from a prior backup. Then restore the database from a backed-up copy. Once the database has been restored, determine if the CrunchyDB cluster is working again. If so, start all of the services that you stopped. Depending on the timing of all the issues, there will most likely be lost data.
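Stopping the whole solution is typically done by scaling the workloads to zero. The resource types and names below are assumptions about this environment; the CrunchyDB cluster itself is managed by the Crunchy operator and should be handled separately.

```bash
# Scale the application and service workloads down to zero
# (names and resource types are assumptions).
oc scale dc/editor dc/subscriber --replicas=0
oc scale statefulset/api --replicas=0
oc scale deployment/scheduler deployment/reporting --replicas=0

# Once the volume and database have been restored and CrunchyDB is healthy,
# scale everything back up.
oc scale dc/editor dc/subscriber --replicas=1
oc scale statefulset/api --replicas=1
oc scale deployment/scheduler deployment/reporting --replicas=1
```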

Database has been corrupted

Stop the whole solution (every service pod). The database should be automatically backed up every night. Restore the database from a prior copy. Once the database has been restored, determine if the CrunchyDB cluster is working again. If so, start all of the services that you stopped. Depending on the timing of all the issues, there will most likely be lost data.

Kafka failure

There are four permanent volumes for Kafka. Theoretically it can recover as long as there is one remaining; however, the system cannot be running when there are fewer than three. If all four permanent volumes are lost due to a catastrophic Openshift failure, there is nothing that can be done to restore them. Stop the whole solution (every service pod). Deploy each Kafka pod one at a time, waiting until the pod is running before starting the next. Use the helper scripts in the root repo to add topics to Kafka (make kafka-update).
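Bringing the brokers back one at a time might look like the sketch below; the StatefulSet name is an assumption.

```bash
# Bring the Kafka brokers back one at a time, waiting for each to be Ready.
oc scale statefulset/kafka --replicas=1
oc get pods -w | grep kafka
oc scale statefulset/kafka --replicas=2
# ...repeat until all brokers are back...

# Recreate the topics using the helper script in the root of the repo.
make kafka-update
```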

The loss of Kafka isn't always catastrophic. Generally speaking, if there was little or no lag before the failure, then very little will be lost; perhaps some content, missed alerts, and missed reports. Currently, the primary lag is only related to the TNO History migration. The bigger issue is the time it takes to get everything working again.

Another common failure related to Kafka is caused when Openshift has an outage and all pods are killed at the same time. This results in the cluster having syncing issues that need to be manually resolved.
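One way to spot syncing issues is to check for under-replicated or unavailable partitions from inside a broker pod. The pod name and the path to the Kafka CLI scripts are assumptions.

```bash
# Check for partitions that are under-replicated.
oc exec kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Check for partitions with no available leader.
oc exec kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions
```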

Elasticsearch failure

There are three permanent volumes for Elasticsearch. Theoretically it can recover as long as there is one remaining; it takes Elasticsearch a few minutes to switch the master node when it fails. If all three permanent volumes are lost due to a catastrophic Openshift failure, there is nothing that can be done to restore them. Stop the whole solution (every service pod). Deploy each Elasticsearch pod one at a time, waiting until the pod is running before starting the next. Use the helper scripts in the root repo to add indexes to Elasticsearch (make elastic-update). You will then need to re-index all content from the database. There is an endpoint in the API Services area called ReindexAsync; the path is /api/services/content/reindex. Changes will need to be made to the endpoint to support re-indexing the whole database, as it presently only does 500 items at a time. This process could take days to complete to get all content back into Elasticsearch. Re-index only the last few days of content first so that the solution can be turned back on.
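The recovery commands might look something like the sketch below. The API host, HTTP method, and authentication are assumptions and should be confirmed against the API before use.

```bash
# Recreate the Elasticsearch indexes using the helper script in the root of the repo.
make elastic-update

# Trigger a re-index of recent content through the API
# (method and auth are assumptions).
curl -X POST "https://<api-host>/api/services/content/reindex" \
  -H "Authorization: Bearer <token>"
```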