Amazon RDS Aurora writer failover questions/looking for advice #3121
Unanswered
cwatton-wolf
asked this question in
Q&A
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hi everyone, looking for some advice from people I hope have similar use cases or at least have more experience than me as I'm fairly weak at this subject.
Currently we run a large number of microservices connected to an Aurora V3 (mysql8) Mysql database cluster with single writer and multiple readers. These microservices utilize pools for connecting to the cluster and connect using the AWS FQDN for the cluster writer. The issues we're having occur during a failover event of the writer where the connections in the pool stay connected to an instance that has now become a reader and don't seem to cycle back out and reconnect. This shows as the following error
Error: The MySQL server is running with the --read-only option so it cannot execute this statement
This has caused us to write a fairly hacky function that runs on a set interval of 30s to get a connection from the pool and call connection.destroy() on it, forcing a new connection to be created when needed. Slowly this cycles out the majority of the faulty connections and restores the service to a healthy state of being able to write to the database again without having to manual restart our services. Unfortunately this also creates a high number of aborted clients on the server and spams the server logs.
We'd like to move away from this solution and implement something more robust and standardized but can't seem to find the right approach. Is it to just handle the errors in code and destroy the connections/pool in the event this occurs or is it better handle with reducing the number of max idle connections on the pool or some other pool/connection setting I'm not too familiar with.
I can see this has been brought up in a couple of issues but nothing has really be agreed upon as a solution. #1103
#2050
Any recommendations or insight into what the right approach is would be hugely appreciated.
EDIT:
Doing some further investigation today we found some weird behaviour differences between just using pool.query and using the pool.getconnection, connection.query, connection.realease flow that we didn't expect during a failover scenario. When calling the methods manually during a failover event where the original writer becomes a reader a few queries will error but ultimately the connections in the pool all cycle out and successfully reconnect to the new writer. When using pool.query the connections never stop trying to connect to the old writer and are stuck in a bad state until the application is restarted.
Beta Was this translation helpful? Give feedback.
All reactions