On January 9th, the LiveChat application was unresponsive for about an hour and a half, from 9:22 UTC to 10:54 UTC. Users were unable to log in, respond, use the API, or contact our support team about the issue via chat.
The outage was caused by a problematic query sent to the database server. The query blocked all operations on the DB cluster and caused a network partition between all DB nodes. The mechanisms used to call the database were working properly, but the database itself - the endpoint of these calls - was down.
We are deeply sorry for the whole situation. It should not have happened, and this is not the way we want to provide our service. We’ve already taken several steps to avoid it in the future. What we’ll do next:
The negative impact you experienced is completely on us. On the slightly brighter side, we were able to see our disaster recovery in action. Previously, an issue like this would have required restoring a backup that took 4 hours to load. This time, the crucial part of the recovery was completed in 25 minutes, and the incident confirmed that we can still speed it up further.
We will take all necessary measures, not only to make sure this doesn’t happen again, but also to respond to potential reports even faster.