Database issue

Incident Report for LiveChat

Postmortem

On the 9th of January, the LiveChat application was not responding for about an hour and a half, from 9:22 UTC to 10:54 UTC. Users were not able to log in, respond to chats, use the API, or contact our support team about the issue via chat.

This was caused by a problematic query to the database server. The query blocked all operations on the DB cluster and caused a network partition between all DB nodes. The mechanisms that call the database were working properly, but the database itself, the endpoint of these calls, was down.

We are deeply sorry for the whole situation. It should not have happened, and this is not the way we want to provide our service. We’ve already taken several steps to avoid it in the future. What we’ll do next:

We will increase network throughput between the DB servers. This will be implemented as soon as possible and will shorten the whole disaster recovery process by 10 minutes.
We will put sanity checks on the server so that no further bad queries can impact the product.
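
To illustrate the idea of a query sanity check, here is a minimal Python sketch. It is not the check deployed on our servers: the function name sanity_check, the limit MAX_QUERY_LENGTH, and the individual rules are all assumptions chosen for this example, and the check is purely textual (it does not parse SQL, so a semicolon inside a string literal would be flagged as well).

```python
import re

# Hypothetical limit; the report does not specify any concrete thresholds.
MAX_QUERY_LENGTH = 10_000


def sanity_check(sql: str) -> list[str]:
    """Return the reasons to reject a query; an empty list means it passes.

    All rules below are illustrative assumptions, not the production checks.
    """
    problems = []
    # Drop surrounding whitespace and a single trailing semicolon.
    statement = sql.strip().rstrip(";").strip()
    upper = statement.upper()

    if not statement:
        return ["empty query"]
    if len(statement) > MAX_QUERY_LENGTH:
        problems.append("query too long")
    # Any remaining semicolon suggests several statements in one call.
    if ";" in statement:
        problems.append("multiple statements")
    # Writes without a WHERE clause touch every row of the table.
    if re.match(r"^(UPDATE|DELETE)\b", upper) and not re.search(r"\bWHERE\b", upper):
        problems.append("UPDATE/DELETE without WHERE")
    # Reads without a LIMIT can return an unbounded result set.
    if re.match(r"^SELECT\b", upper) and not re.search(r"\bLIMIT\s+\d+", upper):
        problems.append("SELECT without LIMIT")
    return problems


if __name__ == "__main__":
    for query in ("DELETE FROM chats", "SELECT * FROM chats WHERE id = 1 LIMIT 10;"):
        print(query, "->", sanity_check(query) or "ok")
```

A check like this would run before a query reaches the cluster, so a rejected query costs only a failed request instead of blocking the database.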

The negative impact of this situation is entirely on us. On the slightly brighter side, we were able to see our disaster recovery in action - previously, such an issue would have required restoring a backup, which would have taken 4 hours to load. This time, the crucial part of the recovery was completed in 25 minutes, and the incident confirmed that we can still make it faster.

We will take all necessary measures, not only to make sure this doesn’t happen again, but also to respond to potential reports even faster.

Posted Jan 11, 2019 - 18:06 CET

Resolved

The database server and all components are working properly now, and the issue has been resolved. Our admins will continue to monitor the service. If you have any questions related to this problem, please start a chat with us at www.livechatinc.com.

Posted Jan 09, 2019 - 14:16 CET

Monitoring

The servers have been restored, and we are now monitoring their performance. If you still cannot log into the application, please refresh it.

Posted Jan 09, 2019 - 12:11 CET

Identified

The issue has been identified, and we are doing everything we can to implement a fix as soon as possible and restore the database server.

Posted Jan 09, 2019 - 10:49 CET

Investigating

We've noticed a problem with our database. You might have trouble logging into the application or using its components. Our team is investigating the issue.

Posted Jan 09, 2019 - 10:34 CET

This incident affected: API, Agent apps, and Support chat on www.livechat.com.