Yesterday at 10:26 UTC we identified a problem with our main database cluster. A dedicated team was assigned to the incident and worked to resolve it as quickly as possible. Given the severity of the impact, we did not provide an estimated resolution time at that point; we needed all developers' hands on deck. We identified the source of the problem and deployed a fix as quickly as we could. A direct consequence of the issue was a separate incident in which our webhooks queue became overloaded, which caused degraded service for the next couple of hours. The situation was fully stabilised at 16:30 UTC.
The issue affected all agents trying to log in to the application, as well as new website visitors trying to start a chat.
To protect against problems of this kind, we have introduced additional verification steps for all database deployments and added new metrics to improve our infrastructure monitoring. Further safety measures and procedures have been added to our roadmap as well.
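For illustration, a queue-depth metric of the kind mentioned above might look like the following minimal sketch. The queue name, metric name, port, and the Redis-backed queue are assumptions made for the example, not a description of our actual infrastructure:

```python
from prometheus_client import Gauge, start_http_server
import redis
import time

# Hypothetical example: expose the depth of a Redis-backed webhooks
# queue as a Prometheus gauge, so an alert can fire before the queue
# overloads. All names here are illustrative assumptions.
WEBHOOK_QUEUE_DEPTH = Gauge(
    "webhook_queue_depth",
    "Number of webhook deliveries waiting in the queue",
)

def monitor(queue_name: str = "webhooks", interval_s: float = 15.0) -> None:
    r = redis.Redis()
    while True:
        # Report the current backlog size on every polling interval.
        WEBHOOK_QUEUE_DEPTH.set(r.llen(queue_name))
        time.sleep(interval_s)

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at :8000/metrics
    monitor()
```

With a gauge like this in place, an alerting rule can page the on-call engineer when the backlog stays above a chosen threshold, rather than waiting for customer-visible degradation.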
We understand that this issue had a significant impact on your businesses, and we are very sorry for the disruption we caused.