As an OEM software vendor providing technology to integrators and carriers, we treat failsafe systems as our lifeblood. This gives us some unique insights into how to achieve high uptime in reality. Traditional strategies for High Availability (HA) revolve around redundancy at the process, VM and physical infrastructure levels, to minimise both the risk of an outage and the recovery time when one occurs. But how often does the server/virtualisation stack actually fail? Let's take some examples from our own experience.
Sytel has around 30 rackmount servers in the computer room at our R&D labs. We manage them well and over the last 15 years we have had:
Downtime: virtually zero. That may reflect good management on our part, but it is really an indication of what can be achieved with hardware.
On the software side, yes, you can set up services and VMs to fail over to a redundant backup. But unless your contact center software stack does this for you automatically, the management overhead of setup and the likelihood of misconfiguration mean that virtualisation-based HA is just a fig leaf.
There is another reason why the holy grail of 5 nines uptime can be difficult to achieve with software. If the software itself is well-proven and reliable, the biggest sources of real-world failure in large-scale contact center systems are:
HA solutions per se don't solve these problems. But there are three things you can do, in particular, to mitigate the problems that arise here:
If you ensure your platform does these three things (and destruction test it to make sure that it does!), you will get much closer to 5 nines uptime than you would by just implementing redundant services, virtualisation-based failover and redundant hardware.
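To put 5 nines in perspective, here is a back-of-envelope sketch in Python of how little downtime each availability level allows per year. The function name and the 365.25-day year are illustrative assumptions, not part of any product.

```python
def downtime_minutes_per_year(nines: int, days_per_year: float = 365.25) -> float:
    """Downtime budget per year, in minutes, for an availability of N nines.

    Hypothetical helper for illustration: 3 nines means 99.9% uptime,
    5 nines means 99.999% uptime, and so on.
    """
    unavailability = 10.0 ** -nines          # e.g. 5 nines -> 0.00001
    minutes_per_year = days_per_year * 24 * 60
    return unavailability * minutes_per_year


if __name__ == "__main__":
    # Print the yearly downtime budget for 2 to 5 nines.
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

At 5 nines the budget is a little over five minutes a year, so a single misconfigured failover can use up the whole allowance.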
It's not received wisdom, but it's definitely common sense!