Debunking a Myth Again: High Availability is Still Not Carrier Grade
Almost exactly a year ago, I wrote a post about the differences between “High Availability” and “Carrier Grade reliability” as applied to telecom networks.
A fascinating white paper has just been published that explores this subject in detail, so this seemed like an appropriate time to revisit the debate and point interested readers to this new, in-depth analysis.
In the last twelve months, we’ve seen enormous progress around Network Functions Virtualization (NFV). Service providers have moved beyond their initial Proof-of-Concepts (PoCs) and have started deploying NFV in use cases where the technology is ready and the business benefits are clear, like virtualized Customer Premise Equipment (vCPE) functions. At the same time, the ETSI NFV initiative has wrapped up phase 1, which was all about defining architectural requirements, and has moved on to phase 2, with working groups in place to develop detailed specifications and tackle the complex technical issues that must be resolved before highly reliable, standards-based implementations can be widely deployed. For the service providers that started the initiative, of course, the end goal of all this work remains top-line revenue growth, reduced operational costs and smiling CFOs.
At recent industry events like the whirlwind that is Mobile World Congress and the less frenetic setting of the NFV & SDN Summit, industry experts have debated at length the right way to guarantee the reliability and resiliency that are critical for telecom networks. There’s clear agreement in some areas and ongoing debate in others. The one point that everyone seems to agree on is that enterprise customers are not going to compromise on their expectations for reliability in the services they pay for. With traditional network infrastructure based on physical equipment, service providers have set a standard of five-nines (99.999%) reliability for the services they deliver, at least for the services and customers where this matters. They sign strict Service Level Agreements (SLAs) with their enterprise customers, guaranteeing five-nines reliability for a defined set of services, and these SLAs make them liable for significant financial penalties if that level of uptime is not maintained.
Even with today’s highly reliable physical infrastructure, service providers worldwide still experience enough service outages that, according to a recent Heavy Reading report, downtime costs them between 1% and 5% of their revenues, equating to around $15B annually across the industry.
(The average consumer, of course, doesn’t have the benefit of these stringent SLAs. You and I just curse, redial or reconnect and, if the problems are frequent enough, switch to another provider in the vain hope that they will do better. But that’s another story…)
The interesting debate is around how to maintain this level of service uptime over networks that are based on NFV.
One school of thought says that the solution is Application-Level High Availability (HA). This concept places the burden of ensuring service-level reliability on the applications themselves, which in an NFV implementation are the Virtual Network Functions (VNFs). If it’s achievable, it’s an attractive idea because it means that the underlying NFV Infrastructure (NFVI) could be based on a simple open-source or enterprise-grade platform.
Even though such platforms, designed for IT applications, typically only achieve three-nines (99.9%) reliability, that would be acceptable if the applications themselves could recover from any potential platform failures, power disruptions, network attacks, link failures and the like, while also maintaining their operation during server maintenance events.
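To see how wide the gap between three nines and the carrier grade targets really is, it helps to translate the percentages into downtime budgets. This short Python sketch (my own back-of-the-envelope illustration, not taken from the white paper) does the conversion:

```python
# Translate availability percentages into the downtime they allow
# over one year (365.25 days).

MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("three-nines", 0.999),
                            ("five-nines", 0.99999),
                            ("six-nines", 0.999999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:>11}: {downtime:8.2f} minutes of downtime per year")
```

Three nines allow almost nine hours of downtime per year; five nines allow just over five minutes; six nines allow barely half a minute.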
Unfortunately, Application-Level HA by itself doesn’t achieve these goals. No matter which of the standard HA configurations you choose (Active/Standby, Active/Active, N-Way Active with load balancing), it won’t be sufficient to ensure Carrier Grade reliability at the platform level.
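To make the limitation concrete, here is a deliberately simplified Active/Standby sketch in Python. The class and parameters are illustrative inventions, not from any standard or from the white paper; the point is simply that failure detection takes time, and an application-level monitor can only see what the application exposes:

```python
import time

# Illustrative parameters -- invented for this sketch, not from any spec.
HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats from the active node
MISSED_BEATS_LIMIT = 3     # missed beats tolerated before failover

class StandbyNode:
    """Detection side of a minimal Active/Standby pair."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.is_active = False

    def on_heartbeat(self):
        """Called whenever a heartbeat arrives from the active node."""
        self.last_heartbeat = time.monotonic()

    def check_peer(self):
        """Run periodically; promote ourselves if the peer has gone silent.

        The outage lasts at least as long as it takes to notice the
        failure: up to MISSED_BEATS_LIMIT * HEARTBEAT_INTERVAL seconds
        elapse here before recovery even begins.
        """
        silence = time.monotonic() - self.last_heartbeat
        if not self.is_active and silence > MISSED_BEATS_LIMIT * HEARTBEAT_INTERVAL:
            self.promote()

    def promote(self):
        """Take over service delivery from the failed peer.

        Note what this loop cannot see: faults below the application,
        such as a node whose OS still sends heartbeats while its NIC,
        storage or hypervisor is failing underneath it.
        """
        self.is_active = True
```

Even with an aggressive one-second heartbeat, each failure costs several seconds just to detect, and a five-nines budget of roughly five minutes per year is consumed after a handful of such events.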
In order to ensure five-nines availability for services delivered in an NFV implementation, you need a system that guarantees six-nines (99.9999%) uptime at the platform level, so that the platform can detect and recover from failures quickly enough to maintain operation of the services. This implies that the platform needs to deal with a wide range of disruptive events which cannot be addressed by the applications because they don’t have the right level of system awareness or platform management capability.
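The arithmetic behind this requirement is simple: a service is only up while both the platform and the application are up, so their availabilities multiply. A small sketch with illustrative figures (again my own, not from the paper):

```python
# A service is up only while the platform AND the application are both
# up, so end-to-end availability is roughly the product of the two
# (this ignores correlated failures, which only make things worse).

def service_availability(platform: float, application: float) -> float:
    return platform * application

# A flawless application cannot rescue a three-nines platform:
print(f"{service_availability(0.999, 1.0):.6f}")          # 0.999000
# A six-nines platform leaves the application almost the entire
# five-nines budget (~99.9991% is enough to land on 99.999% overall):
print(f"{service_availability(0.999999, 0.999991):.6f}")  # 0.999990
```

A three-nines platform caps the whole service at three nines no matter how good the VNF is; only a platform comfortably above the five-nines target leaves the application a workable share of the downtime budget.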
From a business perspective, this is a critical concept. NFV is supposed to improve top-line revenues for service providers, but if we deploy NFV on network infrastructure that isn’t reliable enough, revenues will actually suffer, because both SLA penalties and the costs of dealing with outages will increase. That’s not the way to get to the end goal of smiling CFOs that we mentioned earlier.
For anyone involved in architecting, developing or deploying any part of an end-to-end NFV solution, this new white paper “NFV: The Myth of Application Level HA” is required reading. It provides a detailed technical analysis of the tradeoffs between Application-Level HA and Carrier Grade platforms and gives us all a clear direction to follow.