How Not To Wake Up At 2 A.M.

Autoscaling is the key to resilient carrier-grade virtualized network services

Network support staff dread the 2am phone call. We’re all very familiar with service degradation and outages caused by hardware failures, software bugs, or configuration errors. And we’re also well aware of off-hours maintenance windows that necessitate middle-of-the-night trips to the data center to plug in line cards and then watch and wait to make sure nothing went wrong. Trying to scale a network service this way in the new era of cloud and virtualized network services is both unnecessary and unworkable. But without the ability to autoscale a virtualized service across multiple dimensions, the 2am wake-ups will never end, and we’re not much better off than when we were sending technicians to open data center racks, lift floor tiles, and work on chassis. In fact, it could be much worse.

Autoscaling is one of the key promises of a virtualized service. Whereas a physical chassis requires a person to wake up at 2am to plug in cards, run cables, and reconfigure devices, a virtualized infrastructure can deploy and configure resources and networks automatically. Autoscaling can be used to solve a variety of SLA-related problems, from service degradation caused by sudden increases in demand to hitless recovery during unplanned outages. Enabling autoscaling makes Virtual Network Services more web-like – more available, more responsive, and easier to manage.

Scaling can be effected at several levels:

  • Adding resources (e.g. vCPUs, memory, VMs) to VNFs, known as VNF scaling
  • Adding VNFs to a running Network Service, known as Network Service scaling

The best method for service providers is to scale the VNF, because it leaves the network (IP addresses, services provided by that IP address, DNS), the VNF configuration, and VNF management largely unchanged. However, this method is rarely used because scaling a VNF is difficult. A fully autoscalable VNF requires:

  • The VNF to be decomposed into individually scalable micro-components (much like microservices) that can be scaled in/out (and possibly up/down)
  • A distributed, fault-tolerant, wire-speed load distribution method for high-bandwidth network functions – in the physical world, most network function suppliers use a load balancer for this purpose
  • A way of rebalancing the load across the newly added resources so there are no “hot spots” where, for example, one VM is handling 90% of the load
  • A way to gracefully shut down components in order to scale in (or scale down) the VNF
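
The last two requirements can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the `Component` class, the session counter, and the 50% hot-spot threshold are all hypothetical, and a real VNF would migrate live sessions rather than reassign counters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """Hypothetical scalable component of a VNF (for example, one VM)."""
    name: str
    sessions: int = 0
    draining: bool = False


def hot_spots(components: List[Component], threshold: float = 0.5) -> List[str]:
    """Return names of components carrying more than `threshold` of the total load."""
    total = sum(c.sessions for c in components)
    if total == 0:
        return []
    return [c.name for c in components if c.sessions / total > threshold]


def rebalance(components: List[Component]) -> None:
    """Spread sessions as evenly as possible across components that are not draining."""
    active = [c for c in components if not c.draining]
    if not active:
        return
    total = sum(c.sessions for c in active)
    share, extra = divmod(total, len(active))
    for i, c in enumerate(active):
        # The first `extra` components take one additional session each.
        c.sessions = share + (1 if i < extra else 0)


def begin_scale_in(component: Component) -> None:
    """Mark a component as draining so it stops receiving new sessions."""
    component.draining = True


def can_remove(component: Component) -> bool:
    """A draining component may be shut down once its last session has ended."""
    return component.draining and component.sessions == 0
```

With three components holding 90, 10, and 0 sessions, `hot_spots` flags the first one; after `rebalance` the load is 34, 33, and 33. A component marked with `begin_scale_in` only becomes removable once its session count reaches zero.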

Re-architecting VNFs in this manner requires significant investment from the VNF supplier; consequently, it is available for only a limited set of VNFs.

While scaling the VNF is the best method for Service Providers (SPs), the easiest method for VNF suppliers is to scale a Network Service (NS) by adding more VNFs. This allows the VNF supplier to create simple, static VNFs, but it shifts the burden of scaling to the SP. Every time a new VNF is created, the SP needs to take into account:

  • The IP connectivity for the new VNF
  • What services are provided by the VNF (e.g. firewall, video cache)
  • The management of the new VNF
  • Configuration of the new VNF
  • Reconfiguration of other VNFs within the NS
  • How to load balance between the VNFs; this typically requires an explicit load balancer VNF, which results in another VNF to manage, scale, and make fault tolerant
  • Which sessions are managed by the new VNF (rebalancing)
  • How to shut down a VNF to scale in the network service
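
One way to keep such a checklist from turning into fragile glue is to treat it as an ordered workflow in which every step has a matching undo action. The Python sketch below is illustrative only: `run_scale_out` and the step names in the usage note are hypothetical, and a real orchestrator would call into its own networking and configuration systems.

```python
from typing import Callable, List, Tuple

# Each step is (name, apply, undo); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_scale_out(steps: List[Step]) -> List[str]:
    """Run the steps in order; if one fails, undo the completed steps in reverse."""
    done: List[Step] = []
    log: List[str] = []
    for step in steps:
        name, apply, _undo = step
        try:
            apply()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            # Roll back so the Network Service is not left half-scaled.
            for undo_name, _apply, undo in reversed(done):
                undo()
                log.append(f"rolled back: {undo_name}")
            return log
        done.append(step)
        log.append(f"done: {name}")
    return log
```

A caller might pass steps such as "allocate IP", "configure VNF", and "update load balancer". If the last one raises, the first two are undone in reverse order and the returned log records each action.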

It is important to understand that autoscaling is more than just “spin up a new VM and load a new image.” It also requires more than custom scripts that duct-tape VNFs together, as these are costly to implement, fragile, and typically valid for only one use case. As shown above, to properly scale a VNF or an NS, you also need to manage and scale the networking, and to manage life cycle events of the VNF such as rebalancing sessions across the VNF or busying out sessions during scale in / scale down. All of this must be tied into NFV management and orchestration (MANO), as MANO is responsible for orchestrating NS and VNF scaling actions.

For these reasons, RIFT.io has created a framework specifically to manage the complex interactions necessary during scaling and healing. The RIFT.ware Autoscaler Framework even enables VNFs that do not support scaling to have autoscaling capability.

The RIFT.ware Autoscaler Framework provides the following capabilities for MANO-enabled autoscaling:

  1. Flexible triggering events: descriptor based, script based, and through integration with third party OSS and Service Assurance systems
  2. Life cycle management of scaling groups (groups of VMs or VNFs that need to be scaled together)
  3. Integration with network orchestration to provide load distribution via the networking layer
  4. Built-in first-level autoscaler for simple load distribution based on the IP 5-tuple
  5. Configurable life cycle event notification to support VNF, EMS, and OSS pre- and post-scaling actions
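
To make the 5-tuple idea in item 4 concrete, here is a generic Python sketch of hash-based flow distribution. It is not RIFT.ware code: `FiveTuple` and `pick_member` are hypothetical names, and the choice of SHA-256 with a modulo is only an assumption made for illustration.

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    """The classic IP 5-tuple that identifies a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IANA protocol number, e.g. 6 for TCP, 17 for UDP


def pick_member(flow: FiveTuple, members: Sequence[str]) -> str:
    """Map a flow to one member of a scaling group by hashing its 5-tuple."""
    if not members:
        raise ValueError("scaling group has no members")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(members)
    return members[index]
```

Because the hash depends only on the 5-tuple, every packet of a given flow lands on the same member for as long as the group is unchanged. With a plain modulo, adding or removing a member remaps many existing flows, which is one reason session rebalancing and busy-out need to be coordinated with scaling.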

Using RIFT.ware, Service Providers can create complex use cases such as:

  • Automated instantiation of new VNFs in response to degradation of user experience
  • Automated recovery during loss of VNF or scaling group
  • Proactive scaling based on capacity planning trends
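
As a rough illustration of the last use case, the Python sketch below fits a straight line to recent utilization samples and estimates how many sampling intervals remain before capacity is reached. The function name, the evenly spaced samples, and the linear-trend assumption are all hypothetical; production capacity planning would normally use richer models.

```python
from typing import Optional, Sequence


def intervals_until_capacity(samples: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate the intervals left before a linear trend through `samples` hits `capacity`.

    Samples are assumed to be evenly spaced, oldest first. Returns None when
    there are too few samples or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Ordinary least-squares fit of y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (capacity - intercept) / slope  # sample index where the trend reaches capacity
    return max(crossing - (n - 1), 0.0)
```

For utilization samples of 50, 55, 60, 65, and 70 percent with a capacity of 90 percent, the fitted slope is 5 points per interval and the function returns 4.0. An orchestrator could trigger a scale-out when that estimate drops below the time it takes to instantiate a new VNF.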

We’ll never eliminate 2am phone calls entirely. Some things still break and need human hands to fix. Even when all the logic and critical functionality is in software, there are still spinning disks, cables, and sometimes buggy software. But if we’re really going to move to a software-defined, fully virtualized network architecture, VNF and Network Service scalability MUST be inherent and automatic. Instead of a 2am wake-up, scaling becomes an automated morning report to be read over coffee by a well-rested network support crew.