Autoscaling, Azure, WAP and System Center Part 2
This blog post series focuses on autoscaling from the perspective of Azure, System Center and WAP. In the first part of this series we discussed how autoscaling works in Azure. In this blog post we will discuss resources which are available to provide autoscaling in System Center.
Part 1: Autoscaling in Azure
Part 2: Autoscaling in System Center [This blog post]
Part 3: Autoscaling in WAP
Let’s start with why you would autoscale in System Center:
Autoscaling gives your application the ability to automatically adapt to the load placed on it. If you are running System Center on site, you already have the framework required to autoscale the systems you run on-prem.
So, why autoscale in System Center? The primary reason is the depth of knowledge available about when it makes sense to scale a system (whether scaling up, scaling down, or scaling out). In Azure IaaS and PaaS there are only a limited number of out-of-the-box metrics which you can use to determine when scaling up or down should occur. With System Center Operations Manager, an incredible number of performance counters are gathered out of the box. This allows for a better decision about when it makes sense to autoscale an application. As an example, it may make sense to scale out a web-based application when a synthetic transaction is taking longer than is acceptable and processor utilization on the systems running the application is over an acceptable threshold.
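The multi-metric decision described above can be sketched as a simple predicate. This is a minimal illustration only; the function name, thresholds, and metric inputs are hypothetical, not OpsMgr APIs:

```python
# Hypothetical sketch of the multi-metric scale-out decision described above.
# Thresholds and names are illustrative, not actual OpsMgr constructs.

def should_scale_out(synthetic_txn_seconds: float,
                     avg_cpu_percent: float,
                     txn_threshold: float = 5.0,
                     cpu_threshold: float = 80.0) -> bool:
    """Scale out only when BOTH the user-facing synthetic transaction is
    slow AND the hosts are actually busy, so we do not add capacity for
    slowness caused by something other than load (e.g. a backend fault)."""
    return (synthetic_txn_seconds > txn_threshold
            and avg_cpu_percent > cpu_threshold)

print(should_scale_out(7.2, 91.0))  # slow AND busy -> True
print(should_scale_out(7.2, 40.0))  # slow but idle -> False (not a load problem)
```

Requiring both signals is the point: a single metric such as CPU alone would trigger scale-out even when users are not actually experiencing slow transactions.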
How to autoscale in System Center with Hyper-V:
Pete Zerger and I co-authored a whitepaper on this topic, specifically focused on how this works in Hyper-V (ok, realistically Pete wrote all of the good stuff on automation, so the big kudos go to him). We also did a webinar on this topic. These are available at:
How to autoscale in System Center with VMware:
One of my colleagues has been developing an integration that lets System Center communicate with VMware. If/when he provides details on this integration I’ll add a link here to reference it.
Scaling up versus Scaling out:
I had some excellent feedback from colleagues on the whitepaper on autoscaling in Hyper-V discussed above. One of the points from my friend Tony was to better clarify why you would want to scale up, the service outages that scaling up can cause, and the limits to scaling up.
When is scaling up commonly used? Scaling up is typically used for load that you can predict and for monolithic applications that cannot easily be distributed across multiple instances.
Scaling up results in a service outage: In either case you configure thresholds for metrics, and the servers are monitored against those thresholds. Example metrics may include CPU utilization, memory utilization, disk utilization, or the time to perform a synthetic transaction against the application. In a scale-up situation, when a metric surpasses the defined threshold, additional resources can be added to the VM (or to the instance in PaaS). Adding resources to a VM will most likely require the VM to be rebooted (there is at least one situation where this is not required: if the VM is running on Hyper-V and using dynamic memory, additional memory can be added without rebooting the VM). Scaling out by adding additional VMs does not require a reboot, so that approach avoids the service disruption which occurs during the reboot.
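The reboot trade-off above can be sketched as a small decision helper. This is an illustrative sketch only, assuming the Hyper-V dynamic memory behavior just described; the class and function names are hypothetical, not a Virtual Machine Manager API:

```python
# Illustrative sketch: choose scale-up vs scale-out for a memory shortfall,
# based on whether adding RAM would force a reboot (hypothetical names).
from dataclasses import dataclass

@dataclass
class VmConfig:
    memory_gb: int
    dynamic_memory: bool   # Hyper-V dynamic memory can grow without a reboot
    max_memory_gb: int     # dynamic memory ceiling configured on the VM

def plan_memory_increase(vm: VmConfig, extra_gb: int) -> str:
    """Return which action avoids a service outage for this VM."""
    if vm.dynamic_memory and vm.memory_gb + extra_gb <= vm.max_memory_gb:
        return "scale-up (dynamic memory, no reboot)"
    # A static-memory VM (or one at its dynamic ceiling) must be rebooted
    # to add RAM, so scaling out an extra instance avoids the disruption.
    return "scale-out (scale-up would require a reboot)"

print(plan_memory_increase(VmConfig(8, True, 32), 4))
print(plan_memory_increase(VmConfig(8, False, 32), 4))
```

The same shape of logic applies to other resources: whenever the scale-up path implies an outage and the application supports multiple instances, scale-out is the safer default.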
Scaling up is limited by the application’s ability to use resources: As a starting point remember that scaling up is adding resources to an existing VM or instance to meet load demand. An application can only utilize the resources that it is coded to use (as an example, many programs will not utilize multiple processors effectively).
Challenges in monitoring scaled out systems:
I also had some excellent feedback from a reader (Wilson W) with regards to the challenges of monitoring both IaaS and PaaS applications in System Center. In OpsMgr we have agents installed on the various VMs in the environment, so if you have an application whose virtual machines are not always running, how do you effectively monitor them? Let’s take an example where I have an application which has been designed to scale out to 10 VMs but under normal load only 4 VMs are running. As far as Operations Manager is concerned there are 10 VMs, so we get heartbeat failure alerts for 6 of these systems even when the application is running correctly on 4 nodes. This is "less than optimal". Some of the options to handle this which he discussed included:
- PowerShell-initiated maintenance mode: Use PowerShell to put these VMs into maintenance mode when they are shut down correctly as part of the scale-down process.
- Orchestrator/SMA: Use Orchestrator or SMA to initiate workflows that handle both scale-up and scale-down situations, including putting the scaled-down servers into maintenance mode and keeping them there.
- Custom management pack: My gut feeling is that we would want to monitor this type of application with a custom management pack. We place the servers in a distributed application with a custom health rollup which ignores the grey state. As long as at least 4 VMs are online, the health of the distributed application is green. If fewer than 4 VMs are online, an alert is generated. Each of the VMs in this distributed application would also need to be added to a group which suppresses both the heartbeat and communication alerts that would normally occur when a VM goes offline.
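The custom rollup idea from the last bullet can be sketched as an N-of-M health check. This is a minimal sketch of the logic only; the state names and function are illustrative, not the actual OpsMgr management pack model:

```python
# Minimal sketch of the custom health-rollup idea: the distributed
# application stays green while at least `min_online` VMs report healthy,
# and grey VMs (deliberately scaled down) are ignored rather than treated
# as failures. States and names are illustrative, not OpsMgr constructs.

def rollup_health(vm_states: dict, min_online: int = 4) -> str:
    online = sum(1 for s in vm_states.values() if s == "healthy")
    # "grey" nodes are scaled-down members: neither healthy nor failed,
    # so they simply do not count toward (or against) the rollup.
    return "green" if online >= min_online else "red"

states = {f"vm{i}": "healthy" for i in range(1, 5)}      # 4 running
states.update({f"vm{i}": "grey" for i in range(5, 11)})  # 6 scaled down
print(rollup_health(states))                    # green: 4 healthy meets the minimum
print(rollup_health({"vm1": "healthy"}))        # red: below the minimum
```

The key design choice is that grey is a third state, distinct from failed: the rollup alerts on "too few healthy members", not on "some members offline", which matches the 10-VM/4-running example above.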
Developing an approach to effectively monitor applications which are designed to scale out and scale down is an interesting challenge. I hope to have some time to work on it and, at some point, put together a blog post with the solution we build for this requirement.