Using Distributed Applications to generate actionable alerting
One of the biggest tools in the Operations Manager tool belt is the Distributed Application (DA). DAs can be used for a variety of purposes, but in this blog article I’m going to focus on a single simple ability of a distributed application and how it can be used to provide more actionable alerts. I’ve created a simple DA below in the authoring console which takes three websites (all on this server, just to show an example) and combines them into a single MyWebsite Web Application.
In the Monitoring pane of the Operations Manager console we can see how this appears within the Distributed Applications / Diagram view. Since one of the websites is offline (and Red/Critical as a result), the higher-level website icon turns red as well, which in turn changes the Distributed Application (MyWebsite) to red. This is the default rollup behavior of Operations Manager.
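As a rough illustration of that default behavior, the sketch below models a worst-state rollup in Python. It is a simplified model and not Operations Manager code: the `Health` enum and the `worst_state` function are hypothetical names introduced only for this example.

```python
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical health states, ordered from best to worst."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def worst_state(states):
    """Model of a worst-state rollup: the parent takes its worst member's state."""
    return max(states, default=Health.HEALTHY)


# Three websites, one of which is offline (critical).
sites = [Health.HEALTHY, Health.HEALTHY, Health.CRITICAL]

# The component group rolls up from the sites, and the DA rolls up from the group.
web_sites_group = worst_state(sites)
my_website_da = worst_state([web_sites_group])

print(web_sites_group.name, my_website_da.name)
```

In this model a single critical member is enough to turn every level above it critical, which matches what the diagram view shows.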
However, let’s take this example and consider what would happen if this were a production web farm. I really don’t want MyWebsite to go red unless multiple web servers are offline. I still want my developers to get an alert if one of the web servers is offline so that they can work on it, but in terms of actionable alerting I don’t want a critical alert sent to my operations group until the farm is significantly impacted. So my optimal design would make these changes:
1) Only change the health state to red if two of my three websites go offline
2) Create a critical alert if two of my three websites go offline
3) Create warning alerts on the loss of any of my three websites
We can do this within Operations Manager by changing the default behavior of the distributed application. As with other concepts in Operations Manager, we can do this via overrides. The difficulty is understanding where the overrides are created and why, which is what we’ll tackle in this article. If we open the distributed application in the diagram view and then right-click on the MyWebsite icon, we can open Health Explorer to see a different view of this DA:
The view above shows the pieces of the distributed application in a healthy condition.
The easiest way that I have found to achieve this process is to create the distributed application, cause it to go into a failure condition and create overrides as the error conditions occur. By failing the three websites, the Health Explorer view has changed to what is shown below:
To change the default behavior, I want to change the Web Site Component Group Health Rollup. The default is to roll up the Worst state of any monitor as shown below.
To change this, we create an override (for the object “MyWebsite Web Application Web Sites” in this case) to alter the behavior as shown below. The first step is to change the rollup health configuration to address the first change discussed above (only change the health state to red if two of my three websites go offline). For this example we have configured the web application to roll up health based upon the Worst state of a percentage option and set the Percentage to 50. With three servers, this approach changes the health state to critical only when more than one of the websites goes offline (as the percentage in the worst state would then be 66%).
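The arithmetic behind the percentage option can be sketched in Python. This is a simplified model under stated assumptions, not the product's implementation: it assumes the rollup takes the worst state among the healthiest 50% of members and that the member count is rounded up (1.5 of 3 becomes 2). `Health` and `percentage_rollup` are hypothetical names.

```python
import math
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical health states, ordered from best to worst."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def percentage_rollup(states, percentage):
    """Worst state among the best `percentage` percent of members.

    Assumption: the number of members considered is rounded up.
    """
    if not states:
        return Health.HEALTHY
    needed = max(1, math.ceil(len(states) * percentage / 100))
    best_first = sorted(states)      # healthiest members first
    return max(best_first[:needed])  # worst state within that subset


# Walk through zero to three offline websites with the Percentage set to 50.
results = []
for offline in range(4):
    states = [Health.CRITICAL] * offline + [Health.HEALTHY] * (3 - offline)
    results.append(percentage_rollup(states, 50))
    print(f"{offline} offline -> {results[-1].name}")
```

Under these assumptions the group stays healthy with one website offline and turns critical once two or more are offline, which is the first design goal.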
Since we are already making an override on this level, we will go ahead and tackle the second change discussed above (Create a critical alert if two of my three websites go offline) by changing the Generates Alert from False to True.
Now that we have made the first two changes to the default behavior, we can move on to the third one (create warning alerts on the loss of any of my three websites). To do this we need to create an override for each monitored website. The default behavior of each of these three websites is to generate a critical/medium alert if it is offline, as shown below:
For our websites we need to generate a warning instead of a critical alert, so we create an override (for the object) changing the alert severity to Warning as shown below:
As a reminder, we need to do this for each of the different websites in the distributed application (and not create the override in the default management pack of course). Graphically, our override changes and their location look like this:
Now that we have built it, we need to test it. To do this, we restart all of the websites and get the distributed application back to a green state.
Now, we stop one website and we receive a warning level alert that the website is down.
And our diagram view is updated to show that one of the websites is offline.
Next we stop a second website and we receive another warning level alert that the website is down. We also receive the critical alert from the MyWebsite Web Application (the name can be changed to make it more intuitive).
Our health rollup has worked as expected, and the alerts we wanted have been created as well! The critical alert could be routed to one group (Operations) and the warning alerts could be routed to another group (developers or web support). When the websites are brought back online our health state returns to green, and since the warning and critical alerts were generated by monitors, they are automatically closed when the monitors return to a healthy state.
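The alerting outcome of the test above can be summarized with a small Python model. It is only a sketch of the intended behavior, not how Operations Manager generates or routes alerts. The site names, the `Alert` class, the `ROUTES` table and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str
    severity: str


# Hypothetical routing table: which group receives which severity.
ROUTES = {"Critical": "Operations", "Warning": "Web Support"}


def alerts_for(offline_sites, critical_threshold=2):
    """One warning per offline site, plus one critical alert for the farm
    once the number of offline sites reaches the threshold."""
    alerts = [Alert(site, "Warning") for site in offline_sites]
    if len(offline_sites) >= critical_threshold:
        alerts.append(Alert("MyWebsite Web Application", "Critical"))
    return alerts


def route(alerts):
    """Group alert sources by the team that should receive them."""
    routed = {}
    for alert in alerts:
        routed.setdefault(ROUTES[alert.severity], []).append(alert.source)
    return routed


one_down = route(alerts_for(["Site1"]))
two_down = route(alerts_for(["Site1", "Site2"]))
print(one_down)
print(two_down)
```

With one site offline only the warning recipients are notified. With two offline, the operations group also receives the single critical alert for the web application.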
The concepts in this approach do not apply only to monitoring whether the websites in a web farm are up or down. They can be used in any distributed application, and are also commonly used to roll up health when monitoring a website with synthetic transactions from multiple watcher nodes, so that alerts are raised only when multiple watcher nodes find that a site is not available. In a later blog article I will review how we map distributed applications such as this one, which do not use default health rollups, into Savision Live Maps.
Summary: Distributed applications let Operations Manager model the reality of an environment more effectively. Because the pieces of a distributed application roll up according to health conditions that you define for your environment, the alerts raised on their health are more actionable.