Using OMS to visualize server health based on memory utilization

[Update 11/29/2017: This blog post series has been superseded by a solution built to visualize server and client information which is available at: http://blogs.catapultsystems.com/cfuller/archive/2017/11/28/updating-the-server-and-client-performance-solution-to-the-new-query-language/. Please note that query examples in this deprecated blog post are for the old query language and will not work in the current query language.]

 

This blog post is part of a series where we will look at Key Performance Indicators (KPIs) for servers and how OMS can be used to work with these KPIs to determine the health of a server.

In Operations Manager we determine the health of a system from a memory perspective based upon the percentage of committed memory in use and the available megabytes of memory. Details on this are below from the article on Operations Manager Key Performance Indicators available from Windows IT Pro:

“The Percentage of Committed Memory in Use monitor changes the server’s health state based on the percentage of memory committed on the system. This monitor’s healthy and critical states are defined as follows:

  • Critical state occurs when committed memory is greater than 80 percent for 6 minutes (after three samples on a 2-minute schedule).

Operations Manager also monitors the amount of memory still available on a server. The Available Megabytes of Memory monitor changes the server’s health state based on the number of available megabytes of memory on the system. This monitor’s healthy and critical states are defined as follows:

  • Critical state occurs when the available megabytes of memory falls below 2.5MB for 6 minutes (after three samples on a 2-minute schedule). By default, this value occurs only if a system is truly critical on memory. You might need to override the default value and set it to a larger number depending on your environment’s requirements. Figure 4 shows performance monitoring for a server that’s almost critical on memory but isn’t yet close to the 2.5MB default threshold. To better use this monitor, you should create an override to increase the threshold from 2.5MB to a larger value based on the amount of memory on the server. According to the TechNet article “System Level Bottlenecks,” a consistent value of less than 20 to 25 percent of installed RAM indicates insufficient memory.”
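The override guidance quoted above can be turned into a quick calculation. The sketch below is only an illustration: the function name, the server sizes and the choice of 20 percent as the default fraction are my assumptions, not values taken from Operations Manager.

```python
def available_mb_threshold(installed_ram_mb: float, fraction: float = 0.20) -> float:
    """Return an Available MBytes threshold as a fraction of installed RAM.

    The 20 to 25 percent range comes from the quoted guidance; the default
    of 0.20 used here is an assumption chosen for illustration.
    """
    if installed_ram_mb <= 0:
        raise ValueError("installed_ram_mb must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return installed_ram_mb * fraction


# Hypothetical server sizes, in MB of installed RAM.
for ram_mb in (4096, 16384, 65536):
    low = available_mb_threshold(ram_mb, 0.20)
    high = available_mb_threshold(ram_mb, 0.25)
    print(f"{ram_mb} MB installed: consider a threshold between {low:.0f} and {high:.0f} MB")
```

Even for a small server, the suggested value is far above the 2.5MB default, which is why the override matters.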

 

Adding performance counters to OMS:

In OMS, it is simple to add both of the memory counters. To add these counters, go to the Settings page, select the Data tab, and then open the Windows Performance Counters section.

In Operations Manager each of these counters is checked on a two-minute cycle. To match that cycle, change the sample interval shown above from 300 seconds down to 120 seconds. If these counters are not currently listed, add them and then use the Save option in the top left corner of the UI (shown below):

 

Developing queries for the counters:

Next we need to develop the queries which we will use for the memory information. The example below shows both of the relevant metrics for each of the systems in my lab that report them to this OMS workspace.

Type=Perf (ObjectName=Memory)

Once we choose Metrics, we can see the two counters (Memory\Available MBytes and Memory\% Committed Bytes In Use).

To see how frequently these counters are being gathered, change the time to show the last 6 hours.

At this size it is possible to highlight individual data points and see that they are in fact being gathered every 5 minutes (300 seconds) or every 2 minutes (120 seconds) or whatever you have set the frequency to for the data collection. An example of how these data points can be highlighted is shown below in the yellow circled area.

Next we restrict the data to the appropriate time range. For our example we are going to watch these counters for a total of 8 minutes (giving us at least 3 performance samples since they are collected every 2 minutes).

 

Querying Available Mbytes:

The queries below return the highest Available MBytes value for each computer within the specified timeframe, starting with the last 8 minutes.

Type=Perf ((ObjectName:Memory AND CounterName:"Available MBytes")) AND TimeGenerated>NOW-8MINUTES | Measure Max(CounterValue) as Counter by Computer

This unfortunately (currently) does not result in any data being returned. Later in this article we will review the reason behind that situation (see the “Important Note” section of this blog post for details).

If we move the timeframe up to 1 hour we see the results that we would expect.

Type=Perf ((ObjectName:Memory AND CounterName:"Available MBytes")) AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer
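To make it clearer what the Measure ... by Computer portion of these queries does, here is a rough Python sketch of the same grouping logic. It is not how OMS evaluates the query internally; the records, server names and the function name are hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical Perf records: (computer, counter name, counter value).
records = [
    ("server01", "Available MBytes", 1024.0),
    ("server01", "Available MBytes", 980.0),
    ("server02", "Available MBytes", 2.0),
    ("server02", "Available MBytes", 4.5),
    ("server02", "% Committed Bytes In Use", 84.9),
]


def measure_by_computer(rows, counter_name, aggregate):
    """Group values for one counter by computer and aggregate each group."""
    grouped = defaultdict(list)
    for computer, name, value in rows:
        if name == counter_name:
            grouped[computer].append(value)
    return {computer: aggregate(values) for computer, values in grouped.items()}


# Roughly equivalent to: Measure Max(CounterValue) as Counter by Computer
highest_available = measure_by_computer(records, "Available MBytes", max)
print(highest_available)
```

Passing min instead of max gives the shape of a Measure Min(CounterValue) query.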

 

Querying % Committed Bytes in Use:

The queries below return the lowest % Committed Bytes In Use value for each computer within the specified timeframe, first for the last 8 minutes and then for the last hour. The first query returns one value, but it does not represent each of the systems in the environment. Later in this article we will review the reason behind that situation (see the “Important Note” section of this blog post for details).

Type=Perf ((ObjectName:Memory AND CounterName:"% Committed Bytes In Use")) AND TimeGenerated>NOW-8MINUTES | Measure Min(CounterValue) as Counter by Computer

Type=Perf ((ObjectName:Memory AND CounterName:"% Committed Bytes In Use")) AND TimeGenerated>NOW-1HOUR | Measure Min(CounterValue) as Counter by Computer

 

Saving the searches:

Now that we have our working queries, most of the hard work is done. We can now visualize this information in the My Dashboard page. To do this we save our queries under a category that we will use throughout this blog series: “Server Health”.

 

Adding dashboard items:

To add dashboard items, go to the top page for OMS and choose the My Dashboard option.

Click the Customize button to add a new dashboard item.

Add the queries that you created as dashboard items. An example is below:

Click Customize again to save the new dashboard items in place.

 

Creating alerts:

We can also alert on these queries to provide a notification when a server is unhealthy. If we match the Operations Manager approach, this would be when % Committed Bytes in Use is greater than 80% and when Available MBytes is less than 2.5 (we’ll round to 3 for this example). For details on how to enable alerts, see the “Enabling the alerting preview” section of the following blog post: http://blogs.catapultsystems.com/cfuller/archive/2016/01/26/notifying-when-a-server-is-offline-based-on-when-agents-last-added-data-to-oms/.

 

% Committed Bytes in Use:

It’s critical to understand how alert rules actually work in OMS, and it’s not necessarily intuitive. Let’s go through an example. If I use the query which was put together for the dashboard on % Committed Bytes in Use:

Type=Perf ((ObjectName:Memory AND CounterName:"% Committed Bytes In Use")) AND TimeGenerated>NOW-1HOUR | Measure Min(CounterValue) as Counter by Computer

The result of the query is NOT 84.9 and 48. From the alert rule’s perspective the result of this query is 2, because the rule evaluates the number of results returned, not the counter values themselves. At first glance (ok, maybe first, second and third) this seems to negate the ability to use queries to write alert rules, but it actually does not.

The key to using the alert rules (at least currently) is to have a different version of the query where the answer of 1, or 2 or whatever is sufficient. Let’s take the example above and re-write the query for use in an alert rule. To do this we remove the Measure Min portion, and we add the criteria that we want to define for the counter. I am also removing extra ( and ) to clean it up a bit.

Type=Perf (ObjectName:Memory AND CounterName:"% Committed Bytes In Use" AND TimeGenerated>NOW-1HOUR AND CounterValue > 80)

The query above gets the correct object (Memory) using the correct counter (% Committed Bytes in Use), limited to the last hour of data gathered, and only returns a record if its value is greater than 80, which is our threshold for the % Committed Bytes in Use counter. A good trick to remember is to save these queries to your favorites so that you can find them later. In this case, a name like “% Committed Bytes in Use – Alert” makes it easy to find.
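Because this distinction is easy to miss, here is a small Python sketch of the idea: the alert condition looks at how many records match, not at the counter values. The sample values 84.9 and 48 come from the example above; the extra sample, the server names, the function names and the "one or more results" condition are assumptions for illustration only.

```python
THRESHOLD = 80.0  # % Committed Bytes In Use threshold used in the alert query

# Hypothetical samples from the last hour: (computer, counter value).
samples = [
    ("server01", 84.9),
    ("server01", 86.2),
    ("server02", 48.0),
]


def matching_results(rows, threshold):
    """Mimic the 'CounterValue > 80' filter: keep only records above the threshold."""
    return [(computer, value) for computer, value in rows if value > threshold]


def should_alert(rows, threshold, minimum_results=1):
    """Alert on the NUMBER of matching records, not on their values."""
    return len(matching_results(rows, threshold)) >= minimum_results


print(len(matching_results(samples, THRESHOLD)), should_alert(samples, THRESHOLD))
```

With the threshold inside the query, any non-zero result count means at least one sample crossed the line, which is all the alert rule needs to know.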

To configure this alert we use the following alert rule:

[Image: memory02]

 

Available Mbytes:

Let’s take the Available Mbytes query and go through this same process. For the dashboard we had the following:

Type=Perf ((ObjectName:Memory AND CounterName:"Available MBytes")) AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer

Once we clean it up, remove the Measure Max section and add the threshold (3), it looks like this:

Type=Perf (ObjectName:Memory AND CounterName:"Available MBytes" AND TimeGenerated>NOW-1HOUR AND CounterValue < 3)

For this one we save the search as “Available Mbytes – Alert” to make it easy to find.

Our alert rule activates when there are one or more results for this condition.

[Image: memory03]

 

 

Important note: Based upon my test results and my communications with a few of the other CDM-focused MVPs, it appears that while we can currently gather data in near real time (NRT), the queries are not yet able to represent that data. The queries represent the data that is available once it has been indexed, which occurs every 30 minutes. So why did I set the query to every hour instead of every 8 minutes? This period of time provides at least two data points which will exist after aggregation (one every 30 minutes). So in reality the approach in this blog will provide notification when memory is insufficient for a server over a one-hour period, and it will notify you every 15 minutes when these conditions apply.
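The reasoning behind the one-hour window can be checked with a little arithmetic. The sketch below uses a deliberately simplified model that is my assumption, not documented OMS behavior: data points become queryable only at fixed 30-minute marks, and a query sees the marks that fall inside its time window.

```python
def indexed_points_in_window(now_minute: int, window_minutes: int,
                             index_interval: int = 30) -> int:
    """Count indexing marks t (multiples of index_interval) with
    now_minute - window_minutes < t <= now_minute.

    Simplified model: assumes indexing happens exactly on each interval mark.
    """
    start = now_minute - window_minutes
    return now_minute // index_interval - start // index_interval


# Check every minute of a hypothetical day for both window sizes.
minutes_in_day = range(0, 24 * 60)
eight_minute_counts = {indexed_points_in_window(m, 8) for m in minutes_in_day}
one_hour_counts = {indexed_points_in_window(m, 60) for m in minutes_in_day}

print("8-minute window sees:", sorted(eight_minute_counts), "indexed points")
print("1-hour window sees:", sorted(one_hour_counts), "indexed points")
```

Under this model an 8-minute window often contains no indexed data at all, while a one-hour window always contains two points, which matches what the queries earlier in this post returned.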

 

Summary: The approach explained in this blog post shows an example of how OMS can be used to provide similar functionality in terms of server monitoring to what we have been working with in Operations Manager. Once the queries can be updated to reflect data gathered on a smaller time increment this will be extremely similar to the functionality which we are used to within Operations Manager.

Thank you to Tao, Stan and Pete for their knowledge of how near real-time performance counters work in OMS. This was invaluable to my work on this approach!

In the next part of this blog series I will be looking at server health from a free disk space perspective in OMS.
