Gotchas while using OMS for server key performance indicators
[Updated 11/29/2017 for the new query language for Log Analytics]
I have written several blog posts on methods for visualizing server health in Microsoft OMS (www.microsoft.com/oms). These include:
- What could server health monitoring look like in OMS?
- Using OMS to visualize server health based on processor utilization
- Using OMS to visualize server health based on free disk space
- Using OMS to visualize server health based on memory utilization
- Approaches available to provide heartbeat alerts in OMS
Today’s blog post wraps up that topic with a brief discussion of the “gotchas” that I ran into while working with these approaches. These include: false alerts due to no data being collected, crossing the 500 MB/day limit on a free tier workspace, and an Operations Manager management group that had issues reporting data into OMS.
False alerts due to no data collected:
If no data is collected during the query timeframe, the dashboards in these blog posts will show a value of 0, which causes the alert to fire: a value of 0 is lower than the threshold specified in the alert.
By using a query like either of these we can see that no data got collected during that hour…
Because no data was collected this caused an alert to fire as shown below.
Be aware that if no data is collected by OMS and you have thresholds which alert based upon low value conditions (such as the one above) you will get false alerts when no data is gathered.
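One way to check whether a 0 on a dashboard reflects missing data rather than a genuinely low value is to count the samples collected in the window. The following is a minimal sketch, not the query from the original screenshots. It assumes the Perf table and the LogicalDisk "% Free Space" counter, which may differ from the counters collected in your workspace:

```kusto
// Assumption: free disk space is collected as the LogicalDisk "% Free Space"
// performance counter. Swap in the object and counter names you actually collect.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"
// Samples shows how many data points back the average for each computer.
| summarize Samples = count(), AvgFreeSpace = avg(CounterValue) by Computer
```

If a computer is missing from the results, no samples were collected for it in the last hour, so a low-value alert for that window would reflect the data gap rather than the disk itself.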
Crossing the 500 MB/day limit:
The free tier of OMS has a built-in 500 MB/day data limit. A sample usage going beyond that 500 MB level is shown below.
If you cross your 500 MB of data per day you can upgrade to a higher tier of OMS (standard or premium). This is done in the top right corner of the OMS UI:
In my case I determined that the security data I was gathering was the root cause of exceeding my daily limit (see the graphic below), so I needed to limit my security data usage. See this blog post for how I went about it: http://blogs.catapultsystems.com/cfuller/archive/2016/02/19/an-example-of-how-to-filter-what-security-data-goes-to-oms/
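To see which data types contribute most to the daily volume, the Usage table can be summarized by data type. This is a rough sketch, assuming that Usage records report Quantity in MB and carry a DataType column:

```kusto
// Assumption: Quantity in the Usage table is expressed in MB.
Usage
| where TimeGenerated > ago(1d)
// Total the volume per data type over the last day, largest first.
| summarize TotalMB = sum(Quantity) by DataType
| sort by TotalMB desc
```

A data type that dominates the top of this list (security events, in my case) is the first candidate for filtering.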
Management group issues reporting to OMS:
I ran into a situation where one of my management groups that was connected to OMS was no longer reporting data into OMS. This can be checked on the connected sources tab in settings, where you can see when data last flowed into the system from either directly connected agents or from a management group connected to OMS. In the example below the Operations Manager environment had not provided any data in 2 hours.
In this case my Operations Manager environment was not responding and needed to be restarted. After the reboot the management group was now reporting data as shown below.
In theory you could alert on this condition by using the query identified above and alerting if data from the management group has not been recorded in the last hour.
Started with this:
search Computer !in ("-") | summarize LastData = max(TimeGenerated) by Computer | where LastData < ago(1h) | count
If the query result is less than 1, then no data is being written by this management group into OMS, and an alert could be generated to notify of that condition.
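Note that the starting query, as written, counts computers whose most recent record is older than one hour, so a count of 1 or more flags stale computers. For a "less than 1" threshold to indicate missing data, the query needs to count computers that have reported recently instead. A minimal sketch of that variant follows. It assumes the Heartbeat table is populated for the agents in the management group and that its ManagementGroupName column identifies the group; "MyManagementGroup" is a placeholder, not a name from my environment:

```kusto
// Assumption: Heartbeat records exist for agents in the management group.
// "MyManagementGroup" is a hypothetical name; replace it with your own.
Heartbeat
| where TimeGenerated > ago(1h)
| where ManagementGroupName == "MyManagementGroup"
// One row per computer that has reported within the last hour.
| summarize LastData = max(TimeGenerated) by Computer
| count
```

With this form, a count of 0 means that no computer in the management group has written a heartbeat in the last hour.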
Summary: OMS provides many capabilities that lend themselves well to monitoring server health. The links at the top of this blog post provide examples of how these can function. When implementing solutions like this, be sure to keep an eye out for conditions where data fails to flow to OMS, as these may trigger unexpected alerts depending on how you defined them.