Approaches available to provide heartbeat alerts in OMS
[Update 11/29/2017: This blog post series has been superseded by a solution built to visualize server and client information which is available at: http://blogs.catapultsystems.com/cfuller/archive/2017/11/28/updating-the-server-and-client-performance-solution-to-the-new-query-language/. Please note that query examples in this deprecated blog post are for the old query language and will not work in the current query language.]
The goal of this series was to provide potential approaches for alerting when servers are offline and not reporting to OMS. For background, the three blog posts are as follows:
- Notifying when a server is offline with OMS using the pre-built report (existing report)
- Notifying when a server is offline based on when agents last added data to OMS (last written data)
- Notifying when a server is offline based on Operations Manager event log entries (OpsMgr event)
While building these approaches I found a few “gotchas” to be aware of. This blog post discusses them for the existing report approach, the last written data approach, and the OpsMgr event approach, along with some general gotchas I have seen while working with these.
Existing report:
- The existing report only sends information once a week about the systems which are not reporting to OMS.
- There may be an upper bound on how many non-reporting systems the report lists; my lab environment does not have enough systems added to OMS to validate this one way or the other.
Last written data:
- This approach provides a single notification when the condition occurs and it will not recur. As an example, if an agent was already offline for two hours then the alert rule would not fire.
- The alert rule notification is limited to 10 results, so if more than 10 systems are not reporting to OMS you will need to open OMS and query to identify the full list of systems which are not reporting.
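To make the 10-result limit concrete, here is a minimal Python sketch. It is not an OMS query and does not call any OMS API; the function names, the 15-minute threshold, and the sample server names are all assumptions made up for illustration. It only shows why a capped notification can hide some of the systems which are not reporting.

```python
from datetime import datetime, timedelta

# Cap on results included in a notification, as described in the text above.
MAX_RESULTS_IN_NOTIFICATION = 10


def stale_agents(last_seen, now, threshold):
    """Return the sorted names of agents whose newest data is older than `threshold`."""
    return sorted(name for name, ts in last_seen.items() if now - ts > threshold)


def build_notification(stale, limit=MAX_RESULTS_IN_NOTIFICATION):
    """Split the stale list into the names that fit and a count of those left out."""
    shown = stale[:limit]
    omitted = len(stale) - len(shown)
    return shown, omitted


# Hypothetical data: 12 agents silent for 30 minutes, 1 agent that reported 5 minutes ago.
now = datetime(2017, 1, 1, 12, 0)
last_seen = {f"server{i:02d}": now - timedelta(minutes=30) for i in range(12)}
last_seen["healthy01"] = now - timedelta(minutes=5)

# The 15-minute threshold is an assumed value, not one taken from the alert rule.
stale = stale_agents(last_seen, now, threshold=timedelta(minutes=15))
shown, omitted = build_notification(stale)
print(f"{len(shown)} listed, {omitted} omitted")
```

With this sample data the notification would list 10 systems and leave 2 out, which is the situation where you would need to open OMS and run the query yourself.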
OpsMgr event:
- This approach requires that the agent reports not only to OMS but also to Operations Manager, so it only works on agents which are either multihomed to OMS and OpsMgr or in environments where Operations Manager is integrated with OMS.
- This approach is designed to work for a single agent versus all agents in the OMS workspace.
- This approach will only work if the OpsMgr agent is working successfully in the Operations Manager environment, since that agent's event is what the alert rule looks for in the logs.
General:
It’s important to be aware that both the last written data and OpsMgr event approaches are dependent upon data being written into OMS. This means that if any of the following conditions occurs, you will receive notifications until the issue is resolved:
- If you have a subscription with a spending limit and the subscription runs out, OMS stops writing data. Both the last written data and OpsMgr event approaches then have no data and will therefore fire their alert rules, and these alerts will continue to fire until the issue is resolved. Because the OpsMgr event approach notifies every 15 minutes, this can generate a lot of email.
- If you are using the 500 MB free tier of OMS and you go beyond 500 MB in a day, I expect that the alert rules will fire.
- Internet connectivity: If your systems cannot report to OMS, they will be reported by all three of the approaches listed above.
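As a rough sense of how much email a repeating alert can produce during one of these conditions, here is a small Python estimate. It assumes one email per alert rule per evaluation and uses the 15-minute interval mentioned above; the function name and the rule count are hypothetical, so treat the numbers as a back-of-the-envelope sketch only.

```python
def expected_notifications(outage_minutes, interval_minutes=15, rule_count=1):
    """Estimate emails sent while repeating alert rules keep firing.

    Assumes each rule sends exactly one email per completed evaluation interval.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    evaluations = outage_minutes // interval_minutes  # completed evaluation cycles
    return evaluations * rule_count


one_day = 24 * 60
print(expected_notifications(one_day))                # 96
print(expected_notifications(one_day, rule_count=5))  # 480
```

Since the OpsMgr event approach is set up per agent, `rule_count` here stands for the number of critical systems you have created an alert rule for.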
Summary: After working with these three options I believe that each of them has its place. The built-in report provides a way to check weekly on the health of the agents that you have in OMS so you can resolve systems which are not responding. The approach using Operations Manager event entries provides a method which will notify you (and continue to notify you) about a critical system if you already have that system reporting to Operations Manager. The approach based on when data was last added to OMS gives a simple non-repeating alert which notifies you of systems which are not reporting.