Creating next generation queries for CPU and memory KPIs in Log Analytics

Recently we had a requirement to provide more than basic CPU threshold queries for Log Analytics. We have been watching the upcoming dynamic threshold functionality to see whether it will cover what we need, but it appears to be available only for systems running in Azure. For our on-prem systems, we developed the following queries, which alert when a server is over or under a specific threshold for a specific percentage of the samples within a specific timeframe. Examples:

  • Notify when a server is over 90% CPU for more than 70% of the past 10-minute timeframe.
  • Notify when a server is over 95% CPU for more than 99% of the past 60-minute timeframe.
  • Notify when a server is under 600 Mbytes of available memory for more than 90% of the past 60-minute timeframe.

This blog post shows sample queries around CPU and memory thresholds for virtual machines, but the same queries can be used for any performance counter in Log Analytics.

Monitoring Processor Health

If we want to look at the CPU usage for a system, we can use a query like this one, which shows a specific system’s % CPU over the last hour for each instance of the counter on that system (0, 1, 2, 3, _Total):

let AssessTime = 60m;
Perf
| where CounterName == "% Processor Time" and TimeGenerated > ago(AssessTime) and Computer contains "XYZ"

If we render this data as a Stacked Column by the InstanceName we see the following results:
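One way to get such a chart straight from the query is sketched below. The 5-minute bin size and the averaging per bin are our own assumptions for illustration, and "XYZ" remains a placeholder for the computer name.

```kusto
// Sketch: average % Processor Time per 5-minute bin, one series per counter instance.
// The bin size (5m) and the use of avg() are illustrative choices, not requirements.
Perf
| where CounterName == "% Processor Time" and TimeGenerated > ago(1h) and Computer contains "XYZ"
| summarize AvgCpu = avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| render columnchart with (kind=stacked)
```

Each column then stacks the instances (0, 1, 2, 3, _Total) for one time bin.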

Below is the query for the “% Processor Time” counter. It identifies computers whose value was above 95% for more than 99% of the samples collected over the last hour.

Next Generation CPU query

let AssessTime = 60m;
let CounterThreshold = 95;
let CounterThresholdPct = 99;
Perf
| where (ObjectName == "Processor" or ObjectName == "System") and CounterName == "% Processor Time" and TimeGenerated > ago(AssessTime)
// countif counts the samples over the threshold; dcountif would count distinct values and undercount repeated readings such as a steady 100
| summarize CpuOverLimit = countif(CounterValue > CounterThreshold), PerfInstanceCount = count() by Computer
| extend PctOver = round(100.0 * CpuOverLimit / PerfInstanceCount)
| where PctOver > CounterThresholdPct

The query can be adapted to whatever configuration you need by changing three values:

  • AssessTime = How long of a timeframe (10 minutes or 60 minutes in the examples above)
  • CounterThreshold = What is the threshold for the counter we are watching (90% CPU, or 95% CPU in the examples above)
  • CounterThresholdPct = What percentage of the samples in the timeframe must be beyond the CounterThreshold (70% or 99% in the examples above)
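As an illustration, the first example from the introduction (over 90% CPU for more than 70% of the past 10 minutes) could be expressed as the sketch below. The column names SamplesOver and SampleCount are our own, and the sketch assumes that every collected sample should count toward the percentage, which is why it uses countif.

```kusto
// Sketch: over 90% CPU for more than 70% of the samples in the past 10 minutes.
let AssessTime = 10m;            // timeframe to assess
let CounterThreshold = 90;       // % CPU threshold
let CounterThresholdPct = 70;    // percentage of samples that must exceed the threshold
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time" and TimeGenerated > ago(AssessTime)
| summarize SamplesOver = countif(CounterValue > CounterThreshold), SampleCount = count() by Computer
| extend PctOver = round(100.0 * SamplesOver / SampleCount)
| where PctOver > CounterThresholdPct
```

Only the three let statements need to change to move between the 10-minute and 60-minute variants.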

A sample result set is shown below (with CounterThreshold and CounterThresholdPct updated so there is sample data):

This query approach only alerts when a counter is above a threshold for a percentage of the data points over a specified timeframe. This should result in a much more targeted alert, i.e., one that tells us when the CPU really is a bottleneck.
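One detail worth checking when adapting the query is how the samples over the threshold are counted. The sketch below uses a small made-up datatable (SRV01 and SRV02 are hypothetical servers, and the values are invented) to compare countif, which counts samples, with dcountif, which estimates the number of distinct values.

```kusto
// Sketch with invented data: SRV01 is pegged at exactly 100, SRV02 varies.
let CounterThreshold = 95;
datatable(Computer:string, CounterValue:double)
[
    "SRV01", 100.0, "SRV01", 100.0, "SRV01", 100.0, "SRV01", 100.0,
    "SRV02", 96.5,  "SRV02", 97.1,  "SRV02", 40.2,  "SRV02", 98.3
]
| summarize SamplesOver = countif(CounterValue > CounterThreshold),
            DistinctValuesOver = dcountif(CounterValue, CounterValue > CounterThreshold),
            SampleCount = count()
    by Computer
| extend PctOverBySamples = round(100.0 * SamplesOver / SampleCount),
         PctOverByDistinct = round(100.0 * DistinctValuesOver / SampleCount)
```

For SRV01 all four samples are over the threshold, but they share a single distinct value, so the distinct-based percentage should come out far lower than the sample-based one. For SRV02 the three readings over the threshold are all different, so the two percentages should agree. A server that sits at a constant reading is exactly the case an alert should catch, which is why counting samples seems the safer choice here.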

Monitoring Memory Health

If we want to look at the memory usage for a system, we can use a query like this one, which shows a specific system’s available memory over the last hour. In this example, the available memory is consistently below the threshold of 700 Mbytes.

let AssessTime = 60m;
Perf
| where CounterName == "Available Mbytes" and TimeGenerated > ago(AssessTime) and Computer contains "XYZ"

Below is a variation of the CPU query above, rewritten for the available memory counter. The changes are the counter name, the direction of the comparison (less than instead of greater than), the threshold values, and the output column names. This query looks at the “Available Mbytes” counter and identifies computers whose value was below 700 Mbytes for more than 90% of the samples collected over the last hour.

Next Generation memory query

let AssessTime = 60m;
let CounterThreshold = 700;
let CounterThresholdPct = 90;
Perf
| where CounterName == "Available Mbytes" and TimeGenerated > ago(AssessTime)
// countif counts the samples under the threshold; dcountif would count distinct values and undercount repeated readings
| summarize MemoryUnderLimit = countif(CounterValue < CounterThreshold), PerfInstanceCount = count() by Computer
| extend PctUnder = round(100.0 * MemoryUnderLimit / PerfInstanceCount)
| where PctUnder > CounterThresholdPct

A sample result set is shown below:

This query approach only alerts when a counter is below a threshold for a percentage of the data points over a specified timeframe. This should result in a much more targeted alert, i.e., one that tells us when memory really is a bottleneck.

Summary: The sample queries in this blog post (see the “Next Generation CPU query” and “Next Generation memory query” sections) should provide much more actionable alerting for these two server KPIs than a simple threshold check. Additionally, the same pattern can be used for any performance metric that you gather into Log Analytics!
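As a sketch of how the pattern might carry over to another counter, the query below applies it to free disk space. It assumes the “% Free Space” counter of the “LogicalDisk” object is being collected in your workspace, and the 10% threshold is only an illustrative value.

```kusto
// Sketch: disks under 10% free space for more than 90% of the samples in the past hour.
// Assumes the LogicalDisk "% Free Space" counter is collected; threshold values are illustrative.
let AssessTime = 60m;
let CounterThreshold = 10;
let CounterThresholdPct = 90;
Perf
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space" and InstanceName != "_Total" and TimeGenerated > ago(AssessTime)
| summarize SamplesUnder = countif(CounterValue < CounterThreshold), SampleCount = count() by Computer, InstanceName
| extend PctUnder = round(100.0 * SamplesUnder / SampleCount)
| where PctUnder > CounterThresholdPct
```

Grouping by InstanceName as well as Computer keeps each drive separate, so one full drive is not averaged away by the others.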

P.S. I owe a huge shout-out to Thomas Forbes for his development of the CPU query contained in this blog post. Way to go Thomas!
