Digging deeper into the Windows Server management pack: Tuning the Total Percentage Interrupt Time is too high monitor
The “Total Percentage Interrupt Time is too high” monitor was generating a significant number of alerts on a specific system in our environment. The server causing the alerts was a backup server that coordinates both local and remotely connected backup components. For background, the % Interrupt Time counter indicates how much of its time the processor spends handling interrupts. The following is a subset of the information from: http://technet.microsoft.com/en-us/library/cc750586.aspx and http://technet.microsoft.com/en-us/library/cc768048.aspx
# Interrupt tuning: Search for Wasteful Hardware Components
For enterprise servers, do not be surprised if Processor: % Interrupt Time is around five to twenty percent of the total CPU workload, particularly if the network and disk I/O is very heavy. There are, however, some components that behave better than others. Attempt to obtain NICs that truly support bus mastering and disk host bus adapters that support DMA transfers versus PIO. Some disk and network adapters operate more efficiently than others, thus they require less CPU cycles to operate. Trade magazines provide numerous comparisons of these products. One good source of information is the NT Magazine web site located at: http://www.ntmag.com.
# Remove Faulty Hardware Components
When a hardware device, such as a NIC, interrupts the processor, NT Server’s Interrupt Handler will execute to handle the condition, usually by signaling I/O completion and possibly issuing another pending I/O request. Observe the Perfmon counter Processor: Interrupts/sec. If this number begins to grow compared to your baseline when under normal workload, there is a good possibility that a network device has become faulty. When a device becomes faulty, it may begin generating high numbers of interrupts, which inundates the CPU. This wastes precious CPU cycles. Replacing the faulty device will alleviate this situation.
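The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration, not something from the TechNet articles: the function names, the `growth_factor` of 2.0, and the sample values are all hypothetical, and the samples are assumed to have been exported from Perfmon as plain numbers.

```python
from statistics import mean


def interrupt_rate_growth(baseline, current):
    """Return the ratio of the current mean Interrupts/sec to the baseline mean."""
    if not baseline or not current:
        raise ValueError("both sample lists must be non-empty")
    base = mean(baseline)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return mean(current) / base


def looks_faulty(baseline, current, growth_factor=2.0):
    """Flag the server when the interrupt rate has grown past growth_factor.

    growth_factor is an assumed cut-off; pick one that suits your environment.
    """
    return interrupt_rate_growth(baseline, current) >= growth_factor


# Hypothetical Processor: Interrupts/sec samples, not real measurements.
baseline_samples = [1000, 1200, 1100, 900]
current_samples = [3000, 3300, 3600, 3300]

if looks_faulty(baseline_samples, current_samples):
    print("Interrupts/sec has grown well past the baseline; check the hardware.")
```

A ratio is used rather than an absolute number because a healthy interrupt rate differs widely from one server to the next.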
Processor : % Interrupt Time. This is the percentage of time that the processor is spending on handling Interrupts. Generally, if this value exceeds 50% of the processor time you may have a hardware issue. Some components on the computer can force this issue and not really be a problem. For example a programmable I/O card like an old disk controller card, can take up to 40% of the CPU time. A NIC on a busy IIS server can likewise generate a large percentage of processor activity.
However, there were no indications of errors in the system log on the server. The server generating the alerts is a backup server, which is intensive on both network and disk I/O, so it matches the condition described above. The server is not a virtual machine, so the existing tuning information for virtual machines does not apply.
The simple way to determine the actual interrupt time is to open the monitor generating the alert and see what values appear each time the state changes to critical. On our server, the average was generally less than 12.
The product knowledge offers a view that should show this performance counter trended over time, but no data appears when the “Start Processor % Interrupt Time Performance View” is executed.
This performance view has no information because the rule that gathers the information is disabled by default. That makes sense in general (why store something in the Operations Manager database and the Data Warehouse unless you need the information?), but we wanted this information so we could better determine what the actual trend looks like for this server.
To enable this counter, we created an override to enable the rule that collects the performance information. (We also changed the tolerance sample from 5 to 1 because we were not seeing data in the graph, but this is not shown in the graphic below.)
Next we trended this server over time to determine an actual regular baseline for the server. If the server is functioning normally during this timeframe, we can then raise the monitor's threshold to match what we observed.
After watching this counter for a longer duration, we got a better picture of what the normal levels were for this counter on this server.
As a result of determining the trend for the server (and verifying with the server owners that the server was performing as expected during the period we were monitoring this counter), we created an override for the server that sets the threshold to 20, based on the performance history for this counter on this server (shown below).
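One way to turn a trended counter into a candidate threshold is sketched below. This is not how we arrived at our value (we read it from the performance view); it is a minimal illustration that assumes the samples have been exported as plain numbers. The function names, the 95th percentile, the 25% headroom, the rounding step, and the sample values are all hypothetical choices.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile of samples (pct is a whole number from 1 to 100)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_threshold(samples, pct=95, headroom=1.25, round_to=5):
    """Suggest a monitor threshold from samples taken during a healthy period.

    Takes a high percentile of the healthy samples, adds headroom so normal
    peaks do not alert, then rounds up to a multiple of round_to.
    """
    raw = percentile(samples, pct) * headroom
    return math.ceil(raw / round_to) * round_to


# Hypothetical % Interrupt Time samples, not our actual server data.
healthy_samples = [8, 9, 11, 10, 12, 14, 9, 13, 15, 11]

print("average:", mean(healthy_samples))
print("suggested threshold:", suggest_threshold(healthy_samples))
```

Whatever value a calculation like this suggests, confirm with the server owners that the server really was healthy during the sampled period before applying the override.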
Once we had this override in place, we removed the original override for the rule which gathered the counter.
Summary: The Total Percentage Interrupt Time monitor can be tuned on a per-server basis by determining what the trend is for the performance counter during a healthy period of time and then creating an override for the server to the new threshold.